Original link: https://xuanwo.io/2023/02-how-opendal-read-data/
With the continuous development of the OpenDAL community, new abstractions are constantly increasing, which brings a lot of burden for new contributors to participate in the development. Many maintainers hope to have a deeper understanding of the internal implementation of OpenDAL. At the same time, the core design of OpenDAL has not changed significantly for a long time, making it possible to write a series of internal implementations. I think it’s time to write a series of articles about the internal implementation of OpenDAL, explaining how OpenDAL is designed, implemented and extended from the maintainer’s point of view. As OpenDAL v0.40 is about to be released, I hope this series of articles can better help the community understand the past, grasp the present, and determine the future.
The first article will first talk about the most commonly used data reading function of OpenDAL. I will start from the outermost interface, and then gradually expand according to the calling order of OpenDAL. let’s start!
overall framework
Before we start to introduce the specific OpenDAL interface, let’s get familiar with the OpenDAL project first.
OpenDAL is an Apache Incubator project designed to help users access data conveniently and efficiently in a unified way from various storage services. Its project vision is “free access to data”:
- Free from services: Any service can be freely accessed through native interfaces
- Free from implementations: Regardless of the underlying implementation, it can be called in a uniform way
- Free to integrate: Can freely integrate with various services and languages
- Free to zero cost: users do not need to pay for functions that are not used
Based on this concept, OpenDAL Rust Core can be mainly divided into the following components:
- Operator: the outer interface exposed to users
- Layers: concrete implementation of different middleware
- Services: the specific implementation of different services
So from a macro point of view, OpenDAL’s data reading call stack will look like this:
All Layers and Services have implemented a unified Accessor interface, and all type information will be erased when constructing an Operator. For Operator, no matter what service the user uses or how much middleware is added, all invocation logic is consistent. This design splits the OpenDAL API into two layers: Public API and Raw API. The Public API is directly exposed to users and provides an easy-to-use upper-layer interface, while the Raw API is provided for OpenDAL internal developers to maintain a unified internal interface, and provides some convenience implementations.
Operator
OpenDAL’s Operator API will follow a consistent calling paradigm as much as possible to reduce users’ learning and usage costs. Taking read
as an example, OpenDAL provides the following APIs:
-
op.read(path)
: read the entire content of the specified file -
op.reader(path)
: Create a Reader for streaming reading -
op.read_with(path).range(1..1024)
: read the file content using the specified parameter, such as range -
op.reader_with(path).range(1..1024)
: Use the specified parameters to create a Reader for streaming reading
It is not difficult to see that read
is more like a syntactic sugar, which is used to facilitate users to quickly read files without considering various traits such as AsyncRead
. reader
on the other hand, gives users more flexibility and implements traits widely used by communities such as AsyncSeek
and AsyncRead
, allowing users to read data more flexibly. read_with
and reader_with
help users specify various parameters in a more natural way through the Future Builder series of functions.
The logic inside the Operator will look like this:
Its main job is to encapsulate the interface for users:
- Complete the construction of
OpRead
- Call the
read
function provided byAccessor
- Wrap the returned value as
Reader
and implement interfaces such asAsyncSeek
andAsyncRead
on the basis ofReader
Layers
Here is a little hidden secret that OpenDAL will automatically add some Layers to the Service to implement some internal logic. As of the completion of this article, the Layers automatically added by OpenDAL include:
-
ErrorContextLayer
: Inject context information for all errors returned by Operation, such asscheme
,path
, etc. -
CompleteLayer
: Complete the necessary capabilities for the service, such as adding seek support for s3 -
TypeEraseLayer
: implement type erasure, erase the associated types inAccessor
uniformly, so that users do not need to carry generic parameters when using
ErrorContextLayer
and TypeEraseLayer
here are relatively simple and will not be described in detail. The focus is on CompleteLayer
, which aims to add seek
support to Reader
returned by OpenDAL in a zero-overhead manner, so that users do not need to repeat the implementation. OpenDAL returns Reader
and SeekableReader
through different function calls in the early version, but the actual feedback from users is not very good, almost all users are using SeekableReader
. Therefore, OpenDAL added seek support as the first priority to the internal Read
trait in the subsequent refactoring:
pub trait Read: Unpin + Send + Sync {
/// Read bytes asynchronously. fn poll_read ( & mut self, cx: & mut Context < '_ > , buf: & mut [ u8 ]) -> Poll < Result < usize >> ;
/// Seek asynchronously. /// /// Returns `Unsupported` error if underlying reader doesn't support seek. fn poll_seek ( & mut self, cx: & mut Context < '_ > , pos: io ::SeekFrom) -> Poll < Result < u64 >> ;
/// Stream [`Bytes`] from underlying reader. /// /// Returns `Unsupported` error if underlying reader doesn't support stream. /// /// This API exists for avoiding bytes copying inside async runtime. /// Users can poll bytes from underlying reader and decide when to /// read/consume them. fn poll_next ( & mut self, cx: & mut Context < '_ > ) -> Poll < Option < Result < Bytes >>> ;
}
To realize the reading capability of a service in OpenDAL, you need to implement this trait, which is an internal interface and will not be directly exposed to users. Among them:
-
poll_read
is the most basic requirement, and all services must implement this interface. - When the service natively supports
seek
,poll_seek
can be implemented, and OpenDAL will perform correct dispatch, such as local fs; - And when the service natively supports
next
, that is, returns streaming Bytes,poll_next
can be implemented. For example, HTTP-based services, their bottom layer is a TCP Stream, and hyper will encapsulate it into a bytes stream.
Through Read
trait, OpenDAL ensures that all services can expose their native support capabilities as much as possible, so as to provide efficient reading for different services.
On the basis of this trait, OpenDAL will complete according to the capabilities supported by each service:
- Both seek/next are supported: return directly
- does not support next: use
StreamableReader
for encapsulation to simulate next support - Seek is not supported: use
ByRangeSeekableReader
for encapsulation to simulate seek support - Seek/next are not supported: perform two encapsulations at the same time
ByRangeSeekableReader
mainly uses the ability of the service to support range read. When the user seeks, it drops the current reader and initiates a new request at the specified position.
OpenDAL exposes a unified Reader implementation through CompleteLayer
. Users don’t need to consider whether the underlying service supports seek. OpenDAL will always choose the optimal way to initiate the request.
Services
After the completion of Layers, it is time to call the specific implementation of Service. Here, the two most common types of services, fs
and s3
, are used to illustrate how to read data.
Service fs
tokio::fs::File
implements tokio::AsyncRead
and tokio::AsyncSeek
. By using async_compat::Compat
, we convert it into futures::AsyncRead
and futures::AsyncSeek
. On this basis, we provide the built-in function oio::into_read_from_file
to convert it into a type that implements oio::Read
, and the final type name is: oio::FromFileReader<Compat<tokio::fs::File>>
.
There is nothing particularly complicated in the implementation of oio::into_read_from_file
. Read and seek are basically calling functions provided by the incoming File type. The more troublesome part is the correct handling of seek and range: seek to the right of the range is allowed, and no error will be reported at this time, and read will only return empty, but seeking to the left of the range is illegal, and Reader must return InvalidInput
In order to facilitate the correct processing of the upper layer.
Interesting history: there was a problem when this block was implemented, and it was discovered in the fuzz test.
Services s3
S3 is an HTTP-based service. Opendal provides a large number of HTTP-based packages to help developers reuse logic. It only needs to construct the request and return the constructed Body. OpenDAL Raw API encapsulates a set of reqwest-based interfaces, and the HTTP GET interface will return a Response<IncomingAsyncBody>
:
/// IncomingAsyncBody carries the content returned by remote servers. pub struct IncomingAsyncBody {
/// # TODO /// /// hyper returns `impl Stream<Item = crate::Result<Bytes>>` but we can't /// write the types in stable. So we will box here. /// /// After [TAIT](https://rust-lang.github.io/rfcs/2515-type_alias_impl_trait.html) /// has been stable, we can change `IncomingAsyncBody` into `IncomingAsyncBody<S>`. inner: oio ::Streamer,
size: Option < u64 > ,
consumed: u64 ,
chunk: Option < Bytes > ,
}
The stream contained in the body is the bytes stream returned by reqwest, and based on this, opendal has implemented content length check and read support.
Here is an additional small pit about reqwest/hyper: reqwets and hyper do not check the returned content length, so an illegal server may return a data volume that does not match the expected content length instead of reporting an error, which will lead to data behavior Not as expected. OpenDAL specifically adds checks here, returns ContentIncomplete
when the data is insufficient, and returns ContentTruncated
when the data exceeds expectations, to prevent users from receiving illegal data.
Summarize
This article introduces how OpenDAL implements data reading from top to bottom:
- Operator is responsible for exposing an easy-to-use interface to users
- Layers are responsible for completing the capabilities of the service
- Services is responsible for the specific implementation of different services
In the whole link, OpenDAL follows the principle of zero overhead as much as possible, giving priority to using the service’s native capabilities, and then considering other methods for simulation, and finally returns an unsupported error. Through the design of these three layers, users can easily call op.read(path)
to access data in any storage service without knowing the details of the underlying services, and without accessing SDKs of different services.
Here it is: How OpenDAL read data freely!
This article is transferred from: https://xuanwo.io/2023/02-how-opendal-read-data/
This site is only for collection, and the copyright belongs to the original author.