-
Notifications
You must be signed in to change notification settings - Fork 88
feat: integration s3 with arrow filesystem #548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for adding this!
Yes, I believe this is worth doing. I supposed to reuse
There is a related discussion with regard to |
|
I recommend using MinIO. It is relatively stable and suitable for the current project development phase. Once the community reaches a consensus, the cost of replacing MinIO will not be high. |
|
I think it is fine to use minio at this moment to unblock us. Let me know what you think on my proposed approach above. We might also need to add a FileIO registry to provide default implementation on us and enable users to override their own implementations of s3 and others. The key in the FileIO registry can be associated with table property |
|
We may also need to add top-level CMake options like |
FYI, there is a PR to replace MinIO with RustFS, apache/iceberg#14928 |
ArrowFileSystemFileIO is ok, I referenced MakeLocalFileIO and implemented a simple MakeS3FileIO interface using arrowfilesystem.
you mean this It's equivalent to setting the io-impl string in the catalog's properties. Then, RestCatalog the FileIORegistry looks up the implementation in the io-impl map. Is that roughly how it works? If so, I can try implementing some simple code to see if it's correct. |
|
Yes, I think it looks reasonable. |
8436b72 to
5197fa9
Compare
|
The current code is only simple implemented. Could you help me check it is ok? |
|
Thanks for the update! I'm busy with some internal stuff these days. Will try to review this as soon as possible. |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just took a quick pass on the latest commit. Can we simplify the implementation like this:
- Use a CMake option to enable S3.
- Define reserved iceberg properties for S3 and add functions to convert them to Arrow S3 options.
- To create a concrete S3FileIO, using Arrow API to create a S3FileSystem and wrap it by ArrowFileSystemFileIO.
- Register the factory to create FileIO of S3 to the registry before use.
- Add a file io utility to create the FileIO instance based on various condition.
| # Work around undefined symbol: arrow::ipc::ReadSchema(arrow::io::InputStream*, arrow::ipc::DictionaryMemo*) | ||
| set(ARROW_IPC ON) | ||
| set(ARROW_FILESYSTEM ON) | ||
| set(ARROW_S3 ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a cmake option ICEBERG_S3 and only toggle on ARROW_S3 when ICEBERG_S3 is on?
|
|
||
| #include <arrow/filesystem/filesystem.h> | ||
| #include <arrow/filesystem/localfs.h> | ||
| #if __has_include(<arrow/filesystem/s3fs.h>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add ICEBERG_S3 option, we don't need to deal with this check.
| impl_name = io_impl->second; | ||
| } else { | ||
| // Use default based on warehouse URI scheme | ||
| if (warehouse.rfind("s3://", 0) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using rfind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, shouldn't we use uri instead of warehouse property?
| /// This implementation is thread-safe as it creates a new FileSystem instance | ||
| /// for each operation. However, it may be less efficient than caching the | ||
| /// FileSystem. S3 initialization is done once per process. | ||
| class ArrowUriFileIO : public FileIO { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this instead of reusing ArrowFIleSystemFileIO?
| /// | ||
| /// \param properties The configuration properties map. | ||
| /// \return Configured S3Options. | ||
| ::arrow::fs::S3Options ConfigureS3Options( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is something that we need.
| /// This overload automatically creates an appropriate FileIO based on the "io-impl" | ||
| /// property or the warehouse location URI scheme. | ||
| /// | ||
| /// FileIO selection logic: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to add a iceberg/util/file_io_util.h to handle this logic and support reusing. Please note that Arrow Filesystem support is only available in the iceberg-bundle library, so we can only talk to the FileIO registry to create an FileIO instance.
| ::arrow::fs::S3Options options; | ||
|
|
||
| // Configure credentials | ||
| auto access_key_it = properties.find(S3Properties::kAccessKeyId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is S3Properties defined?
I have implemented Arrow FileSystem to access S3, but I'm still not sure if it meets the requirements.
There are still task or question to complete for the current PR, and it is not ready for merging yet.
Question:
Currently, the object storage options include Azure, AWS, and GCS. I have chosen AWS as the implementation for now is ok?
Task:
I need to deploy MinIO to facilitate testing access to S3, but I'm not sure where it would be best to set it up?