feat: integration s3 with arrow filesystem #548

MisterRaindrop · 2026-01-29T11:41:27Z

I have implemented S3 access through the Arrow FileSystem, but I'm still not sure whether it meets the requirements.

This PR still has an open question and an outstanding task, so it is not ready for merging yet.

Question:
Currently, the object storage options include Azure, AWS, and GCS. Is it OK that I have chosen AWS as the implementation for now?

Task:
I need to deploy MinIO to test access to S3, but I'm not sure where it would be best to set it up.

wgtmac · 2026-01-29T14:15:56Z

Thanks for adding this!

> Is it OK that I have chosen AWS as the implementation for now?

Yes, I believe this is worth doing. I was thinking of reusing ArrowFileSystemFileIO by passing it an arrow::FileSystem for S3, which Arrow supports when built with ARROW_S3=ON. But I haven't explored it yet.

> I need to deploy MinIO to test access to S3

There is a related discussion about MinIO's recent license change: https://lists.apache.org/thread/vnw9jonmfcsz6bwojhfch1nmywyl50h3. I'm not sure if there is a good alternative for testing.

MisterRaindrop · 2026-01-30T01:44:42Z

I recommend using MinIO. It is relatively stable and suitable for the current project development phase. Once the community reaches a consensus, the cost of replacing MinIO will not be high.

wgtmac · 2026-01-31T02:06:53Z

I think it is fine to use MinIO for now to unblock us. Let me know what you think of my proposed approach above. We might also need to add a FileIO registry that provides default implementations on our side and lets users override them with their own implementations for S3 and other backends. The key in the FileIO registry can be tied to the table property io-impl.
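To make the registry idea concrete, here is a minimal sketch of a string-keyed factory registry. Everything in it is an assumption for illustration: `FileIO` is a stand-in for the real interface, and `FileIORegistry`, `FileIOFactory`, `Properties`, and `NamedFileIO` are hypothetical names, not the API proposed in this PR.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for iceberg::FileIO; only here so the sketch compiles.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::string Name() const = 0;
};

using Properties = std::unordered_map<std::string, std::string>;
using FileIOFactory = std::function<std::unique_ptr<FileIO>(const Properties&)>;

// Hypothetical registry keyed by the value of the "io-impl" property.
class FileIORegistry {
 public:
  // Registers a factory. A later registration under the same key replaces the
  // earlier one, which is how a user would override a default implementation.
  void Register(const std::string& impl, FileIOFactory factory) {
    factories_[impl] = std::move(factory);
  }

  // Returns nullptr when no factory is registered under `impl`.
  std::unique_ptr<FileIO> Create(const std::string& impl,
                                 const Properties& properties) const {
    auto it = factories_.find(impl);
    if (it == factories_.end()) return nullptr;
    return it->second(properties);
  }

 private:
  std::unordered_map<std::string, FileIOFactory> factories_;
};

// Trivial FileIO used only to demonstrate registration and override.
class NamedFileIO : public FileIO {
 public:
  explicit NamedFileIO(std::string name) : name_(std::move(name)) {}
  std::string Name() const override { return name_; }

 private:
  std::string name_;
};
```

In this sketch a library would register its defaults at startup, and a user who wants a different S3 implementation would call `Register` again with the same key before any table is loaded.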

wgtmac · 2026-01-31T02:14:06Z

We may also need to add top-level CMake options like ICEBERG_S3, ICEBERG_AZURE, ICEBERG_GCS to mirror Arrow equivalents and vendor Arrow with ARROW_S3=ON. This might be tricky in the CI because we need to deal with Arrow's dependency on AWS SDK and others.

zhjwpku · 2026-01-31T09:17:28Z

> I recommend using MinIO. It is relatively stable and suitable for the current project development phase. Once the community reaches a consensus, the cost of replacing MinIO will not be high.

FYI, there is a PR to replace MinIO with RustFS, apache/iceberg#14928

MisterRaindrop · 2026-02-02T05:50:32Z

> Yes, I believe this is worth doing. I was thinking of reusing ArrowFileSystemFileIO by passing it an arrow::FileSystem for S3, which Arrow supports when built with ARROW_S3=ON. But I haven't explored it yet.

Reusing ArrowFileSystemFileIO works. I followed MakeLocalFileIO and implemented a simple MakeS3FileIO interface on top of the Arrow filesystem.

> I think it is fine to use MinIO for now to unblock us. Let me know what you think of my proposed approach above. We might also need to add a FileIO registry that provides default implementations on our side and lets users override them with their own implementations for S3 and other backends. The key in the FileIO registry can be tied to the table property io-impl.

Do you mean something like io-impl=org.apache.iceberg.aws.s3.S3FileIO?

That is equivalent to setting the io-impl string in the catalog's properties. Then RestCatalog has the FileIORegistry look up the implementation in the io-impl map. Is that roughly how it works? If so, I can try implementing some simple code to see if it's correct.

wgtmac · 2026-02-02T06:49:36Z

Yes, I think it looks reasonable.

MisterRaindrop · 2026-02-03T07:47:11Z

The current code is only a simple implementation. Could you help me check whether it is OK?
@wgtmac

wgtmac · 2026-02-03T10:07:14Z

Thanks for the update! I'm busy with some internal stuff these days. Will try to review this as soon as possible.

wgtmac

I just took a quick pass on the latest commit. Can we simplify the implementation like this:

Use a CMake option to enable S3.
Define reserved Iceberg properties for S3 and add functions to convert them to Arrow S3 options.
To create a concrete S3FileIO, use the Arrow API to create an S3FileSystem and wrap it in ArrowFileSystemFileIO.
Register the factory that creates the S3 FileIO with the registry before use.
Add a FileIO utility that creates the FileIO instance based on the various conditions.

wgtmac · 2026-02-09T08:24:09Z

cmake_modules/IcebergThirdpartyToolchain.cmake

  # Work around undefined symbol: arrow::ipc::ReadSchema(arrow::io::InputStream*, arrow::ipc::DictionaryMemo*)
  set(ARROW_IPC ON)
  set(ARROW_FILESYSTEM ON)
+  set(ARROW_S3 ON)


Can we add a CMake option ICEBERG_S3 and only turn on ARROW_S3 when ICEBERG_S3 is on?

wgtmac · 2026-02-09T09:00:15Z

src/iceberg/arrow/arrow_s3_file_io.cc

+
+#include <arrow/filesystem/filesystem.h>
+#include <arrow/filesystem/localfs.h>
+#if __has_include(<arrow/filesystem/s3fs.h>)


If we add an ICEBERG_S3 option, we don't need this check.

wgtmac · 2026-02-09T09:03:27Z

src/iceberg/catalog/rest/rest_catalog.cc

+    impl_name = io_impl->second;
+  } else {
+    // Use default based on warehouse URI scheme
+    if (warehouse.rfind("s3://", 0) == 0) {


Why use rfind?

BTW, shouldn't we use uri instead of the warehouse property?
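For context on the first question: `s.rfind(prefix, 0) == 0` is a common pre-C++20 idiom for a prefix test, because a reverse search starting at position 0 can only match at index 0. The sketch below puts it next to a more explicit C++17 form; the helper names are hypothetical. Where C++20 is available, `std::string_view::starts_with` expresses the same check directly.

```cpp
#include <string>
#include <string_view>

// Pre-C++20 idiom: rfind with pos = 0 can only report a match at index 0,
// so the comparison is true exactly when `s` begins with `prefix`.
inline bool StartsWithRfind(const std::string& s, const std::string& prefix) {
  return s.rfind(prefix, 0) == 0;
}

// More explicit C++17 form: compare the first prefix.size() characters.
// substr clamps the count, so a short `s` simply compares unequal.
inline bool HasPrefix(std::string_view s, std::string_view prefix) {
  return s.substr(0, prefix.size()) == prefix;
}
```

Both helpers agree on every input; the second one states the intent more plainly, which is probably what the review question is getting at.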

wgtmac · 2026-02-09T09:21:05Z

src/iceberg/arrow/arrow_s3_file_io.cc

+/// This implementation is thread-safe as it creates a new FileSystem instance
+/// for each operation. However, it may be less efficient than caching the
+/// FileSystem. S3 initialization is done once per process.
+class ArrowUriFileIO : public FileIO {


Why do we need this instead of reusing ArrowFileSystemFileIO?

wgtmac · 2026-02-09T09:30:46Z

src/iceberg/arrow/arrow_s3_file_io.cc

+///
+/// \param properties The configuration properties map.
+/// \return Configured S3Options.
+::arrow::fs::S3Options ConfigureS3Options(


I agree this is something that we need.
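As a rough illustration of what such a conversion could look like, the sketch below maps a property map onto a plain struct. It makes several assumptions: `S3OptionsSketch` is a hypothetical stand-in for `::arrow::fs::S3Options` so the example builds without Arrow, the property keys are assumed to mirror the names used by the Java implementation, and splitting the endpoint into a scheme and a host:port part is an assumption about how the Arrow options expect it.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

using Properties = std::unordered_map<std::string, std::string>;

// Assumed property keys (not taken from this PR).
inline constexpr const char* kAccessKeyId = "s3.access-key-id";
inline constexpr const char* kSecretAccessKey = "s3.secret-access-key";
inline constexpr const char* kEndpoint = "s3.endpoint";
inline constexpr const char* kRegion = "client.region";

// Hypothetical stand-in for ::arrow::fs::S3Options.
struct S3OptionsSketch {
  std::optional<std::string> access_key;
  std::optional<std::string> secret_key;
  std::string endpoint_override;  // host[:port], without the scheme
  std::string region;
  std::string scheme = "https";
};

inline std::optional<std::string> FindProperty(const Properties& properties,
                                               const char* key) {
  auto it = properties.find(key);
  if (it == properties.end()) return std::nullopt;
  return it->second;
}

inline S3OptionsSketch ConfigureS3OptionsSketch(const Properties& properties) {
  S3OptionsSketch options;

  // Only set explicit credentials when both halves are present; otherwise
  // leave them unset so a default credential chain could apply.
  auto access_key = FindProperty(properties, kAccessKeyId);
  auto secret_key = FindProperty(properties, kSecretAccessKey);
  if (access_key && secret_key) {
    options.access_key = access_key;
    options.secret_key = secret_key;
  }

  // Split "http://localhost:9000" into scheme and host:port.
  if (auto endpoint = FindProperty(properties, kEndpoint)) {
    const auto pos = endpoint->find("://");
    if (pos != std::string::npos) {
      options.scheme = endpoint->substr(0, pos);
      options.endpoint_override = endpoint->substr(pos + 3);
    } else {
      options.endpoint_override = *endpoint;
    }
  }

  if (auto region = FindProperty(properties, kRegion)) {
    options.region = *region;
  }
  return options;
}
```

Keeping the conversion as a pure function of the property map makes it easy to unit test without a running S3 endpoint such as MinIO.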

wgtmac · 2026-02-09T09:43:55Z

src/iceberg/catalog/rest/rest_catalog.h

+  /// This overload automatically creates an appropriate FileIO based on the "io-impl"
+  /// property or the warehouse location URI scheme.
+  ///
+  /// FileIO selection logic:


It is better to add an iceberg/util/file_io_util.h to handle this logic so that it can be reused. Please note that Arrow FileSystem support is only available in the iceberg-bundle library, so we can only talk to the FileIO registry to create a FileIO instance.
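One way to read this suggestion is a small helper that only decides which registry key to use and never touches Arrow types. The sketch below is an assumption-heavy illustration: `ResolveFileIOImpl`, `SchemeOf`, and the keys `"local"` and `"s3"` are hypothetical names, and the fallback rules (explicit io-impl first, then URI scheme) are taken from the selection logic described in the doc comment under review.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

using Properties = std::unordered_map<std::string, std::string>;

// Hypothetical registry keys for the built-in implementations.
inline constexpr std::string_view kLocalImpl = "local";
inline constexpr std::string_view kS3Impl = "s3";

// Returns the part before "://", or an empty view for scheme-less paths.
inline std::string_view SchemeOf(std::string_view location) {
  const auto pos = location.find("://");
  return pos == std::string_view::npos ? std::string_view{}
                                       : location.substr(0, pos);
}

// Picks the registry key: an explicit "io-impl" property wins, otherwise the
// URI scheme of `location` decides. Unknown schemes yield std::nullopt so the
// caller can report a clear error instead of guessing.
inline std::optional<std::string> ResolveFileIOImpl(const Properties& properties,
                                                    std::string_view location) {
  if (auto it = properties.find("io-impl");
      it != properties.end() && !it->second.empty()) {
    return it->second;
  }
  const std::string_view scheme = SchemeOf(location);
  if (scheme.empty() || scheme == "file") return std::string(kLocalImpl);
  if (scheme == "s3") return std::string(kS3Impl);
  return std::nullopt;
}
```

The caller would then pass the returned key to the FileIO registry, which keeps the core library free of any direct dependency on the Arrow-backed implementations.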

wgtmac · 2026-02-09T09:47:15Z

src/iceberg/arrow/arrow_s3_file_io.cc

+  ::arrow::fs::S3Options options;
+
+  // Configure credentials
+  auto access_key_it = properties.find(S3Properties::kAccessKeyId);


Where is S3Properties defined?

feat: integration s3 with arrow filesystem

5197fa9

MisterRaindrop force-pushed the arrowfs_with_s3 branch from 8436b72 to 5197fa9 Compare February 3, 2026 07:41

wgtmac reviewed Feb 9, 2026

3 participants