feat: HuggingFace Hub storage backend and CDC table properties #2375

Open

kszucs wants to merge 3 commits into apache:main from
Conversation
Adds two opt-in capabilities aimed at storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

## HuggingFace Hub storage backend

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

```
hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>
```

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets` (XET-backed object storage). The prefix is mandatory; there is no implicit default.

Configuration is passed via `FileIOBuilder` properties:

- `hf.token`: API token (required for private repos / writes)
- `hf.endpoint`: Hub endpoint, defaults to https://huggingface.co
- `hf.revision`: fallback revision when a path has no `@<revision>`

The `OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same repo do not share an operator.

## CDC (content-defined chunking) table properties

New table properties under the `parquet.cdc.*` namespace:

- `parquet.cdc.min_chunk_size` (bytes)
- `parquet.cdc.max_chunk_size` (bytes)
- `parquet.cdc.norm_level` (gearhash bit adjustment, i32)

CDC is implicitly enabled if any `parquet.cdc.*` property is present; unset fields fall back to `parquet::file::properties::CdcOptions::default()` so the Iceberg layer stays in sync with parquet's own defaults.

A new `iceberg::writer::create_writer_properties()` helper builds parquet `WriterProperties` from `TableProperties`, applying CDC options when configured. The DataFusion physical write plan uses this helper, so tables created through DataFusion automatically pick up CDC settings.

## Other changes

- `iceberg-storage-opendal`: migrated S3 credential plumbing from `reqsign 0.16` to `reqsign-aws-v4` / `reqsign-core` 3.0 (required by the opendal version that adds HF support). `CustomAwsCredentialLoader` now wraps any `ProvideCredential<Credential = AwsCredential>` rather than `Arc<dyn AwsCredentialLoad>`.
- `OpenDalResolvingStorage`: replaced `opendal::Scheme` with a canonical `&'static str` cache key, removing the dependency on opendal's `Scheme` enum (which no longer exposes all needed variants in 0.56).
- `OpenDalStorage::remove_prefix`: switched from `remove_all` to `delete_with(...).recursive(true)` for the new opendal API.

## Tests

- Rust unit tests for `HfUri` parsing (repo types, revisions including `refs/convert/parquet` and `refs/pr/N`, percent-encoded refs, edge cases, rejection of paths missing the repo-type prefix) and CDC property parsing.
- Rust integration tests in `crates/storage/opendal/tests/file_io_hf_test.rs` guarded on the `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, and `HF_OPENDAL_DATASET` env vars; tests skip if any required env var is unset.
- Python tests in `bindings/python/tests/test_hf_and_cdc.py` covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` and `HF_OPENDAL_TABLE_METADATA`).

## Dependencies

`opendal` is pinned to a git revision of apache/opendal that includes the `services-hf` backend. Once a release containing HF support is published on crates.io, this should be flipped back to a version pin.
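The `hf://` path grammar described above can be sketched as a small parser. This is a minimal illustration, not the PR's actual implementation; the `HfUri` struct shape and `parse_hf_uri` are hypothetical, and refs containing `/` (e.g. `refs/pr/1`) are assumed percent-encoded so the revision stays within one path segment, as the PR's tests for percent-encoded refs suggest.

```rust
// Hypothetical sketch of the hf:// URI grammar:
//   hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>
#[derive(Debug, PartialEq)]
struct HfUri {
    repo_type: String,        // models | datasets | spaces | buckets
    repo_id: String,          // "<owner>/<repo>"
    revision: Option<String>, // from an optional "@<revision>" suffix
    path: String,             // path inside the repo
}

fn parse_hf_uri(uri: &str) -> Result<HfUri, String> {
    let rest = uri
        .strip_prefix("hf://")
        .ok_or_else(|| format!("not an hf:// URI: {uri}"))?;
    let mut parts = rest.splitn(4, '/');
    // The repo-type prefix is mandatory; there is no implicit default.
    let repo_type = parts.next().unwrap_or("");
    if !matches!(repo_type, "models" | "datasets" | "spaces" | "buckets") {
        return Err(format!("repo type prefix is mandatory, got {repo_type:?}"));
    }
    let owner = parts.next().ok_or_else(|| "missing owner".to_string())?;
    let repo_seg = parts.next().ok_or_else(|| "missing repo".to_string())?;
    let path = parts.next().unwrap_or("").to_string();
    // "@<revision>" selects a ref; multi-segment refs are percent-encoded.
    let (repo, revision) = match repo_seg.split_once('@') {
        Some((r, rev)) => (r, Some(rev.to_string())),
        None => (repo_seg, None),
    };
    Ok(HfUri {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{repo}"),
        revision,
        path,
    })
}
```

The `repo_id` key this produces is also what a `delete_stream`-style grouping could bucket on, so paths to the same repo share one operator.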
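The "implicitly enabled if any `parquet.cdc.*` property is present" rule can be sketched as follows. `CdcConfig` here is an illustrative stand-in for parquet's `CdcOptions`, and the default values are placeholders (the real code defers to `CdcOptions::default()`), so only the enable/fallback logic is meant literally.

```rust
use std::collections::HashMap;

// Illustrative stand-in for parquet::file::properties::CdcOptions.
#[derive(Debug, Clone, PartialEq)]
struct CdcConfig {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

impl Default for CdcConfig {
    fn default() -> Self {
        // Placeholder values; the real defaults come from CdcOptions::default().
        Self { min_chunk_size: 256 * 1024, max_chunk_size: 1024 * 1024, norm_level: 0 }
    }
}

/// CDC is enabled iff any `parquet.cdc.*` key is present; unset fields
/// fall back to the defaults. Unparseable values yield None here for
/// brevity (real code would surface a proper error).
fn cdc_from_table_properties(props: &HashMap<String, String>) -> Option<CdcConfig> {
    let keys = [
        "parquet.cdc.min_chunk_size",
        "parquet.cdc.max_chunk_size",
        "parquet.cdc.norm_level",
    ];
    if !keys.iter().any(|k| props.contains_key(*k)) {
        return None; // no CDC property set: CDC stays disabled
    }
    let mut cfg = CdcConfig::default();
    if let Some(v) = props.get("parquet.cdc.min_chunk_size") {
        cfg.min_chunk_size = v.parse().ok()?;
    }
    if let Some(v) = props.get("parquet.cdc.max_chunk_size") {
        cfg.max_chunk_size = v.parse().ok()?;
    }
    if let Some(v) = props.get("parquet.cdc.norm_level") {
        cfg.norm_level = v.parse().ok()?;
    }
    Some(cfg)
}
```

A `create_writer_properties()`-style helper would then apply `Some(cfg)` to the parquet `WriterProperties` builder and leave it untouched on `None`.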
The error message changed from reqsign 0.16's 'no valid credential found and anonymous access is not allowed' to reqsign-core 3.0's 'failed to load signing credential'. Assert on the stable substring 'credential' instead.
kszucs commented on Apr 27, 2026
```diff
 murmur3 = "0.5.2"
 once_cell = "1.20"
-opendal = "0.55.0"
+opendal = { git = "https://github.com/apache/opendal", rev = "f385a8e5c598dc36fe869a55175fb1400148f3a8" }
```
Member (Author)

OpenDAL 0.56, which ships HF support, hasn't been released yet.