
feat: HuggingFace Hub storage backend and CDC table properties #2375

Open
kszucs wants to merge 3 commits into apache:main from kszucs:opendal-hf

Conversation


@kszucs kszucs commented Apr 27, 2026

Adds two opt-in capabilities aimed at storing Iceberg tables on HuggingFace Hub with content-defined chunking for efficient deduplication.

## HuggingFace Hub storage backend

New `opendal-hf` feature on `iceberg-storage-opendal` (off by default, included in `opendal-all`) that wires HuggingFace's OpenDAL service into `FileIO`. Paths use the form:

  hf://<repo_type>/<owner>/<repo>[@<revision>]/<path_in_repo>

where `repo_type` must be one of `models`, `datasets`, `spaces`, or `buckets` (XET-backed object storage); the repo-type prefix is mandatory, with no implicit default. Configuration is passed via `FileIOBuilder` properties:

- `hf.token` — API token (required for private repos / writes)
- `hf.endpoint` — Hub endpoint, defaults to https://huggingface.co
- `hf.revision` — fallback revision when a path has no `@<revision>`

The `OpenDalResolvingStorage` recognises the `hf` scheme and lazily constructs a per-scheme storage instance. `delete_stream` groups paths by `<repo_type>/<repo_id>` so that bucket and dataset paths to the same repo do not share an operator.
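The path grammar above can be sketched as a small standalone parser. This is a hypothetical illustration only: the names `HfUri` and `parse_hf_uri` and the `Option`-based error handling are assumptions for brevity, not the PR's actual implementation.

```rust
// Hypothetical sketch of the hf:// grammar; the real parser in
// iceberg-storage-opendal reports proper errors instead of Option.

#[derive(Debug, PartialEq)]
struct HfUri {
    repo_type: String,        // models | datasets | spaces | buckets
    owner: String,
    repo: String,
    revision: Option<String>, // from `@<revision>`; may be percent-encoded
    path: String,             // path inside the repo
}

fn parse_hf_uri(uri: &str) -> Option<HfUri> {
    let rest = uri.strip_prefix("hf://")?;
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?.to_string();
    // The repo-type prefix is mandatory; there is no implicit default.
    if !["models", "datasets", "spaces", "buckets"].contains(&repo_type.as_str()) {
        return None;
    }
    let owner = parts.next()?.to_string();
    let repo_seg = parts.next()?;
    let path = parts.next().unwrap_or("").to_string();
    // Split the optional `@<revision>` suffix off the repo segment. A ref
    // such as refs/pr/7 appears percent-encoded (refs%2Fpr%2F7) so its
    // slashes do not interfere with path splitting; it stays opaque here.
    let (repo, revision) = match repo_seg.split_once('@') {
        Some((r, rev)) => (r.to_string(), Some(rev.to_string())),
        None => (repo_seg.to_string(), None),
    };
    Some(HfUri { repo_type, owner, repo, revision, path })
}
```

For example, `parse_hf_uri("hf://datasets/acme/tbl@refs%2Fpr%2F7/data/f.parquet")` yields repo type `datasets`, revision `refs%2Fpr%2F7`, and path `data/f.parquet`.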

## CDC (content-defined chunking) table properties

New table properties under the `parquet.cdc.*` namespace:

- `parquet.cdc.min_chunk_size` (bytes)
- `parquet.cdc.max_chunk_size` (bytes)
- `parquet.cdc.norm_level` (gearhash bit adjustment, i32)

CDC is implicitly enabled if any `parquet.cdc.*` property is present; unset fields fall back to `parquet::file::properties::CdcOptions::default()` so the Iceberg layer stays in sync with parquet's own defaults. A new `iceberg::writer::create_writer_properties()` helper builds parquet `WriterProperties` from `TableProperties`, applying CDC options when configured. The DataFusion physical write plan uses this helper, so tables created through DataFusion automatically pick up CDC settings.
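The "implicitly enabled" rule can be illustrated with a small self-contained sketch. The struct, its default values, and the function name below are placeholders for illustration; the real code uses `parquet::file::properties::CdcOptions` and that type's actual defaults.

```rust
use std::collections::HashMap;

// Placeholder mirror of the CDC knobs; the real defaults come from
// parquet::file::properties::CdcOptions::default(), not these numbers.
#[derive(Debug, Clone, PartialEq)]
struct CdcOptions {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        Self { min_chunk_size: 256 * 1024, max_chunk_size: 1024 * 1024, norm_level: 0 }
    }
}

// CDC is enabled iff at least one parquet.cdc.* property is present;
// every unset field falls back to the default above.
fn cdc_options_from_props(props: &HashMap<String, String>) -> Option<CdcOptions> {
    if !props.keys().any(|k| k.starts_with("parquet.cdc.")) {
        return None;
    }
    let mut opts = CdcOptions::default();
    if let Some(v) = props.get("parquet.cdc.min_chunk_size") {
        opts.min_chunk_size = v.parse().ok()?;
    }
    if let Some(v) = props.get("parquet.cdc.max_chunk_size") {
        opts.max_chunk_size = v.parse().ok()?;
    }
    if let Some(v) = props.get("parquet.cdc.norm_level") {
        opts.norm_level = v.parse().ok()?;
    }
    Some(opts)
}
```

Setting only `parquet.cdc.min_chunk_size` therefore still enables CDC, with the other two fields at their defaults; an empty property map disables it entirely.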

## Other changes

- `iceberg-storage-opendal`: migrated S3 credential plumbing from `reqsign 0.16` to `reqsign-aws-v4` / `reqsign-core` 3.0 (required by the opendal version that adds HF support). `CustomAwsCredentialLoader` now wraps any `ProvideCredential<Credential = AwsCredential>` rather than `Arc<dyn AwsCredentialLoad>`.
- `OpenDalResolvingStorage`: replaced `opendal::Scheme` with a canonical `&'static str` cache key, removing the dependency on opendal's `Scheme` enum (which no longer exposes all needed variants in 0.56).
- `OpenDalStorage::remove_prefix`: switched from `remove_all` to `delete_with(...).recursive(true)` for the new opendal API.

## Tests

- Rust unit tests for `HfUri` parsing (repo types, revisions including `refs/convert/parquet` and `refs/pr/N`, percent-encoded refs, edge cases, rejection of paths missing the repo-type prefix) and CDC property parsing.
- Rust integration tests in `crates/storage/opendal/tests/file_io_hf_test.rs` guarded on `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, `HF_OPENDAL_DATASET` env vars; tests skip if any required env var is unset.
- Python tests in `bindings/python/tests/test_hf_and_cdc.py` covering CDC property persistence, PyIceberg writes with CDC, DataFusion read-back, and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` and `HF_OPENDAL_TABLE_METADATA`).

## Dependencies

`opendal` is pinned to a git revision of apache/opendal that includes the `services-hf` backend. Once a release containing HF support is published on crates.io, this should be flipped back to a version pin.

@kszucs kszucs force-pushed the opendal-hf branch 2 times, most recently from 6af3db8 to 0f6b02e on April 27, 2026 at 08:15
kszucs added 2 commits April 27, 2026 10:23
The error message changed from reqsign 0.16's "no valid credential found and anonymous access is not allowed" to reqsign-core 3.0's "failed to load signing credential". Assert on the stable substring "credential" instead.
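The stable-substring idea behind that commit can be sketched directly; the helper name below is hypothetical (the actual change merely adjusts a test assertion), but it shows why matching "credential" survives the reqsign upgrade.

```rust
// Both reqsign 0.16 and reqsign-core 3.0 mention "credential" in their
// error text, so matching that substring keeps the test version-agnostic.
fn looks_like_credential_error(msg: &str) -> bool {
    msg.contains("credential")
}
```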
Comment thread on `Cargo.toml`:

      murmur3 = "0.5.2"
      once_cell = "1.20"
    - opendal = "0.55.0"
    + opendal = { git = "https://github.com/apache/opendal", rev = "f385a8e5c598dc36fe869a55175fb1400148f3a8" }
OpenDAL 0.56, which ships HF support, hasn't been released yet.

