Add io-uring based ObjectStore for local file I/O #21673
Dandandan wants to merge 5 commits into apache:main
Introduces `datafusion-object-store-iouring`, a new crate that provides an `IoUringObjectStore` using Linux's io_uring interface for high-performance local file reads. A dedicated thread runs an io_uring event loop, and read requests (`get_opts`, `get_ranges`) are dispatched via channels. This enables batched syscalls: multiple byte-range reads (e.g., Parquet column chunks) are submitted in a single `io_uring_enter()` call instead of individual `pread()` calls.

Key design:
- Dedicated io_uring worker thread with a 256-entry submission queue
- Unbounded mpsc channel for requests, oneshot channels for responses
- Range reads batched per-request; chunked if exceeding ring capacity
- Write/list/copy/delete operations delegated to LocalFileSystem
- On non-Linux platforms, all operations fall back to LocalFileSystem
- Feature flag `io-uring` on `datafusion-execution` to opt in

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
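The "chunked if exceeding ring capacity" rule can be illustrated with a minimal sketch. It assumes the 256-entry queue size quoted in the commit message; the function name `batch_ranges` is hypothetical and is not taken from the crate. Each returned batch would correspond to one submission to the ring.

```rust
use std::ops::Range;

/// Submission-queue size quoted in the commit message.
const RING_ENTRIES: usize = 256;

/// Split one request's byte ranges into batches that each fit in the ring.
/// Hypothetical helper: the real crate may structure this differently.
fn batch_ranges(ranges: &[Range<u64>], ring_entries: usize) -> Vec<&[Range<u64>]> {
    assert!(ring_entries > 0, "ring must hold at least one entry");
    // `chunks` yields slices of at most `ring_entries` ranges, in order.
    ranges.chunks(ring_entries).collect()
}
```

With this sketch, a request for 600 ranges would be split into three batches (256, 256, and 88 ranges), while a request that fits in the ring stays a single batch.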
- Resolve merge conflicts with upstream (version 53.0.0, object_store 0.13.2)
- Update ObjectStore trait impl for 0.13.2 API changes:
  - copy/copy_if_not_exists → copy_opts(CopyOptions)
  - delete → delete_stream (required method)
  - PutMultipartOpts → PutMultipartOptions
- Import ObjectStoreExt for head() convenience method
- Enable io-uring feature by default in datafusion-execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run benchmarks

🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) using tpcds
🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) using tpch
🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (0bf32dd) to 7bfa3fb (merge-base) using clickbench_partitioned
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
- Fix missing `StreamExt` import for `.boxed()` on Linux code path
- Fix `GetResult.range` type: `Range<u64>` in object_store 0.13.2
- Add `io-uring` feature to datafusion core, forwarding to execution
- Add to core's default features so benchmarks get it automatically

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runners and Docker containers often block io_uring_setup via seccomp filters (EPERM). Instead of failing hard, probe availability at construction time and gracefully fall back to LocalFileSystem for all read operations when io_uring cannot be initialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
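The probe-then-fall-back behavior might look roughly like the following sketch. The `ReadBackend` enum and `select_backend` function are hypothetical names for illustration; the `probe` closure stands in for whatever call the crate uses to create the ring, and any error from it (including a permission error from a seccomp filter) selects the fallback.

```rust
use std::io;

/// Which read path the store ends up with after probing (hypothetical type).
#[derive(Debug, PartialEq)]
enum ReadBackend {
    IoUring,
    LocalFallback,
}

/// Choose a backend once, at construction time.
/// `probe` is a stand-in for ring setup; it is not a real crate API.
fn select_backend<F>(probe: F) -> ReadBackend
where
    F: FnOnce() -> io::Result<()>,
{
    match probe() {
        Ok(()) => ReadBackend::IoUring,
        // Any setup failure degrades to the existing local read path
        // instead of surfacing an error to the caller.
        Err(_) => ReadBackend::LocalFallback,
    }
}
```

Deciding once at construction keeps the per-read path free of repeated setup attempts, at the cost of not noticing if io_uring becomes available later.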
run benchmarks

🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) using tpch
🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) using tpcds
🤖 Benchmark running (GKE): comparing worktree-io-uring-object-store (87539f6) to 7bfa3fb (merge-base) using clickbench_partitioned
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
- Remove unnecessary `as u64` cast (meta.size is already u64 in 0.13.2)
- Allow clippy::result_large_err on execute_read_ranges (object_store::Error is large by design)
- Fix broken rustdoc links to LocalFileSystem (cfg-gated away when io-uring feature is enabled)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds an init container that writes a seccomp profile (default allowlist + io_uring_setup/enter/register) to the node's kubelet seccomp dir. The runner container then references it via Localhost seccomp type. No DaemonSet or cluster-wide changes needed — it's self-contained in each benchmark Job.
DataFusion now includes io_uring-based local file I/O (apache/datafusion#21673), but the default container seccomp profile blocks io_uring_setup. Adds a checked-in seccomp profile (services/seccomp/io-uring-allowed.json) that's the standard containerd default allowlist plus three syscalls: io_uring_setup, io_uring_enter, io_uring_register.

Deployment:
- A Pulumi DaemonSet (services/seccomp.ts) copies the profile from a ConfigMap to /var/lib/kubelet/seccomp/profiles/ on every node
- The controller and benchmark-main workflow reference it via seccompProfile.type: Localhost

No init containers needed: the DaemonSet keeps the profile present on all nodes continuously.
Which issue does this PR close?
Rationale for this change
Local file reads in DataFusion use `object_store::local::LocalFileSystem`, which issues one `pread()` per byte range. For Parquet column chunk reads this means many individual syscalls. Linux's io_uring allows batching these into a single `io_uring_enter()`.
What changes are included in this PR?
New crate `datafusion-object-store-iouring` providing `IoUringObjectStore`:
- `get_ranges()` submits all byte ranges as SQEs in one `io_uring_enter()`
- Falls back to `LocalFileSystem` when unavailable (EPERM in Docker/seccomp, non-Linux platforms)
- Feature flag `io-uring` on `datafusion-execution` and `datafusion`, enabled by default

Architecture: a dedicated worker thread runs the io_uring event loop; requests arrive over an unbounded mpsc channel and each response returns over a oneshot channel.
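The worker-thread design can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `ReadRequest`, `spawn_worker`, and `read_ranges` are hypothetical names, an in-memory buffer stands in for the open file, slicing stands in for submitting SQEs, and a single-use `std::sync::mpsc` channel stands in for a oneshot channel (the standard library has none).

```rust
use std::ops::Range;
use std::sync::mpsc;
use std::thread;

/// One read request: the ranges to read plus a channel for the reply.
struct ReadRequest {
    ranges: Vec<Range<usize>>,
    // Single-use sender standing in for a oneshot channel.
    reply: mpsc::Sender<Vec<Vec<u8>>>,
}

/// Spawn the worker and return the handle used to send it requests.
/// `data` stands in for an open file; a real worker would submit SQEs here.
fn spawn_worker(data: Vec<u8>) -> mpsc::Sender<ReadRequest> {
    // `mpsc::channel` is unbounded, matching the design described above.
    let (tx, rx) = mpsc::channel::<ReadRequest>();
    thread::spawn(move || {
        // The loop ends once every request sender has been dropped.
        for req in rx {
            let chunks: Vec<Vec<u8>> = req
                .ranges
                .iter()
                // Out-of-bounds ranges yield an empty buffer in this sketch.
                .map(|r| data.get(r.clone()).map(|s| s.to_vec()).unwrap_or_default())
                .collect();
            // Ignore the error if the caller stopped waiting.
            let _ = req.reply.send(chunks);
        }
    });
    tx
}

/// Blocking helper: send one request and wait for its reply.
fn read_ranges(
    worker: &mpsc::Sender<ReadRequest>,
    ranges: Vec<Range<usize>>,
) -> Option<Vec<Vec<u8>>> {
    let (reply, response) = mpsc::channel();
    worker.send(ReadRequest { ranges, reply }).ok()?;
    response.recv().ok()
}
```

Keeping all ring access on one thread avoids sharing the ring across threads; callers only ever touch channel endpoints. An async implementation would await the reply rather than block on `recv`.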
To actually enable io_uring in benchmark environments, a companion PR adds a custom seccomp profile: adriangb/datafusion-benchmarking#4.
Are these changes tested?
Six unit tests cover put/get round-trip, single and multi-range reads, head, list, and empty-range edge cases. The io_uring code path requires a Linux host with io_uring support; on macOS CI the fallback path is exercised.
Are there any user-facing changes?
New optional crate and default feature flag. No behavioral changes when io_uring is unavailable: reads fall back transparently to `LocalFileSystem`.