Skip to content

feat: read inputs from cloud storage (S3, GCS, Azure, R2)#3052

Open
markphelps wants to merge 7 commits into
mainfrom
cloud-storage-inputs
Open

feat: read inputs from cloud storage (S3, GCS, Azure, R2)#3052
markphelps wants to merge 7 commits into
mainfrom
cloud-storage-inputs

Conversation

@markphelps

@markphelps markphelps commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

What

Adds native support for reading model inputs from cloud object storage. Inputs referencing s3://, gs://, or az:// URLs are now downloaded by coglet (in Rust, via the object_store crate) before the predictor runs — the predictor receives an ordinary local file path.

Re: #2742 (inputs portion).

Why

Previously Cog only understood data:, http://, and https:// input URLs. Cloud users had to presign every object to an HTTPS URL upstream. This lets cog.Path/cog.File inputs reference cloud storage directly.

Scope

  • Inputs only. Uploading outputs to cloud storage is a deliberate follow-up.
  • Providers: S3, GCS, Azure Blob, and S3-compatible stores (Cloudflare R2, MinIO via AWS_ENDPOINT_URL).
  • Credentials come from each provider's standard environment variables (AWS_*, GOOGLE_*, AZURE_*), resolved by object_store's builders.

How it works

A new Rust module crates/coglet-python/src/cloud.rs detects cloud-scheme URLs and downloads them. The work is wired into prepare_input (input.rs) as a pre-pass that runs before the existing http/data coercion:

  1. Collect every cloud URL across File/Path fields (single values and list items), recording each location.
  2. Download all concurrently in a single tokio block_on using futures::try_join_all (the GIL is released via py.detach), matching the parallelism of the existing Python ThreadPoolExecutor HTTP path.
  3. Splice the resulting local temp-file paths back into the input dict in their original positions.

Downloaded temp files are registered in the same PreparedInput cleanup list that already unlinks HTTP-download temp files, so they're removed after the prediction.

Notes / decisions

  • No aws-lc-sys: object_store is configured to use rustls + ring (matching the workspace's existing TLS setup), so it does not pull in aws-lc-sys/cmake. Verified with cargo tree -i aws-lc-sys (absent).
  • Eager File inputs: cloud cog.File inputs are downloaded eagerly. HTTP cog.File streams lazily via URLFile; cog.Path is already eager for HTTP. Unifying lazy streaming across HTTP and cloud is left to a follow-up (see plan).

Testing

  • cargo test -p coglet-python67 passed, 0 failed, 14 ignored. New cloud::tests cover scheme detection, offline store construction (S3/GCS), the InMemory-backed download round-trip, parallel-download ordering, and empty-input handling.
  • mise run lint:rust (clippy) — clean.
  • mise run lint:rust:deny — licenses/advisories OK.
  • mise run build:coglet — wheel builds.
  • mise run docs:llmdocs/llms.txt regenerated; cloud-storage inputs documented in docs/http.md.

Not yet verified ⚠️

  • End-to-end MinIO/S3 smoke test could not be run in this environment (no Docker image pull / no local S3 server). Download logic is covered by object_store's InMemory tests, but a real S3-protocol round-trip should be exercised once before merge.
  • Integration test (testscript + MinIO) not added (optional, same infra dependency).

Follow-ups

  • Output uploads to cloud storage.
  • Move HTTP/data: input downloads from Python to Rust and design lazy streaming uniformly for HTTP + cloud File inputs.

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

Adds native cloud storage input support (S3, GCS, Azure, R2) via the object_store crate. Overall looks solid — clean error handling with thiserror, proper GIL release via py.detach, and good test coverage.

Two minor issues:

  1. Docs don't mention the azure:// scheme that the code supports.
  2. CloudSlot uses inconsistent key types between Single and ListItem variants.

Comment thread docs/http.md
natively before the prediction runs. Supported schemes:

- Amazon S3: `s3://bucket/key`
- Google Cloud Storage: `gs://bucket/key`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code also supports azure:// scheme (see cloud.rs:185), but only az:// is documented here. Users might not realize both are accepted.

Suggested change
- Google Cloud Storage: `gs://bucket/key`
- Azure Blob Storage: `az://container/key` (also accepts `azure://container/key`)


/// Where a cloud URL lives in the payload, so we can splice the downloaded
/// local path back into the same position after the parallel download.
enum CloudSlot {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloudSlot::Single stores the raw Py<PyAny> key while CloudSlot::ListItem stores a String key. This inconsistency means ListItem re-fetches the dict entry by string, which is redundant and could behave differently for non-string keys (even though Cog inputs should always have string keys). Consider using Py<PyAny> for both variants to keep Phase 3 symmetric:

Suggested change
enum CloudSlot {
enum CloudSlot {
/// payload[key] is a single string value.
Single { key: Py<PyAny> },
/// payload[key] is a list; the URL is at list index `idx`.
ListItem { key: Py<PyAny>, idx: usize },
}

@ask-bonk

ask-bonk Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

I'm Bonk, and I've done a quick review of your PR. The batch review has been posted to PR #3052 with two minor suggestions:

  1. Documentation gapdocs/http.md doesn't mention the azure:// scheme that is_cloud_url() supports.
  2. Code consistencyCloudSlot::Single stores Py<PyAny> key while CloudSlot::ListItem stores String key; using the same type for both would make Phase 3 cleaner.

No critical issues found. Clean thiserror usage, proper GIL release with py.detach, and good test coverage.

github run

@ask-bonk

ask-bonk Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

LGTM

github run

@markphelps markphelps marked this pull request as ready for review June 9, 2026 21:11
@markphelps markphelps requested a review from a team as a code owner June 9, 2026 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant