feat: read inputs from cloud storage (S3, GCS, Azure, R2)#3052
feat: read inputs from cloud storage (S3, GCS, Azure, R2)#3052markphelps wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
Adds native cloud storage input support (S3, GCS, Azure, R2) via the object_store crate. Overall looks solid — clean error handling with thiserror, proper GIL release via py.detach, and good test coverage.
Two minor issues:
- Docs don't mention the
azure://scheme that the code supports. CloudSlotuses inconsistent key types betweenSingleandListItemvariants.
| natively before the prediction runs. Supported schemes: | ||
|
|
||
| - Amazon S3: `s3://bucket/key` | ||
| - Google Cloud Storage: `gs://bucket/key` |
There was a problem hiding this comment.
The code also supports azure:// scheme (see cloud.rs:185), but only az:// is documented here. Users might not realize both are accepted.
| - Google Cloud Storage: `gs://bucket/key` | |
| - Azure Blob Storage: `az://container/key` (also accepts `azure://container/key`) |
|
|
||
| /// Where a cloud URL lives in the payload, so we can splice the downloaded | ||
| /// local path back into the same position after the parallel download. | ||
| enum CloudSlot { |
There was a problem hiding this comment.
CloudSlot::Single stores the raw Py<PyAny> key while CloudSlot::ListItem stores a String key. This inconsistency means ListItem re-fetches the dict entry by string, which is redundant and could behave differently for non-string keys (even though Cog inputs should always have string keys). Consider using Py<PyAny> for both variants to keep Phase 3 symmetric:
| enum CloudSlot { | |
| enum CloudSlot { | |
| /// payload[key] is a single string value. | |
| Single { key: Py<PyAny> }, | |
| /// payload[key] is a list; the URL is at list index `idx`. | |
| ListItem { key: Py<PyAny>, idx: usize }, | |
| } |
|
I'm Bonk, and I've done a quick review of your PR. The batch review has been posted to PR #3052 with two minor suggestions:
No critical issues found. Clean |
|
LGTM |
What
Adds native support for reading model inputs from cloud object storage. Inputs referencing
s3://,gs://, oraz://URLs are now downloaded by coglet (in Rust, via theobject_storecrate) before the predictor runs — the predictor receives an ordinary local file path.Re: #2742 (inputs portion).
Why
Previously Cog only understood
data:,http://, andhttps://input URLs. Cloud users had to presign every object to an HTTPS URL upstream. This letscog.Path/cog.Fileinputs reference cloud storage directly.Scope
AWS_ENDPOINT_URL).AWS_*,GOOGLE_*,AZURE_*), resolved byobject_store's builders.How it works
A new Rust module
crates/coglet-python/src/cloud.rsdetects cloud-scheme URLs and downloads them. The work is wired intoprepare_input(input.rs) as a pre-pass that runs before the existinghttp/datacoercion:File/Pathfields (single values and list items), recording each location.tokioblock_onusingfutures::try_join_all(the GIL is released viapy.detach), matching the parallelism of the existing PythonThreadPoolExecutorHTTP path.Downloaded temp files are registered in the same
PreparedInputcleanup list that already unlinks HTTP-download temp files, so they're removed after the prediction.Notes / decisions
aws-lc-sys:object_storeis configured to use rustls +ring(matching the workspace's existing TLS setup), so it does not pull inaws-lc-sys/cmake. Verified withcargo tree -i aws-lc-sys(absent).Fileinputs: cloudcog.Fileinputs are downloaded eagerly. HTTPcog.Filestreams lazily viaURLFile;cog.Pathis already eager for HTTP. Unifying lazy streaming across HTTP and cloud is left to a follow-up (see plan).Testing
cargo test -p coglet-python— 67 passed, 0 failed, 14 ignored. Newcloud::testscover scheme detection, offline store construction (S3/GCS), theInMemory-backed download round-trip, parallel-download ordering, and empty-input handling.mise run lint:rust(clippy) — clean.mise run lint:rust:deny— licenses/advisories OK.mise run build:coglet— wheel builds.mise run docs:llm—docs/llms.txtregenerated; cloud-storage inputs documented indocs/http.md.Not yet verified⚠️
object_store'sInMemorytests, but a real S3-protocol round-trip should be exercised once before merge.Follow-ups
data:input downloads from Python to Rust and design lazy streaming uniformly for HTTP + cloudFileinputs.