Conversation
Pull request overview
This PR is an experimental/WIP update that extends `git drs add-url` and the Git LFS custom transfer flow to support "remote URL" objects by using cloud HEAD metadata (ETag) to create sentinel LFS objects, plus new cloud-provider URL detection and download support.
Changes:
- Add cloud URL HEAD support for S3/GCS/Azure (platform detection + normalized metadata) and a generalized downloader for DRS objects.
- Rework `add-url` to accept generic cloud URLs and record remote-url aliases/checksums, writing a sentinel LFS object and pointer file.
- Update transfer/prepush/drsmap behavior and tests/scripts to accommodate remote-url objects and aliases.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/coverage-test.sh | Extends integration script to exercise add-url and remote-url behavior. |
| lfs/store_test.go | Updates test fixture to include aliases in stored DRS objects. |
| lfs/lfs.go | Adjusts dry-run parsing logic (pointer-file detection currently commented out). |
| go.mod | Bumps indirect deps and adds a local replace (currently problematic). |
| go.sum | Dependency checksum updates aligned with module changes. |
| drsmap/drs_map_test.go | Updates WriteDrsFile call signature in tests. |
| drsmap/drs_map.go | Adds logging and extends WriteDrsFile to merge aliases/checksums; adjusts object selection logic. |
| cmd/transfer/main.go | Adds remote-url download path handling and sentinel/real-object bookkeeping. |
| cmd/prepush/main.go | Adds additional discovery logging and normalizes cached OID formatting. |
| cmd/addurl/service.go | Reworks add-url workflow to use cloud HEAD + sentinel LFS object + aliases/checksums. |
| cmd/addurl/main_test.go | Updates add-url tests for ETag-based sentinel behavior; adds ensureAllowIncompletePush tests. |
| cmd/addurl/main.go | Changes CLI to add-url <cloud-url> [path] and removes AWS-specific flags (now env-driven via cloud). |
| cmd/addurl/io.go | Updates resolved-info output to show normalized remote object metadata. |
| cloud/head.go | New: provider detection + HEAD implementations for S3/GCS/Azure. |
| cloud/head_test.go | New: unit tests for platform detection, URL parsing, and HEAD behavior via httptest. |
| cloud/downloader.go | Generalizes download to work across providers (via access URL) and write into LFS objects. |
| client/indexd/register.go | Adds remote-url alias handling to skip upload and delete sentinel LFS object. |
```sh
echo "test update"
exit 1
```
This test script currently hard-fails with exit 1, which will break CI/integration runs. Please remove the debugging echo/exit 1 and complete or gate the remaining update test logic.
```go
//if size > 0 && size < 2048 {
//	if data, readErr := os.ReadFile(absPath); readErr == nil {
//		s := strings.TrimSpace(string(data))
//		if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
//			logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
//			continue
//		}
//	}
//}
```
The LFS pointer-file detection safeguard has been commented out. Without this check, the dry-run parser can treat small pointer files as upload candidates, which risks registering/uploading pointer contents instead of the real LFS object. Restore this guard (or replace it with an equivalent, tested mechanism) so pointer files are skipped reliably.
Suggested change:
```diff
-//if size > 0 && size < 2048 {
-//	if data, readErr := os.ReadFile(absPath); readErr == nil {
-//		s := strings.TrimSpace(string(data))
-//		if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
-//			logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
-//			continue
-//		}
-//	}
-//}
+if size > 0 && size < 2048 {
+	if data, readErr := os.ReadFile(absPath); readErr == nil {
+		s := strings.TrimSpace(string(data))
+		if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
+			logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
+			continue
+		}
+	}
+}
```
```diff
-if err := writePointerFile(input.path, oid, s3Info.SizeBytes); err != nil {
+if err := writePointerFile(pathArg, oid, int64(len(objMeta.ETag))); err != nil {
```
writePointerFile is being called with len(objMeta.ETag) as the LFS pointer size. The pointer size field is supposed to reflect the real object size (for UX/progress and some tooling), and objMeta.SizeBytes is already available. Please pass the remote object's byte size here instead of ETag length.
Suggested change:
```diff
-if err := writePointerFile(pathArg, oid, int64(len(objMeta.ETag))); err != nil {
+if err := writePointerFile(pathArg, oid, objMeta.SizeBytes); err != nil {
```
```diff
 file := lfs.LfsFileInfo{
-	Name: input.path,
-	Size: s3Info.SizeBytes,
+	Name: pathArg,
+	Size: int64(len(objMeta.ETag)),
 	Oid:  oid,
```
The DRS/LFS file record is being built with Size: int64(len(objMeta.ETag)). This should be the remote object size (objMeta.SizeBytes), not the length of the ETag string, otherwise downstream logic (pointer generation, index metadata, progress reporting) will be incorrect.
```go
opts.Logger.Error(fmt.Sprintf("Could not build DRS object for %s OID %s %v", file.Name, file.Oid, err))
continue

authoritativeObj, _ := DrsInfoFromOid(file.Oid)
```
DrsInfoFromOid errors are being ignored here. If the local DRS object file exists but is corrupted/unreadable, this will silently fall back to building a new object and can mask data issues. Handle the error explicitly (e.g., ignore only os.IsNotExist-like cases, otherwise log/return).
Suggested change:
```diff
-authoritativeObj, _ := DrsInfoFromOid(file.Oid)
+authoritativeObj, err := DrsInfoFromOid(file.Oid)
+if err != nil {
+	if errors.Is(err, os.ErrNotExist) {
+		opts.Logger.Debug(fmt.Sprintf("No existing DRS object for OID %s: %v", file.Oid, err))
+	} else {
+		opts.Logger.Error(fmt.Sprintf("Failed to read existing DRS object for OID %s: %v", file.Oid, err))
+		continue
+	}
+}
```
```go
} else {
	computedSHA, dstPath, err := cloud.Download(ctx, matchingRecord)
	if err != nil {
```
Inside the remote-URL branch, computedSHA, dstPath, err := cloud.Download(...) uses := and shadows the outer dstPath/err. As a result, after this block the handler will still use the old dstPath when writing the completion message, and any error from cloud.Download can be masked outside the block. Use assignment (=) and avoid shadowing so the final dstPath reflects what was actually written.
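The shadowing pitfall described above can be reproduced in isolation. This is a minimal sketch, with `download` standing in for `cloud.Download` and hypothetical return values:

```go
package main

import "fmt"

// download is a hypothetical stand-in for cloud.Download; the real call
// returns the computed SHA, destination path, and an error.
func download() (computedSHA, dstPath string, err error) {
	return "abc123", "/lfs/objects/abc123", nil
}

func main() {
	dstPath := "/tmp/original"
	var err error

	{
		// BUG pattern: ':=' declares NEW block-scoped variables that
		// shadow the outer dstPath and err; updates never escape.
		computedSHA, dstPath, err := download()
		_, _, _ = computedSHA, dstPath, err
	}
	fmt.Println(dstPath) // still "/tmp/original"

	// FIX pattern: declare the genuinely new variable separately and
	// assign with '=' so the outer dstPath and err are updated.
	var computedSHA string
	computedSHA, dstPath, err = download()
	if err != nil {
		panic(err)
	}
	_ = computedSHA
	fmt.Println(dstPath) // now "/lfs/objects/abc123"
}
```

`go vet`'s optional `shadow` analyzer can also flag this class of bug automatically.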
```go
// the file should already exist as a pointer
if _, err := os.Stat(matchingRecord.Name); err != nil {
	errMsg := fmt.Sprintf("stat %s: %v", matchingRecord.Name, err)
	logger.ErrorContext(ctx, errMsg)
	lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
	continue
}
// write the 'real file' to the expected location in the .git/lfs/objects directory
if err := os.Rename(dstPath, matchingRecord.Name); err != nil {
	errMsg := fmt.Sprintf("rename %s %s: %v", dstPath, matchingRecord.Name, err)
	logger.ErrorContext(ctx, errMsg)
	lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
	continue
}
```
This rename moves the downloaded LFS object (dstPath, which is in .git/lfs/objects/...) to matchingRecord.Name (which is the worktree path recorded in the DRS object). The transfer process should write into the path requested by Git LFS (typically under .git/lfs/objects / the downloadMsg.Path), not overwrite the worktree pointer file. This will also make the dstPath reported in the completion message incorrect because the file no longer exists at that location.
Suggested change:
```diff
 // the file should already exist as a pointer
 if _, err := os.Stat(matchingRecord.Name); err != nil {
 	errMsg := fmt.Sprintf("stat %s: %v", matchingRecord.Name, err)
 	logger.ErrorContext(ctx, errMsg)
 	lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
 	continue
 }
-// write the 'real file' to the expected location in the .git/lfs/objects directory
-if err := os.Rename(dstPath, matchingRecord.Name); err != nil {
-	errMsg := fmt.Sprintf("rename %s %s: %v", dstPath, matchingRecord.Name, err)
-	logger.ErrorContext(ctx, errMsg)
-	lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
-	continue
-}
+// leave the downloaded object at dstPath (under .git/lfs/objects) as requested by Git LFS;
+// do not overwrite the worktree pointer file at matchingRecord.Name.
```
```go
} else {
	computedSHA, dstPath, err := cloud.Download(ctx, matchingRecord)
	if err != nil {
		errMsg := fmt.Sprintf("Error downloading file from remote url %s oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)
```
This log message uses %s with matchingRecord.AccessMethods[0].AccessURL (a struct), which will produce %!s(...) at runtime and is easy to miss. Use matchingRecord.AccessMethods[0].AccessURL.URL (string) or %v if you intend to log the whole struct.
Suggested change:
```diff
-errMsg := fmt.Sprintf("Error downloading file from remote url %s oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)
+errMsg := fmt.Sprintf("Error downloading file from remote url %v oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)
```
```go
logger.Info(fmt.Sprintf("Discovered %d LFS files via dry-run: %v", len(lfsFiles), lfsFiles))

fmt.Fprintln(os.Stderr, fmt.Sprintf("Discovered %d LFS files via dry-run. %v", len(lfsFiles), lfsFiles))
```
These lines print the full lfsFiles map (potentially very large) both at Info level and to stderr. This can flood hook output and leak paths/OIDs in normal operation. Consider downgrading to Debug and logging only a summary/count (or truncating the list).
Suggested change:
```diff
-logger.Info(fmt.Sprintf("Discovered %d LFS files via dry-run: %v", len(lfsFiles), lfsFiles))
-fmt.Fprintln(os.Stderr, fmt.Sprintf("Discovered %d LFS files via dry-run. %v", len(lfsFiles), lfsFiles))
+logger.Info("Discovered LFS files via dry-run", "count", len(lfsFiles))
+fmt.Fprintf(os.Stderr, "Discovered %d LFS files via dry-run.\n", len(lfsFiles))
```
```go
//if _, err := os.Stat(lfsObject); err == nil {
//	t.Fatalf("Did not expect a LFS object at %s: %v", lfsObject, err)
//}
```
Please remove the commented-out test assertion block. Keeping dead/commented code in tests makes intent unclear and tends to rot; either assert the expected behavior or delete the block.
Suggested change (delete the block):
```diff
-//if _, err := os.Stat(lfsObject); err == nil {
-//	t.Fatalf("Did not expect a LFS object at %s: %v", lfsObject, err)
-//}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e076dc0561
```
replace github.com/calypr/data-client => /Users/walsbr/calypr/data-client
```
Remove workstation-only replace directive from module file
The new `replace github.com/calypr/data-client => /Users/walsbr/calypr/data-client` directive hard-codes a developer-local path, so builds on CI or any other machine fail module resolution before tests can run. This makes the repository non-portable and blocks normal `go test`/`go build` for everyone except the original workstation.
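A portable alternative, sketched below assuming Go workspaces (Go 1.18+), is to drop the `replace` from `go.mod` entirely and keep a developer-local `go.work` file instead. `go.work` is ignored by downstream consumers of the module and is conventionally kept out of version control:

```
go 1.22

use (
	.
	/Users/walsbr/calypr/data-client
)
```

With this in place, local builds resolve `github.com/calypr/data-client` from the workspace directory, while CI and other machines resolve the published module as usual.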
```sh
	--endpoint-url $ENDPOINT \
	--region us-east-1
echo "test update"
exit 1
```
Remove forced early exit from coverage workflow script
This unconditional `exit 1` aborts the script mid-run, so every invocation fails and all subsequent add-url regression checks in this script are skipped. In CI contexts that execute `tests/coverage-test.sh`, this turns the pipeline red regardless of actual product behavior.
```go
//if size > 0 && size < 2048 {
//	if data, readErr := os.ReadFile(absPath); readErr == nil {
//		s := strings.TrimSpace(string(data))
//		if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
//			logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
```
Restore pointer-file filtering in dry-run LFS discovery
Commenting out this guard means pointer files are no longer excluded when the worktree contains raw LFS pointers (for example with smudge disabled). `addFilesFromDryRun` will then pass pointer paths forward, and `RegisterFile` uploads bytes from `path` directly, which can publish pointer text instead of object content under the target OID.
Experimental, not for merge
At a high level, this approach "manipulates" the git-lfs client by performing a HEAD of the bucket object and creating a sentinel file.
It is an improvement in that the file does not need to be downloaded and the sha256 calculated when it is registered.
On `pull`, the client will retrieve the "real" file, and lfs will mark it as "dirty"; a subsequent `push` will update the indexd/drs server with the "real" sha256. This approach leaves key management in the hands of the users: no keys are stored, and cloud-platform-specific environment variables are harvested at run time.
There are several loose ends; for example, "out of band updates to the server file" is not supported.
If we want to continue with this thread, the code should be cleaned up (see co-pilot's comments below).
However, as is, it may be useful to highlight what is happening "under the covers" within the `git lfs` client.

ADR-0001: `add-url` Sentinel Design for Cloud Bucket References
Scope: `cmd/addurl/*`, `cloud/head.go`

Context

The `git drs add-url` command was expanded from a narrowly scoped S3-only path into a generalized cloud URL ingestion workflow. The current implementation inspects object metadata via provider-specific HEAD (or equivalent) calls and then writes a local Git LFS pointer plus DRS mapping without downloading file payload bytes.
This design exists to support "reference-style" registration of externally hosted data in cloud buckets where data movement is expensive or undesired.
The design also intentionally aligns with the Git LFS custom transfer agent architecture used by Git DRS.
`add-url` prepares pointer/mapping state so that later transfer operations can resolve and hydrate through the existing custom-transfer path instead of requiring a full data copy during registration.

Decision
We intentionally keep `add-url` experimental and not intended for production-critical workflows at this stage.

1) Experimental scope and responsibilities
`add-url` now collects bucket object metadata from:
- Amazon S3 (`s3://...`, S3 HTTPS forms)
- Google Cloud Storage (`gs://...`, `storage.googleapis.com/...`)
- Azure Blob Storage (`az://...`, `https://<account>.blob.core.windows.net/...`)

The command records metadata and writes a Git LFS pointer, but it does not ingest data bytes into local LFS content storage as the canonical file.
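The provider classification above can be sketched as follows. This is an illustrative example, not the actual `cloud/head.go` API; the function name and return values are assumptions:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// detectProvider classifies a cloud object URL by its scheme or by a
// well-known provider host. It is a hypothetical sketch of the kind of
// detection add-url performs.
func detectProvider(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "unknown"
	}
	switch u.Scheme {
	case "s3":
		return "s3"
	case "gs":
		return "gcs"
	case "az":
		return "azure"
	}
	host := u.Hostname()
	switch {
	case strings.HasSuffix(host, ".amazonaws.com"):
		return "s3"
	case host == "storage.googleapis.com":
		return "gcs"
	case strings.HasSuffix(host, ".blob.core.windows.net"):
		return "azure"
	}
	return "unknown"
}

func main() {
	for _, u := range []string{
		"s3://bucket/key",
		"https://storage.googleapis.com/bucket/key",
		"https://myacct.blob.core.windows.net/container/blob",
	} {
		fmt.Println(u, "->", detectProvider(u))
	}
}
```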
Instead, the client is responsible for loading data directly from the underlying bucket during first real data access.
2) Sentinel data based on bucket HEAD-derived ETag
When `--sha256` is not supplied, `add-url` computes a synthetic OID as:

    computed_oid = sha256(<etag string>)

It then writes a local object at the LFS object path containing sentinel bytes (the ETag string itself). This allows Git LFS plumbing to see an object where expected while avoiding immediate full-object download.
In effect, this uses sentinel content to satisfy local Git LFS shape/expectations while deferring true content validation and hydration.
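A minimal sketch of the synthetic-OID computation described above (the ETag value here is a hypothetical S3-style ETag; real values arrive via the provider HEAD response):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// syntheticOID derives the sentinel OID from the ETag string itself,
// not from the object's bytes.
func syntheticOID(etag string) string {
	sum := sha256.Sum256([]byte(etag))
	return hex.EncodeToString(sum[:])
}

func main() {
	etag := `"9b2cf535f27731c974343645a3985328"` // hypothetical ETag from HEAD
	oid := syntheticOID(etag)
	fmt.Println(oid) // 64 hex chars, stable for the same ETag
	// Because the sentinel object's content is the ETag string itself,
	// the sentinel is self-consistent: sha256(sentinel bytes) == oid.
}
```

A consequence worth noting: since the sentinel bytes are the ETag, Git LFS's own `sha256(content) == oid` invariant still holds for the sentinel, which is what lets the plumbing accept it.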
3) Real SHA256 deferred to first-use by client
The command's synthetic OID is an implementation convenience, not proof of file-content integrity.
The real content SHA256 must be computed by the client on first use (i.e., when actual bytes are fetched/streamed from cloud storage) and then treated as the authoritative checksum for integrity-sensitive operations.
4) LFS custom transfer agent remains the hydration path
`add-url` is a registration path, not a replacement transfer mechanism. The custom transfer agent remains the component responsible for runtime materialization and data movement behavior; `add-url` creates pointer + DRS mapping state.
This keeps `add-url` lightweight while preserving consistency with existing Git DRS transfer behavior.

Flow diagram
```mermaid
flowchart TD
    A[User runs git drs add-url <cloud-url>] --> B[cloud.HeadObject<br/>S3 / GCS / Azure]
    B --> C{--sha256 provided?}
    C -- Yes --> D[Use provided SHA256 as OID]
    C -- No --> E[Compute synthetic OID = sha256 ETag]
    E --> F[Write sentinel object bytes = ETag<br/>.git/lfs/objects/...]
    D --> G[Write Git LFS pointer file]
    F --> G
    G --> H[Write DRS mapping with remote URL]
    H --> I[Commit/push as normal]
    I --> J[Git LFS custom transfer agent executes on transfer]
    J --> K[Client loads bytes from bucket]
    K --> L[Client computes real SHA256 on first use]
```

Sequence diagram (`add-url` registration vs transfer-time hydration)
Consequences

Positive
- Registration is lightweight: the file does not need to be downloaded and its sha256 computed at `add-url` time.

Negative / Risks
- The synthetic OID is an implementation convenience, not proof of file-content integrity; the real SHA256 is only known after first hydration.

Guardrails
- Treat `add-url` as experimental, with explicit user-facing caveats.
- Prefer `--sha256` when the authoritative content hash is already known.

Rejected alternatives
- Always require explicit `--sha256` and fail otherwise.
- Always download the full object at `add-url` time to compute the true SHA256.
- Store only pointer metadata and no local LFS object artifact.
Rollout / follow-up