
EXPERIMENTAL: feature/add-url-md5-sha256#216

Open
bwalsh wants to merge 3 commits into main from feature/add-url-md5-sha256

Conversation

@bwalsh
Contributor

@bwalsh bwalsh commented Mar 26, 2026

Experimental, not for merge

At a high level, this approach "manipulates" the git-lfs client by performing a HEAD of the bucket object and creating a sentinel file.

It is an improvement in that the file does not need to be downloaded, nor its sha256 calculated, at the time it is registered.

  • For now, that calculation is still done by the client, but it is deferred until "first use".
  • i.e. on first pull, the client will retrieve the "real" file, and LFS will mark it as "dirty". A subsequent push will then update the indexd/DRS server with the "real" sha256.

This approach leaves key management in the hands of the users. No keys are stored; cloud-platform-specific environment variables are harvested at run time.

There are several loose ends; for example, "out-of-band updates to the server file" are not supported.

If we want to continue with this thread, the code should be cleaned up (see Copilot's comments below).
However, as is, it may be useful to highlight what is happening "under the covers" within the git-lfs client.

ADR-0001: add-url Sentinel Design for Cloud Bucket References

  • Status: (Experimental)
  • Date: 2026-03-26
  • Deciders: Git DRS maintainers
  • Related code: cmd/addurl/*, cloud/head.go

Context

The git drs add-url command was expanded from a narrowly scoped S3-only path into a generalized cloud URL ingestion workflow. The current implementation inspects object metadata via provider-specific HEAD (or equivalent) calls and then writes a local Git LFS pointer plus DRS mapping without downloading file payload bytes.

This design exists to support "reference-style" registration of externally hosted data in cloud buckets where data movement is expensive or undesired.

The design also intentionally aligns with the Git LFS custom transfer agent architecture used by Git DRS. add-url prepares pointer/mapping state so that later transfer operations can resolve and hydrate through the existing custom-transfer path instead of requiring a full data copy during registration.

Decision

We intentionally keep add-url experimental; it is not intended for production-critical workflows at this stage.

1) Experimental scope and responsibilities

add-url now collects bucket object metadata from:

  • S3 (s3://..., S3 HTTPS forms)
  • Google Cloud Storage (gs://..., storage.googleapis.com/...)
  • Azure Blob Storage (az://..., https://<account>.blob.core.windows.net/...)
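The URL forms above can be mapped to a provider with simple scheme/host heuristics. A minimal sketch (the names `Platform` and `DetectPlatform` are illustrative and not necessarily those used in `cloud/head.go`):

```go
package main

import "strings"

// Platform identifies the cloud provider inferred from a URL.
type Platform string

const (
	PlatformS3    Platform = "s3"
	PlatformGCS   Platform = "gcs"
	PlatformAzure Platform = "azure"
	PlatformNone  Platform = ""
)

// DetectPlatform maps a URL to a provider using the scheme and host
// patterns listed above. Unrecognized URLs yield PlatformNone.
func DetectPlatform(url string) Platform {
	switch {
	case strings.HasPrefix(url, "s3://"),
		strings.Contains(url, "s3.amazonaws.com/"):
		return PlatformS3
	case strings.HasPrefix(url, "gs://"),
		strings.Contains(url, "storage.googleapis.com/"):
		return PlatformGCS
	case strings.HasPrefix(url, "az://"),
		strings.Contains(url, ".blob.core.windows.net/"):
		return PlatformAzure
	}
	return PlatformNone
}
```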

The command records metadata and writes a Git LFS pointer, but it does not ingest data bytes into local LFS content storage as the canonical file.

Instead, the client is responsible for loading data directly from the underlying bucket during first real data access.
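For reference, the pointer that add-url writes follows the standard Git LFS pointer format (spec v1: `version`, `oid`, `size` lines). A hedged sketch of rendering one (`pointerContent` is a hypothetical helper, not the PR's `writePointerFile`):

```go
package main

import "fmt"

// pointerContent renders a minimal Git LFS pointer file per spec v1.
// The keys must appear in this order: version, oid, size.
func pointerContent(oid string, size int64) string {
	return fmt.Sprintf(
		"version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %d\n",
		oid, size)
}
```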

2) Sentinel data based on bucket HEAD-derived ETag

When --sha256 is not supplied, add-url computes a synthetic OID as:

  • computed_oid = sha256(<etag string>)

It then writes a local object at the LFS object path containing sentinel bytes (the ETag string itself). This allows Git LFS plumbing to see an object where expected while avoiding immediate full-object download.

In effect, this uses sentinel content to satisfy local Git LFS shape/expectations while deferring true content validation and hydration.

3) Real SHA256 deferred to first-use by client

The command's synthetic OID is an implementation convenience, not proof of file-content integrity.

The real content SHA256 must be computed by the client on first use (i.e., when actual bytes are fetched/streamed from cloud storage) and then treated as the authoritative checksum for integrity-sensitive operations.

4) LFS custom transfer agent remains the hydration path

add-url is a registration path, not a replacement transfer mechanism. The custom transfer agent remains the component responsible for runtime materialization and data movement behavior:

  • add-url creates pointer + DRS mapping state.
  • normal Git/Git LFS operations invoke custom transfer.
  • custom transfer resolves remote metadata/URLs and performs actual byte movement as needed.

This keeps add-url lightweight while preserving consistency with existing Git DRS transfer behavior.

Flow diagram

flowchart TD
    A[User runs git drs add-url <cloud-url>] --> B[cloud.HeadObject<br/>S3 / GCS / Azure]
    B --> C{--sha256 provided?}
    C -- Yes --> D[Use provided SHA256 as OID]
    C -- No --> E["Compute synthetic OID = sha256(ETag)"]
    E --> F[Write sentinel object bytes = ETag<br/>.git/lfs/objects/...]
    D --> G[Write Git LFS pointer file]
    F --> G
    G --> H[Write DRS mapping with remote URL]
    H --> I[Commit/push as normal]
    I --> J[Git LFS custom transfer agent executes on transfer]
    J --> K[Client loads bytes from bucket]
    K --> L[Client computes real SHA256 on first use]

Sequence diagram (add-url registration vs transfer-time hydration)

sequenceDiagram
autonumber
participant U as User
participant A as git drs add-url
participant C as cloud.HeadObject
participant L as Local Git/LFS store
participant M as DRS mapping
participant T as Git LFS custom transfer
participant B as Cloud bucket
 
U->>A: git drs add-url <cloud-url> [--sha256]
A->>C: Inspect object metadata (HEAD/equivalent)
C-->>A: platform, bucket, key, ETag, metadata
 
alt --sha256 supplied
A->>L: Write pointer using provided SHA256
else --sha256 omitted
A->>A: syntheticOID = sha256(ETag)
A->>L: Write sentinel object bytes = ETag
A->>L: Write pointer using syntheticOID
end
 
A->>M: Write DRS mapping (URL + checksums metadata)
A-->>U: Registration complete (no full payload download)
 
U->>T: git lfs pull / checkout / transfer-triggering operation
T->>M: Resolve remote URL + object metadata
T->>B: Fetch object bytes on demand
B-->>T: Object stream
T->>T: Compute authoritative SHA256 on first use
T->>L: Hydrate local content + integrity state

Consequences

Positive

  • Avoids immediate full data transfer during add-url.
  • Allows low-latency registration of bucket-hosted objects.
  • Supports multi-cloud bucket metadata discovery via a common flow.

Negative / Risks

  • Synthetic OID can be confused with true content digest if callers are not explicit.
  • ETag semantics vary across providers/configurations (multipart uploads, encryption modes, transformations), so ETag-derived values are not universally content hashes.
  • Tooling assumptions that all LFS OIDs are true file SHA256 values do not hold for this path.
  • If consumers bypass the custom transfer path, first-use checksum correction/hydration guarantees may not be applied consistently.

Guardrails

  • Treat add-url as experimental, with explicit user-facing caveats.
  • Preserve ability to pass --sha256 when authoritative content hash is already known.
  • Document that first-use client download must compute and persist/propagate true SHA256 metadata.

Rejected alternatives

  1. Always require explicit --sha256 and fail otherwise.

    • Rejected because many reference workflows know location before checksum.
  2. Always download full object at add-url time to compute true SHA256.

    • Rejected due to cost/latency and because it defeats the primary goal of reference-style registration.
  3. Store only pointer metadata and no local LFS object artifact.

    • Rejected because existing Git LFS expectations are easier to satisfy with a sentinel object layout.

Rollout / follow-up

  • Keep this behavior behind clear documentation and release notes indicating experimental status.
  • Add/expand end-to-end tests for provider ETag corner cases.
  • Define an explicit metadata field/state to distinguish synthetic OID from validated content SHA256.

Copilot AI review requested due to automatic review settings March 26, 2026 00:08
Contributor

Copilot AI left a comment


Pull request overview

This PR is an experimental/WIP update that extends git drs add-url and the Git LFS custom transfer flow to support “remote URL” objects by using cloud HEAD metadata (ETag) to create sentinel LFS objects, plus new cloud-provider URL detection and download support.

Changes:

  • Add cloud URL HEAD support for S3/GCS/Azure (platform detection + normalized metadata) and a generalized downloader for DRS objects.
  • Rework add-url to accept generic cloud URLs and record remote-url aliases/checksums, writing a sentinel LFS object and pointer file.
  • Update transfer/prepush/drsmap behavior and tests/scripts to accommodate remote-url objects and aliases.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 12 comments.

File Description
tests/coverage-test.sh Extends integration script to exercise add-url and remote-url behavior.
lfs/store_test.go Updates test fixture to include aliases in stored DRS objects.
lfs/lfs.go Adjusts dry-run parsing logic (pointer-file detection currently commented out).
go.mod Bumps indirect deps and adds a local replace (currently problematic).
go.sum Dependency checksum updates aligned with module changes.
drsmap/drs_map_test.go Updates WriteDrsFile call signature in tests.
drsmap/drs_map.go Adds logging and extends WriteDrsFile to merge aliases/checksums; adjusts object selection logic.
cmd/transfer/main.go Adds remote-url download path handling and sentinel/real-object bookkeeping.
cmd/prepush/main.go Adds additional discovery logging and normalizes cached OID formatting.
cmd/addurl/service.go Reworks add-url workflow to use cloud HEAD + sentinel LFS object + aliases/checksums.
cmd/addurl/main_test.go Updates add-url tests for ETag-based sentinel behavior; adds ensureAllowIncompletePush tests.
cmd/addurl/main.go Changes CLI to add-url <cloud-url> [path] and removes AWS-specific flags (now env-driven via cloud).
cmd/addurl/io.go Updates resolved-info output to show normalized remote object metadata.
cloud/head.go New: provider detection + HEAD implementations for S3/GCS/Azure.
cloud/head_test.go New: unit tests for platform detection, URL parsing, and HEAD behavior via httptest.
cloud/downloader.go Generalizes download to work across providers (via access URL) and write into LFS objects.
client/indexd/register.go Adds remote-url alias handling to skip upload and delete sentinel LFS object.


Comment thread tests/coverage-test.sh
Comment on lines +335 to +336
echo "test update"
exit 1

Copilot AI Mar 26, 2026


This test script currently hard-fails with exit 1, which will break CI/integration runs. Please remove the debugging echo/exit 1 and complete or gate the remaining update test logic.

Comment thread lfs/lfs.go
Comment on lines +173 to +181
//if size > 0 && size < 2048 {
// if data, readErr := os.ReadFile(absPath); readErr == nil {
// s := strings.TrimSpace(string(data))
// if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
// logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
// continue
// }
// }
//}

Copilot AI Mar 26, 2026


The LFS pointer-file detection safeguard has been commented out. Without this check, the dry-run parser can treat small pointer files as upload candidates, which risks registering/uploading pointer contents instead of the real LFS object. Restore this guard (or replace it with an equivalent, tested mechanism) so pointer files are skipped reliably.

Suggested change
//if size > 0 && size < 2048 {
// if data, readErr := os.ReadFile(absPath); readErr == nil {
// s := strings.TrimSpace(string(data))
// if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
// logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
// continue
// }
// }
//}
if size > 0 && size < 2048 {
	if data, readErr := os.ReadFile(absPath); readErr == nil {
		s := strings.TrimSpace(string(data))
		if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
			logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))
			continue
		}
	}
}

Comment thread cmd/addurl/service.go
}

if err := writePointerFile(input.path, oid, s3Info.SizeBytes); err != nil {
if err := writePointerFile(pathArg, oid, int64(len(objMeta.ETag))); err != nil {

Copilot AI Mar 26, 2026


writePointerFile is being called with len(objMeta.ETag) as the LFS pointer size. The pointer size field is supposed to reflect the real object size (for UX/progress and some tooling), and objMeta.SizeBytes is already available. Please pass the remote object's byte size here instead of ETag length.

Suggested change
if err := writePointerFile(pathArg, oid, int64(len(objMeta.ETag))); err != nil {
if err := writePointerFile(pathArg, oid, objMeta.SizeBytes); err != nil {

Comment thread cmd/addurl/service.go
Comment on lines 126 to 129
file := lfs.LfsFileInfo{
Name: input.path,
Size: s3Info.SizeBytes,
Name: pathArg,
Size: int64(len(objMeta.ETag)),
Oid: oid,

Copilot AI Mar 26, 2026


The DRS/LFS file record is being built with Size: int64(len(objMeta.ETag)). This should be the remote object size (objMeta.SizeBytes), not the length of the ETag string, otherwise downstream logic (pointer generation, index metadata, progress reporting) will be incorrect.

Comment thread drsmap/drs_map.go
opts.Logger.Error(fmt.Sprintf("Could not build DRS object for %s OID %s %v", file.Name, file.Oid, err))
continue

authoritativeObj, _ := DrsInfoFromOid(file.Oid)

Copilot AI Mar 26, 2026


DrsInfoFromOid errors are being ignored here. If the local DRS object file exists but is corrupted/unreadable, this will silently fall back to building a new object and can mask data issues. Handle the error explicitly (e.g., ignore only os.IsNotExist-like cases, otherwise log/return).

Suggested change
authoritativeObj, _ := DrsInfoFromOid(file.Oid)
authoritativeObj, err := DrsInfoFromOid(file.Oid)
if err != nil {
	if errors.Is(err, os.ErrNotExist) {
		opts.Logger.Debug(fmt.Sprintf("No existing DRS object for OID %s: %v", file.Oid, err))
	} else {
		opts.Logger.Error(fmt.Sprintf("Failed to read existing DRS object for OID %s: %v", file.Oid, err))
		continue
	}
}

Comment thread cmd/transfer/main.go
Comment on lines +232 to +234
} else {
computedSHA, dstPath, err := cloud.Download(ctx, matchingRecord)
if err != nil {

Copilot AI Mar 26, 2026


Inside the remote-URL branch, computedSHA, dstPath, err := cloud.Download(...) uses := and shadows the outer dstPath/err. As a result, after this block the handler will still use the old dstPath when writing the completion message, and any error from cloud.Download can be masked outside the block. Use assignment (=) and avoid shadowing so the final dstPath reflects what was actually written.

Comment thread cmd/transfer/main.go
Comment on lines +256 to +269
// the file should already exist as a pointer
if _, err := os.Stat(matchingRecord.Name); err != nil {
errMsg := fmt.Sprintf("stat %s: %v", matchingRecord.Name, err)
logger.ErrorContext(ctx, errMsg)
lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
continue
}
// write the 'real file' to the expected location in the .git/lfs/objects directory
if err := os.Rename(dstPath, matchingRecord.Name); err != nil {
errMsg := fmt.Sprintf("rename %s %s: %v", dstPath, matchingRecord.Name, err)
logger.ErrorContext(ctx, errMsg)
lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
continue
}

Copilot AI Mar 26, 2026


This rename moves the downloaded LFS object (dstPath, which is in .git/lfs/objects/...) to matchingRecord.Name (which is the worktree path recorded in the DRS object). The transfer process should write into the path requested by Git LFS (typically under .git/lfs/objects / the downloadMsg.Path), not overwrite the worktree pointer file. This will also make the dstPath reported in the completion message incorrect because the file no longer exists at that location.

Suggested change
// the file should already exist as a pointer
if _, err := os.Stat(matchingRecord.Name); err != nil {
errMsg := fmt.Sprintf("stat %s: %v", matchingRecord.Name, err)
logger.ErrorContext(ctx, errMsg)
lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
continue
}
// write the 'real file' to the expected location in the .git/lfs/objects directory
if err := os.Rename(dstPath, matchingRecord.Name); err != nil {
errMsg := fmt.Sprintf("rename %s %s: %v", dstPath, matchingRecord.Name, err)
logger.ErrorContext(ctx, errMsg)
lfs.WriteErrorMessage(streamEncoder, downloadMsg.Oid, 502, errMsg)
continue
}
// leave the downloaded object at dstPath (under .git/lfs/objects) as requested by Git LFS;
// do not overwrite the worktree pointer file at matchingRecord.Name.

Comment thread cmd/transfer/main.go
} else {
computedSHA, dstPath, err := cloud.Download(ctx, matchingRecord)
if err != nil {
errMsg := fmt.Sprintf("Error downloading file from remote url %s oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)

Copilot AI Mar 26, 2026


This log message uses %s with matchingRecord.AccessMethods[0].AccessURL (a struct), which will produce %!s(...) at runtime and is easy to miss. Use matchingRecord.AccessMethods[0].AccessURL.URL (string) or %v if you intend to log the whole struct.

Suggested change
errMsg := fmt.Sprintf("Error downloading file from remote url %s oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)
errMsg := fmt.Sprintf("Error downloading file from remote url %v oid %s (GUID: %s): %v", matchingRecord.AccessMethods[0].AccessURL, downloadMsg.Oid, matchingRecord.Id, err)

Comment thread cmd/prepush/main.go
Comment on lines +257 to +259
logger.Info(fmt.Sprintf("Discovered %d LFS files via dry-run: %v", len(lfsFiles), lfsFiles))

fmt.Fprintln(os.Stderr, fmt.Sprintf("Discovered %d LFS files via dry-run. %v", len(lfsFiles), lfsFiles))

Copilot AI Mar 26, 2026


These lines print the full lfsFiles map (potentially very large) both at Info level and to stderr. This can flood hook output and leak paths/OIDs in normal operation. Consider downgrading to Debug and logging only a summary/count (or truncating the list).

Suggested change
logger.Info(fmt.Sprintf("Discovered %d LFS files via dry-run: %v", len(lfsFiles), lfsFiles))
fmt.Fprintln(os.Stderr, fmt.Sprintf("Discovered %d LFS files via dry-run. %v", len(lfsFiles), lfsFiles))
logger.Info("Discovered LFS files via dry-run", "count", len(lfsFiles))
fmt.Fprintf(os.Stderr, "Discovered %d LFS files via dry-run.\n", len(lfsFiles))

Comment thread cmd/addurl/main_test.go
Comment on lines +126 to +128
//if _, err := os.Stat(lfsObject); err == nil {
// t.Fatalf("Did not expect a LFS object at %s: %v", lfsObject, err)
//}

Copilot AI Mar 26, 2026


Please remove the commented-out test assertion block. Keeping dead/commented code in tests makes intent unclear and tends to rot; either assert the expected behavior or delete the block.

Suggested change
//if _, err := os.Stat(lfsObject); err == nil {
// t.Fatalf("Did not expect a LFS object at %s: %v", lfsObject, err)
//}


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e076dc0561


Comment thread go.mod

//

replace github.com/calypr/data-client => /Users/walsbr/calypr/data-client


P0: Remove workstation-only replace directive from module file

The new replace github.com/calypr/data-client => /Users/walsbr/calypr/data-client hard-codes a developer-local path, so builds on CI or any other machine fail module resolution before tests can run. This makes the repository non-portable and blocks normal go test/go build for everyone except the original workstation.


Comment thread tests/coverage-test.sh
--endpoint-url $ENDPOINT \
--region us-east-1
echo "test update"
exit 1


P1: Remove forced early exit from coverage workflow script

This unconditional exit 1 aborts the script mid-run, so every invocation fails and all subsequent add-url regression checks in this script are skipped. In CI contexts that execute tests/coverage-test.sh, this turns the pipeline red regardless of actual product behavior.


Comment thread lfs/lfs.go
Comment on lines +173 to +177
//if size > 0 && size < 2048 {
// if data, readErr := os.ReadFile(absPath); readErr == nil {
// s := strings.TrimSpace(string(data))
// if strings.Contains(s, "version https://git-lfs.github.com/spec/v1") && strings.Contains(s, "oid sha256:") {
// logger.Warn(fmt.Sprintf("WARNING: Detected upload of lfs pointer file %s skipping", path))


P1: Restore pointer-file filtering in dry-run LFS discovery

Commenting out this guard means pointer files are no longer excluded when the worktree contains raw LFS pointers (for example with smudge disabled). addFilesFromDryRun will then pass pointer paths forward, and RegisterFile uploads bytes from path directly, which can publish pointer text instead of object content under the target OID.

