ADR: Alternate Object Identifier Strategy for Git LFS Without Content SHA256
Status
Proposed
Use Case: Managing Remote-Only Large Files Without Content Hashing
Title
Enable Git LFS workflows for remote-only or very large files without requiring SHA256 content hashing during git add.
Primary Actor
Research data steward / data engineer using Git LFS integrated with DRS.
Scenario
A research team maintains large genomic files (e.g., BAM, CRAM, FASTQ) that are:
- Already stored in an object store (S3 / Ceph / GCS)
- Registered in DRS
- Multi-GB or TB in size
- Not always locally downloaded
The user wants to:
- Reference these files in a Git repository
- Track them via Git LFS
- Maintain reproducibility and metadata linkage
- Avoid computing SHA256 hashes locally (too slow, or impossible when the file is not downloaded)
User Story
As a research data steward managing large, remote DRS-registered files,
I want to add files to a Git LFS repository without computing a full content SHA256 hash,
So that I can efficiently reference remote objects while maintaining compatibility with Git LFS and DRS workflows.
Functional Expectations
- During git add, the clean filter:
  - Does not require downloading or hashing full file contents.
  - Generates a stable alternate object identifier.
- During git lfs push:
  - Remote existence checks use DRS or metadata services.
- During git checkout:
  - Files are resolved via DRS ID.
- No additional metadata files are committed to Git.
- Integrity and deduplication are delegated to DRS.
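The expectations above leave open how the alternate identifier is derived. One possible approach, sketched below under stated assumptions, is to hash a canonical string of remote metadata rather than the file contents, so the result still fits the 64-hex-character oid field of a Git LFS pointer. The function names (alternate_oid, build_pointer, needs_upload), the metadata fields (DRS ID, size, remote ETag), and the example DRS URI are all hypothetical; this ADR does not prescribe any of them.

```python
import hashlib

# Version line used by Git LFS pointer files.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def alternate_oid(drs_id: str, size: int, remote_etag: str = "") -> str:
    """Derive a stable 64-hex identifier from metadata only.

    Hypothetical scheme: the file contents are never read. The same
    metadata always produces the same identifier.
    """
    # Fixed field order keeps the encoding canonical.
    canonical = "\n".join(
        [f"drs_id={drs_id}", f"size={size}", f"etag={remote_etag}"]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_pointer(drs_id: str, size: int, remote_etag: str = "") -> str:
    """Build pointer text such as a clean filter could emit during git add."""
    oid = alternate_oid(drs_id, size, remote_etag)
    # Assumption: the oid keeps the "sha256:" prefix for pointer-format
    # compatibility even though it is not a content hash.
    return f"version {LFS_SPEC_URL}\noid sha256:{oid}\nsize {size}\n"


def needs_upload(oid: str, registered_oids: set) -> bool:
    """Stand-in for the remote existence check made during git lfs push.

    Assumption: registered_oids represents identifiers already known to
    DRS or a metadata service; a real check would query that service.
    """
    return oid not in registered_oids


if __name__ == "__main__":
    # Hypothetical DRS URI and size (5 GiB), for illustration only.
    drs_uri = "drs://example.org/hypothetical-object-id"
    size_bytes = 5 * 1024**3
    print(build_pointer(drs_uri, size_bytes), end="")
    oid = alternate_oid(drs_uri, size_bytes)
    print("upload needed:", needs_upload(oid, {oid}))
```

Because the identifier in this sketch is not a content hash, any client-side step that rehashes downloaded bytes and compares them to the oid would not match, which is consistent with delegating integrity to DRS as stated above.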
Acceptance Criteria
- git add on remote-managed large files completes without content hashing delays.
- git lfs push does not attempt redundant uploads when DRS already contains the object.
- git checkout correctly restores files via DRS resolution.
Business / Architectural Value
- Eliminates expensive SHA256 operations on large files.
- Enables remote-first, metadata-addressable architecture.
- Aligns Git workflows with DRS and Indexd object identity.
- Supports scalable genomics and bioinformatics data management.