Clean up git history to shrink repo size by PatrickRMiles · Pull Request #76 · LBANN/ScaFFold

PatrickRMiles · 2026-06-10T22:22:57Z

Removes ~1.2GB of accidentally committed .npy files and virtual environment binaries from git history using git filter-branch, and updates .gitignore so these files are not committed again. Repository size drops from 1.2GB to 12MB.
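For reviewers who want to check which objects dominate a repository before or after a rewrite like this, here is a minimal sketch. It is not the procedure used in this PR (which relied on git filter-branch); it only inspects history. It assumes git is on the PATH, and the function names `largest_blobs` and `scan_repo` are hypothetical.

```python
import subprocess


def largest_blobs(batch_check_output: str, top: int = 10) -> list[tuple[int, str]]:
    """Return the `top` largest blobs as (size_in_bytes, path) pairs.

    Expects lines of the form "<type> <sha> <size> <path>", as produced by
    the --batch-check format string used in scan_repo() below.
    """
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)  # maxsplit keeps spaces inside the path
        if len(parts) < 4 or parts[0] != "blob":
            continue  # skip commits, trees, and malformed lines
        _type, _sha, size, path = parts
        blobs.append((int(size), path))
    return sorted(blobs, reverse=True)[:top]


def scan_repo(repo_dir: str = ".") -> str:
    """List every object reachable from any ref, with its type and size."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
```

Calling `largest_blobs(scan_repo("."))` inside a clone would list the ten largest files ever committed, which is a quick way to confirm that no large .npy files or virtual environment binaries remain reachable.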

* add -l to torchrun-hpc restart command, preventing nested dir creation; also simplify restart script to more closely match default run method like ScaFFold/scripts/scaffold-tuolumne.job * give restarting torchrun the full path to the existing dir * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

* update config with 3D num_shards and shard_dim
* update config util to expect 3D num_shards and shard_dim, add helper to deal with 1D inputs
* worker no longer needs to modify distconv params set in config -- just pass them as-is to the ParallelStrategy call
* implement multi-dimensional sharding for distconv
* update distconv param name scheme
* fix loss calc
* add sharded dice loss calculation to dice score util
* update evaluate to use sharded dice loss calc
* update trainer to use new evaluate; other small fixes/tweaks
* fix assert
* fix naming
* fix naming
* better default values
* missing import
* update distconv param names, default vals
* use np.prod instead of math.prod
* ruff
* import math
* warmup logging and timing
* import math for prod
* Add missing import
* Remove extra func make cli arg tuple
* lint
* lint
* Update configs
---------
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>
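The "helper to deal with 1D inputs" mentioned above is not shown on this page. As a rough illustration only, a helper of that kind might look like the sketch below. The names `as_tuple` and `normalize_sharding`, the validation rules, and the returned total shard count are all assumptions, not the ScaFFold implementation.

```python
import math


def as_tuple(value):
    """Wrap a bare int in a 1-tuple; convert any other sequence to a tuple."""
    if isinstance(value, int):
        return (value,)
    return tuple(value)


def normalize_sharding(num_shards, shard_dim):
    """Normalize scalar or multi-dimensional sharding settings (hypothetical).

    Returns (num_shards, shard_dim, total_shards), where the first two are
    tuples of equal length and total_shards is the product of num_shards.
    """
    shards = as_tuple(num_shards)
    dims = as_tuple(shard_dim)
    if len(shards) != len(dims):
        raise ValueError("num_shards and shard_dim must have the same length")
    if len(set(dims)) != len(dims):
        raise ValueError("each tensor dimension may be sharded at most once")
    if any(n < 1 for n in shards):
        raise ValueError("every entry of num_shards must be >= 1")
    return shards, dims, math.prod(shards)
```

Under these assumptions, `normalize_sharding(4, 2)` and `normalize_sharding([2, 2, 1], [2, 3, 4])` both describe four shards in total, so downstream code can treat the 1D and 3D cases uniformly.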

* Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job

…n trainer class (LBANN#43)
* apply optimizer every batch, not every epoch; unscale gradients before clipping
* trainer tweaks
* apply optimizer every batch, not every epoch; unscale gradients before clipping
* extract warmup to separate method; switch to warming up set number of batches (user configurable)
* whitespace; num_workers revert
* ruff
* make parallelstrategy, spatial_mesh, ddp_placements attrs of trainer; other small tweaks
* remove deprecated config attrs
* ruff
* get device mesh from ps class attr
* ruff
* missing self. on some ps accesses
* Fix imports and missing self.ps
* rm legacy warmup_epochs
* Move attributes to base class for clarity
* remove warmup_epochs -- not useful to keep support for this
* call cleanup_or_resume trainer method directly
* rm unused vars
---------
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>

- make dataloader num_workers user-configurable - shift dataloader preprocessing work into dataset generation for speedup, maintaining support for old datasets - data_loading.py: restore .contiguous() and dtype cast calls, but change order to avoid redundant copies

* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * cosine sched * Validation needs smaller batch size otherwise val_dice can be 0. * config * Update benchmark_default.yml * lint * undo val_batch_size * Fix dtypes * README

* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * cosine sched * Validation needs smaller batch size otherwise val_dice can be 0. * config * Update benchmark_default.yml * fix fractal base dir * fix config * lint * Update get_dataset.py

* use class weights in CE loss to make background less dominant; calc weights at trainer init
* ruff
* missing import
* ruff
* ruff
* fix missing volume dtype
* remove default ce_weight_num_samples Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>
* remove default ce_weight_num_samples in trainer Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>
* move ce loss helpers to losses.py
* sample by fraction of total rather than hard number
* ruff
---------
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>
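To make the idea of class-weighted cross-entropy concrete, the sketch below estimates per-class weights from a sampled fraction of label volumes. It uses inverse-frequency weighting, which is one common choice and may differ from what losses.py actually does. The function name `ce_class_weights`, its parameters, and the flat-list representation of a label volume are assumptions made for illustration.

```python
import random
from collections import Counter


def ce_class_weights(label_volumes, n_classes, sample_fraction=0.1, seed=0):
    """Estimate inverse-frequency class weights from a fraction of the data.

    label_volumes: list of flat sequences of integer class labels
                   (a stand-in for real 3D label arrays).
    Returns a list of n_classes floats; classes never seen get weight 0.0.
    """
    rng = random.Random(seed)
    # Sample by fraction of the dataset rather than by a fixed count.
    k = max(1, round(sample_fraction * len(label_volumes)))
    sampled = rng.sample(label_volumes, k)

    counts = Counter()
    for volume in sampled:
        counts.update(volume)
    total = sum(counts.values())

    # weight_c = total / (n_classes * count_c): rare classes get larger weights.
    return [
        total / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]
```

With a volume that is 75% background (class 0) and 25% foreground (class 1), this yields a weight of about 0.67 for background and 2.0 for foreground, so the dominant background class contributes less per voxel to the loss.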

michaelmckinsey1 and others added 30 commits January 22, 2026 09:04

Changes required to make distconv a dependency (#1)

47b139c

Ensure hpc-launcher@1.0.4 is used (#5)

f6d120c

* Update pyproject.toml * Update requirements.txt

Fix model size mismatch on restart (LBANN#9)

42b7e7e

* fix cli bug: must recalculate unet_layers in CLI since problem_scale can be overwritten without changing config object unet_layers * whitespace * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

Fix Minor Bugs Discovered During Testing (LBANN#7)

55f1c36

* Continue if checkpointing fails * Fixes for new distconv * enable running 1 epoch * Update trainer.py * Update trainer.py

Restore channels_last_3d

20d7575

Leverage a Checkpoint Interval to Speed Up Training (LBANN#12)

f8b657d

* set checkpoint interval * truncate stats csv when loading from checkpoint if checkpoint is behind latest CSV entries * lint --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
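The CSV truncation described in this commit can be sketched as follows. This is an illustration under assumptions, not the trainer's code: the function name `truncate_stats_csv` and the `epoch` column name are hypothetical, and the real stats file may be laid out differently.

```python
import csv


def truncate_stats_csv(csv_path, checkpoint_epoch, epoch_column="epoch"):
    """Drop rows logged after the checkpoint so a resumed run does not
    duplicate epochs. Returns the number of rows kept."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        if fieldnames is None:  # empty file: nothing to truncate
            return 0
        rows = [
            row for row in reader
            if int(row[epoch_column]) <= checkpoint_epoch
        ]

    # Rewrite the file with only the rows at or before the checkpoint.
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

If checkpoints are written every N epochs but stats are appended every epoch, a crash can leave up to N-1 rows past the last checkpoint. Trimming them on restart keeps the CSV consistent with the restored model state.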

Set Default Behavior to Stop Training Upon Convergence (LBANN#16)

fe31fab

* init * debug * testing * Enable configuring n_categories * set checkpoint interval * cleanup * lint * Update trainer.py * Create benchmark_testing.yml

Update to torch=2.10 and rocm=7.1 and Pin Versions (LBANN#17)

6a06474

* Update versions and pin distconv & ccl. Add separate install for pypi * Ensure libfabric * Don't need spindle off anymore * Enforce cray-mpich 9.1.0 * patch all so files

Remove redundant variable already set by hpc-launcher (LBANN#21)

7b46b1b

* cd back after done * Do not set var. This will be set by hpclauncher

Add num-shards and epochs to cli (LBANN#22)

8992d5a

* Add num-shards to cli * lint * Update cli.py

remove unet bottleneck dim from dataset params used to generate unique hash ID (LBANN#26)

6319962

remove open3d dependency (LBANN#25)

af04705

* remove open3d dependency * ruff * comments --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

fix .ply -> .npy (LBANN#30)

61c2c62

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

7.1.1 replacing 7.1.0 (LBANN#34)

20b2285

Accurately report finish criteria

7226c8a

fix unet bottleneck dim off by 1 error (LBANN#29)

87bd3d7

* fix unet bottleneck dim off by 1 error * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

Apply optimizer every batch, not every epoch; unscale gradients before clipping (LBANN#40)

e6856f1

* apply optimizer every batch, not every epoch; unscale gradients before clipping * trainer tweaks --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

Fix ruff check (LBANN#47)

adbb812

* Fix check * Fix flake

Fix inf val loss in early epochs (LBANN#50)

8db5f39

* Calculate local CE loss w/o AMP to prevent inf from f16 overflow * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
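As background on why half precision can yield inf here: IEEE float16 cannot represent finite values above 65504, so a large intermediate loss term overflows. The sketch below is not code from this PR. It uses Python's standard struct module, whose "e" (half-precision) format raises OverflowError where GPU arithmetic would instead produce inf. The helper name `fits_in_fp16` is hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x: float) -> bool:
    """Return True if a finite x can be stored as a finite float16."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False
```

For example, `fits_in_fp16(65504.0)` is True while `fits_in_fp16(70000.0)` is False, which is why computing the loss in float32 (or bf16, with its wider exponent range) avoids the overflow.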

Warmup evaluation step (LBANN#49)

6bd9e14

* Warmup evaluation * cleanup

Add configuration option to disable checkpointing (LBANN#48)

04c1b4a

* enable option to never checkpoint and make default * Update config_utils.py

Enforce samples are not repeated (LBANN#55)

ac98bbc

Exit if dice score is NaN (LBANN#54)

a36fb3a

* check for nan dice_score_train * lint

Move to sharded data loading (LBANN#52)

d6b7b64

* move to sharded data loading * bug fixes * ruff * restore missing import * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

Missing Import (LBANN#57)

f66fff1

michaelmckinsey1 and others added 13 commits April 23, 2026 14:59

Consistently ignore background class in loss/dice (LBANN#59)

3194f6c

* ignore background class * .item()

Improve AMP stability (LBANN#60)

699cf5f

* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * Update trainer.py * Update trainer.py * Update trainer.py * Refactor * mv .item()

fix dtypes for torch (LBANN#65)

1db5916

Enable rocm/7.2.1 (LBANN#67)

d3f386a

* working rocm/7.2.1 * Change wheel source and use torch2.12 * Update install-tuolumne-torchpypi.sh * Update scaffold-tuolumne-torchpypi.job

Add metadata (LBANN#70)

2e8a160

* Add adiak metadata * Update worker.py

Enable timing minibatch (LBANN#66)

b362f33

* fix dtypes for torch * Add per minibatch timer * cleanup

fix restart epoch bug in trainer (LBANN#72)

362282f

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>

Update scaffold-tuolumne.job (LBANN#74)

5231a8f

Use snapshot to prevent warmup from affecting training and refactor warmup (LBANN#68)

55292fd

* Use snapshot to prevent warmup from influencing training * Fix validation warmup and increase default warmup * Refactor shared trainer logic in warmup * better default

shrink gitobjects and update gitignore

bf7a135

PatrickRMiles wants to merge 43 commits into LBANN:main from PatrickRMiles:miles30/shrink_gitobjects
