Clean up git history to shrink repo size #76

Open
PatrickRMiles wants to merge 43 commits into LBANN:main from PatrickRMiles:miles30/shrink_gitobjects


@PatrickRMiles

Collaborator

Removes ~1.2GB of accidentally committed .npy files and virtual environment binaries from the git history using git filter-branch, and updates .gitignore so these files cannot be committed again. Repository size drops from 1.2GB to 12MB.
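
The description above names git filter-branch but does not give the exact invocation. The following is a minimal sketch of how such a cleanup is commonly assembled, not the commands actually run for this PR. The helper name `build_cleanup_commands` and the example patterns `*.npy` and `.venv` are assumptions for illustration. The script only prints the command lines and does not touch any repository.

```python
import shlex


def build_cleanup_commands(patterns):
    """Return shell command lines that purge `patterns` from every commit.

    Hypothetical helper: it builds the commands but never executes them.
    """
    if not patterns:
        raise ValueError("at least one path pattern is required")

    # Runs once per rewritten commit against the index only (no checkout).
    # --ignore-unmatch keeps commits that never held these paths from failing.
    index_filter = shlex.join(
        ["git", "rm", "-r", "--cached", "--ignore-unmatch", "--", *patterns]
    )

    rewrite = shlex.join([
        "git", "filter-branch", "--force",
        "--index-filter", index_filter,
        "--prune-empty",             # drop commits left empty by the removal
        "--tag-name-filter", "cat",  # keep tag names, point them at new commits
        "--", "--all",               # rewrite every ref, not just HEAD
    ])

    return [
        rewrite,
        # filter-branch keeps backups under refs/original; delete them so the
        # old objects become unreachable.
        'git for-each-ref --format="%(refname)" refs/original/ '
        "| xargs -n 1 git update-ref -d",
        # Expire reflog entries that still reference the old commits.
        "git reflog expire --expire=now --all",
        # Repack and prune the now-unreachable blobs to shrink .git.
        "git gc --prune=now --aggressive",
    ]


if __name__ == "__main__":
    # Assumed patterns; the PR does not list the exact paths it removed.
    for command in build_cleanup_commands(["*.npy", ".venv"]):
        print(command)
```

Because every commit hash changes after such a rewrite, the rewritten branches need a force push and existing clones need to be re-cloned or hard-reset. The git documentation also points to git filter-repo as a faster alternative to filter-branch.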

michaelmckinsey1 and others added 30 commits January 22, 2026 09:04
* Update pyproject.toml

* Update requirements.txt
* fix cli bug: must recalculate unet_layers in CLI since problem_scale can be overwritten without changing config object unet_layers

* whitespace

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* Continue if checkpointing fails

* Fixes for new distconv

* enable running 1 epoch

* Update trainer.py

* Update trainer.py
* set checkpoint interval

* truncate stats csv when loading from checkpoint if checkpoint is behind latest CSV entries

* lint

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* add -l to torchrun-hpc restart command, preventing nested dir creation; also simplify restart script to more closely match default run method like ScaFFold/scripts/scaffold-tuolumne.job

* give restarting torchrun the full path to the existing dir

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* init

* debug

* testing

* Enable configuring n_categories

* set checkpoint interval

* cleanup

* lint

* Update trainer.py

* Create benchmark_testing.yml
* Update versions and pin distconv & ccl. Add separate install for pypi

* Ensure libfabric

* Don't need spindle off anymore

* Enforce cray-mpich 9.1.0

* patch all so files
* cd back after done

* Do not set var. This will be set by hpclauncher
* Add num-shards to cli

* lint

* Update cli.py
* remove open3d dependency

* ruff

* comments

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* update config with 3D num_shards and shard_dim

* update config util to expect 3D num_shards and shard_dim, add helper to deal with 1D inputs

* worker no longer needs to modify distconv params set in config -- just pass them as-is to the ParallelStrategy call

* implement multi-dimensional sharding for distconv

* update distconv param name scheme

* fix loss calc

* add sharded dice loss calculation to dice score util

* update evaluate to use sharded dice loss calc

* update trainer to use new evaluate; other small fixes/tweaks

* fix assert

* fix naming

* fix naming

* better default values

* missing import

* update distconv param names, default vals

* use np.prod instead of math.prod

* ruff

* import math

* warmup logging and timing

* import math for prod

* Add missing import

* Remove extra func make cli arg tuple

* lint

* lint

* Update configs

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>
* fix unet bottleneck dim off by 1 error

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* Update scaffold-tuolumne.job

* Update scaffold-tuolumne-torchpypi.job

* Update scaffold-tuolumne.job

* Update scaffold-tuolumne-torchpypi.job

* Update scaffold-tuolumne-torchpypi.job

* Update scaffold-tuolumne.job

* Update scaffold-tuolumne-torchpypi.job

* Update scaffold-tuolumne.job
…e clipping (LBANN#40)

* apply optimizer every batch, not every epoch; unscale gradients before clipping

* trainer tweaks

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
…n trainer class (LBANN#43)

* apply optimizer every batch, not every epoch; unscale gradients before clipping

* trainer tweaks

* apply optimizer every batch, not every epoch; unscale gradients before clipping

* extract warmup to separate method; switch to warming up set number of batches (user configurable)

* whitespace; num_workers revert

* ruff

* make parallelstrategy, spatial_mesh, ddp_placements attrs of trainer; other small tweaks

* remove deprecated config attrs

* ruff

* get device mesh from ps class attr

* ruff

* missing self. on some ps accesses

* Fix imports and missing self.ps

* rm legacy warmup_epochs

* Move attributes to base class for clarity

* remove warmup_epochs -- not useful to keep support for this

* call cleanup_or_resume trainer method directly

* rm unused vars

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>
- make dataloader num_workers user-configurable
- shift dataloader preprocessing work into dataset generation for speedup, maintaining support for old datasets
- data_loading.py: restore .contiguous() and dtype cast calls, but change order to avoid redundant copies
* Fix check

* Fix flake
* Calculate local CE loss w/o AMP to prevent inf from f16 overflow

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* Warmup evaluation

* cleanup
* enable option to never checkpoint and make default

* Update config_utils.py
* check for nan dice_score_train

* lint
* move to sharded data loading

* bug fixes

* ruff

* restore missing import

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
michaelmckinsey1 and others added 13 commits April 23, 2026 14:59
* bf16 and more fp32 sections for dice

* Refactor

* ruff

* fix merge artifact

* Update trainer.py

* Update trainer.py

* Update trainer.py

* Refactor

* mv .item()
* bf16 and more fp32 sections for dice

* Refactor

* ruff

* fix merge artifact

* cosine sched

* Validation needs smaller batch size otherwise val_dice can be 0.

* config

* Update benchmark_default.yml

* lint

* undo val_batch_size

* Fix dtypes

* README
* bf16 and more fp32 sections for dice

* Refactor

* ruff

* fix merge artifact

* cosine sched

* Validation needs smaller batch size otherwise val_dice can be 0.

* config

* Update benchmark_default.yml

* fix fractal base dir

* fix config

* lint

* Update get_dataset.py
* use class weights in CE loss to make background less dominant; calc weights at trainer init

* ruff

* missing import

* ruff

* ruff

* fix missing volume dtype

* remove default ce_weight_num_samples

Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>

* remove default ce_weight_num_samples in trainer

Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>

* move ce loss helpers to losses.py

* sample by fraction of total rather than hard number

* ruff

---------

Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>
* working rocm/7.2.1

* Change wheel source and use torch2.12

* Update install-tuolumne-torchpypi.sh

* Update scaffold-tuolumne-torchpypi.job
* Add adiak metadata

* Update worker.py
* fix dtypes for torch

* Add per minibatch timer

* cleanup
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
…armup (LBANN#68)

* Use snapshot to prevent warmup from influencing training

* Fix validation warmup and increase default warmup

* Refactor shared trainer logic in warmup

* better default


2 participants