Clean up git history to shrink repo size #76
Open
PatrickRMiles wants to merge 43 commits into
* fix cli bug: must recalculate unet_layers in CLI since problem_scale can be overwritten without changing config object unet_layers * whitespace * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* Continue if checkpointing fails * Fixes for new distconv * enable running 1 epoch * Update trainer.py * Update trainer.py
* set checkpoint interval * truncate stats csv when loading from checkpoint if checkpoint is behind latest CSV entries * lint --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* add -l to torchrun-hpc restart command, preventing nested dir creation; also simplify restart script to more closely match default run method like ScaFFold/scripts/scaffold-tuolumne.job * give restarting torchrun the full path to the existing dir * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* init * debug * testing * Enable configuring n_categories * set checkpoint interval * cleanup * lint * Update trainer.py * Create benchmark_testing.yml
* Update versions and pin distconv & ccl. Add separate install for pypi * Ensure libfabric * Don't need spindle off anymore * Enforce cray-mpich 9.1.0 * patch all so files
* cd back after done * Do not set var. This will be set by hpclauncher
* remove open3d dependency * ruff * comments --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* update config with 3D num_shards and shard_dim * update config util to expect 3D num_shards and shard_dim, add helper to deal with 1D inputs * worker no longer needs to modify distconv params set in config -- just pass them as-is to the ParallelStrategy call * implement multi-dimensional sharding for distconv * update distconv param name scheme * fix loss calc * add sharded dice loss calculation to dice score util * update evaluate to use sharded dice loss calc * update trainer to use new evaluate; other small fixes/tweaks * fix assert * fix naming * fix naming * better default values * missing import * update distconv param names, default vals * use np.prod instead of math.prod * ruff * import math * warmup logging and timing * import math for prod * Add missing import * Remove extra func make cli arg tuple * lint * lint * Update configs --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov> Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>
* fix unet bottleneck dim off by 1 error * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job * Update scaffold-tuolumne-torchpypi.job * Update scaffold-tuolumne.job
…e clipping (LBANN#40) * apply optimizer every batch, not every epoch; unscale gradients before clipping * trainer tweaks --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
…n trainer class (LBANN#43) * apply optimizer every batch, not every epoch; unscale gradients before clipping * trainer tweaks * apply optimizer every batch, not every epoch; unscale gradients before clipping * extract warmup to separate method; switch to warming up set number of batches (user configurable) * whitespace; num_workers revert * ruff * make parallelstrategy, spatial_mesh, ddp_placements attrs of trainer; other small tweaks * remove deprecated config attrs * ruff * get device mesh from ps class attr * ruff * missing self. on some ps accesses * Fix imports and missing self.ps * rm legacy warmup_epochs * Move attributes to base class for clarity * remove warmup_epochs -- not useful to keep support for this * call cleanup_or_resume trainer method directly * rm unused vars --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov> Co-authored-by: Michael McKinsey <michaelmckinsey1@gmail.com>
- make dataloader num_workers user-configurable
- shift dataloader preprocessing work into dataset generation for speedup, maintaining
support for old datasets
- data_loading.py: restore .contiguous() and dtype cast calls, but change order to avoid
redundant copies
* Warmup evaluation * cleanup
* enable option to never checkpoint and make default * Update config_utils.py
* check for nan dice_score_train * lint
* move to sharded data loading * bug fixes * ruff * restore missing import * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
* ignore background class * .item()
* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * Update trainer.py * Update trainer.py * Update trainer.py * Refactor * mv .item()
* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * cosine sched * Validation needs smaller batch size otherwise val_dice can be 0. * config * Update benchmark_default.yml * lint * undo val_batch_size * Fix dtypes * README
* bf16 and more fp32 sections for dice * Refactor * ruff * fix merge artifact * cosine sched * Validation needs smaller batch size otherwise val_dice can be 0. * config * Update benchmark_default.yml * fix fractal base dir * fix config * lint * Update get_dataset.py
* use class weights in CE loss to make background less dominant; calc weights at trainer init * ruff * missing import * ruff * ruff * fix missing volume dtype * remove default ce_weight_num_samples Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com> * remove default ce_weight_num_samples in trainer Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com> * move ce loss helpers to losses.py * sample by fraction of total rather than hard number * ruff --------- Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov> Co-authored-by: Michael McKinsey <MichaelMcKinsey1@gmail.com>
* working rocm/7.2.1 * Change wheel source and use torch2.12 * Update install-tuolumne-torchpypi.sh * Update scaffold-tuolumne-torchpypi.job
* Add adiak metadata * Update worker.py
* fix dtypes for torch * Add per minibatch timer * cleanup
Co-authored-by: Patrick Miles <miles30@tioga.llnl.gov>
…armup (LBANN#68) * Use snapshot to prevent warmup from influencing training * Fix validation warmup and increase default warmup * Refactor shared trainer logic in warmup * better default
Removes ~1.2 GB of accidentally committed .npy files and virtual-environment binaries from git history using git filter-branch, and updates .gitignore so these files are not committed again. Repository size drops from 1.2 GB to 12 MB.
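
For reviewers who want to see what a rewrite like this can look like, here is a minimal sketch that only builds the `git filter-branch` command line and does not run it. The helper name `build_filter_branch_cmd` and the pathspecs `*.npy` and `venv/` are assumptions for illustration; the PR does not list the exact paths or flags that were used.

```python
import shlex


def build_filter_branch_cmd(paths):
    """Return the argv for a history rewrite that drops `paths` from every commit."""
    # Quote each pathspec so the shell that git spawns for the index filter
    # passes globs such as *.npy to git unexpanded.
    quoted = " ".join(shlex.quote(p) for p in paths)
    index_filter = f"git rm -r --cached --ignore-unmatch {quoted}"
    return [
        "git", "filter-branch", "--force",
        "--index-filter", index_filter,
        "--prune-empty",  # drop commits that become empty after the removal
        "--", "--all",    # rewrite every ref, not only the current branch
    ]


# Hypothetical pathspecs; substitute the paths that were actually committed.
cmd = build_filter_branch_cmd(["*.npy", "venv/"])
print(shlex.join(cmd))
```

After a rewrite like this, the old objects generally stay on disk until the backup refs under `refs/original/` are removed, reflogs are expired (`git reflog expire --expire=now --all`), and `git gc --prune=now` is run. Every commit hash changes, so the rewritten branches need a force-push and existing clones need to be re-cloned or reset.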