Use snapshot to prevent warmup from affecting training and refactor warmup by michaelmckinsey1 · Pull Request #68 · LBANN/ScaFFold

michaelmckinsey1 · 2026-05-07T22:57:31Z

Fixes an issue where warmup was affecting training: a snapshot taken before warmup is now used to restore the model state afterward.
Fixes an issue where validation was not being warmed up properly.
Refactors logic shared by the warmup and train bodies in trainer.py.

With this configuration, runtime of the first epoch should now be similar to subsequent epochs.
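The snapshot-and-restore idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `take_snapshot`, `restore_snapshot`, `run_with_warmup`, and the `Counter` stand-in are hypothetical names, and the only assumption is that each component exposes `state_dict()` and `load_state_dict()`, the convention PyTorch modules, optimizers, and LR schedulers follow.

```python
import copy


def take_snapshot(components):
    """Deep-copy the state_dict() of every named component."""
    return {name: copy.deepcopy(obj.state_dict()) for name, obj in components.items()}


def restore_snapshot(components, snapshot):
    """Load the saved state back into the same objects, in place."""
    for name, obj in components.items():
        obj.load_state_dict(copy.deepcopy(snapshot[name]))


class Counter:
    """Hypothetical stand-in for a model, optimizer, or scheduler."""

    def __init__(self):
        self.steps = 0

    def state_dict(self):
        return {"steps": self.steps}

    def load_state_dict(self, state):
        self.steps = state["steps"]


def run_with_warmup(components, warmup_steps):
    """Run throwaway warmup iterations, then roll every component back."""
    snapshot = take_snapshot(components)
    for _ in range(warmup_steps):
        for obj in components.values():
            obj.steps += 1  # warmup mutates component state
    restore_snapshot(components, snapshot)
    return components
```

Because the same objects are restored in place, anything that holds a reference to them stays valid. With real PyTorch modules the deep copy matters, since `state_dict()` returns tensors that share storage with the live parameters.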

PatrickRMiles

The snapshot is a cool approach, but I wonder if we can do this more simply by creating another copy of the model and using that separate copy for warmup. We'd create two separate models the same way in worker.py, warm up on one, then train on the other. This would achieve what we want -- preventing warmup from impacting the model state before training -- without needing this snapshot logic. Worth testing at least before we merge this.

michaelmckinsey1 · 2026-06-04T20:10:01Z

> The snapshot is a cool approach, but I wonder if we can do this more simply by creating another copy of the model and using that separate copy for warmup. We'd create two separate models the same way in worker.py, warm up on one, then train on the other. This would achieve what we want -- preventing warmup from impacting the model state before training -- without needing this snapshot logic. Worth testing at least before we merge this.

I tested this: copying the model with copy.deepcopy does address the issue with model state. However, other components such as the optimizer, scheduler, and checkpoint manager would not be reverted. I'm also not sure what the impact would be of creating a new model object at a different address, since I think the PyTorch components are tied to the original model object.
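The concern about components being tied to the original model object can be illustrated with a toy example. The sketch below uses hypothetical stand-in classes (`Param`, `ToyModel`, `ToyOptimizer`) rather than real PyTorch, on the assumption that an optimizer keeps references to the parameter objects it was constructed with, so a deep copy of the model is invisible to it.

```python
import copy


class Param:
    """Hypothetical stand-in for a trainable parameter."""

    def __init__(self, value):
        self.value = value


class ToyModel:
    """Hypothetical stand-in for a model that owns its parameters."""

    def __init__(self):
        self.params = [Param(1.0), Param(2.0)]


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer bound to specific parameters."""

    def __init__(self, params):
        self.params = list(params)  # keeps references, not copies

    def step(self):
        for p in self.params:
            p.value -= 0.5


original = ToyModel()
optimizer = ToyOptimizer(original.params)  # bound to the original's parameters
clone = copy.deepcopy(original)            # new parameter objects at new addresses

optimizer.step()

# Only the model the optimizer was built from is updated.
original_values = [p.value for p in original.params]
clone_values = [p.value for p in clone.params]
```

Under this assumption, a two-model approach would need a separate optimizer (and scheduler) per model, whereas restoring a snapshot in place leaves the existing bindings untouched.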

PatrickRMiles

We've decided to keep the snapshot approach. This looks good!

michaelmckinsey1 self-assigned this May 7, 2026

michaelmckinsey1 changed the title ~~Use snapshot to prevent warmup from affecting training~~ Use snapshot to prevent warmup from affecting training and refactor warmup May 7, 2026

michaelmckinsey1 linked an issue May 7, 2026 that may be closed by this pull request

Refactor common logic in warmup and train bodies in trainer.py #45

Closed

michaelmckinsey1 added 4 commits May 14, 2026 13:27

Use snapshot to prevent warmup from influencing training

0b642e2

Fix validation warmup and increase default warmup

549ec86

Refactor shared trainer logic in warmup

f91a65a

better default

5582afb

michaelmckinsey1 force-pushed the fix-warmup branch from 85caa27 to 5582afb Compare May 14, 2026 20:27

michaelmckinsey1 requested a review from PatrickRMiles May 19, 2026 23:32

Merge remote-tracking branch 'origin/main' into fix-warmup

47e82d0

PatrickRMiles requested changes Jun 4, 2026

PatrickRMiles self-requested a review June 4, 2026 21:20

PatrickRMiles approved these changes Jun 4, 2026

michaelmckinsey1 merged commit f330ec3 into LBANN:main Jun 4, 2026
1 check passed

michaelmckinsey1 merged 5 commits into LBANN:main from michaelmckinsey1:fix-warmup
