Skip to content

Use snapshot to prevent warmup from affecting training and refactor warmup#68

Merged
michaelmckinsey1 merged 5 commits into
LBANN:mainfrom
michaelmckinsey1:fix-warmup
Jun 4, 2026
Merged

Use snapshot to prevent warmup from affecting training and refactor warmup#68
michaelmckinsey1 merged 5 commits into
LBANN:mainfrom
michaelmckinsey1:fix-warmup

Conversation

@michaelmckinsey1

@michaelmckinsey1 michaelmckinsey1 commented May 7, 2026

Copy link
Copy Markdown
Collaborator
  • Fixes issue where warmup was affecting training by using snapshot to restore model state to that before warmup.
  • Fixes issue where validation was not being warmed up properly.
  • Refactor shared logic

With this configuration, runtime of the first epoch should now be similar to subsequent epochs.

@michaelmckinsey1 michaelmckinsey1 self-assigned this May 7, 2026
@michaelmckinsey1 michaelmckinsey1 changed the title Use snapshot to prevent warmup from affecting training Use snapshot to prevent warmup from affecting training and refactor warmup May 7, 2026
@michaelmckinsey1 michaelmckinsey1 linked an issue May 7, 2026 that may be closed by this pull request

@PatrickRMiles PatrickRMiles left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot is a cool approach, but I wonder if we can do this simpler by just creating another copy of the model and using that separate copy for warmup. We'd create two separate models the same way in worker.py, warmup on one, then train on the other. This would achieve what we want -- preventing warmup from impacting the model state before training -- without needing this snapshot logic. Worth testing at least before we merge this

@michaelmckinsey1

Copy link
Copy Markdown
Collaborator Author

The snapshot is a cool approach, but I wonder if we can do this simpler by just creating another copy of the model and using that separate copy for warmup. We'd create two separate models the same way in worker.py, warmup on one, then train on the other. This would achieve what we want -- preventing warmup from impacting the model state before training -- without needing this snapshot logic. Worth testing at least before we merge this

I did test that copying the model using copy.deepcopy would address the issue with model state. However, the other components like the optimizer, scheduler, and checkpoint manager would not be reverted. Additionally, I'm not sure what the impact would be for creating a new model object with a different address, as I think the pytorch components are tied to the original model object.

@PatrickRMiles PatrickRMiles self-requested a review June 4, 2026 21:20

@PatrickRMiles PatrickRMiles left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've decided to keep the snapshot approach. This looks good!

@michaelmckinsey1 michaelmckinsey1 merged commit f330ec3 into LBANN:main Jun 4, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor common logic in warmup and train bodies in trainer.py

2 participants