Make model tests more scalable #154
Merged
jla-gardner
approved these changes
Jun 9, 2026
mrwnmsr
approved these changes
Jun 9, 2026
With the upcoming integration of new models from RetroChimera, I found that the way we currently run single-step model tests does not scale, especially once larger models are included. First, we have been running the tests for all models in a single process, so all models are loaded simultaneously; this is inherently bounded by the total memory available on the CI runners. Second, downloading many heavy model checkpoints is slow, and the time it takes has a large variance. This PR attempts to resolve both issues so that we can scale to more single-step models in the future.
To avoid keeping all models in memory simultaneously, one could try forcing garbage collection, but this does not clean up all state, especially for models that involve multiprocessing or call into non-Python-native libraries. This PR instead proposes running the tests in separate processes, which ensures all state is cleaned up. Previously, in `test_models.py`, we tested every model twice, to also check that loading one model doesn't make another model unusable; in the new setup this no longer makes sense, as the models are now completely separate. We do lose a bit of test coverage, since we no longer test interactions between the models, but doing so exhaustively would not scale anyway; also, in the majority of use cases only one model is used at a time.
Finally, to reduce the burden of downloading model checkpoints, I use the `cache` action to cache the checkpoint directory (keyed on the contents of `default_checkpoint_ids.yml`), so that CI runs that don't add new models or change existing ones benefit from a much faster checkpoint download.
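The process-isolation idea can be illustrated with a minimal sketch. It is not necessarily how this PR wires things up: the model names, the `pytest_command` helper, and the assumption that the model name appears in each test ID (so that pytest's `-k` filter can select it) are all hypothetical and only serve the example.

```python
import subprocess
import sys
from typing import Callable, Dict, List

# Hypothetical model names, used only for illustration.
MODEL_NAMES = ["model_a", "model_b"]


def pytest_command(model_name: str) -> List[str]:
    """Build a command that runs only the tests matching one model.

    Assumes the tests in test_models.py are parametrized such that the
    model name is part of the test ID, so `-k` can select them.
    """
    return [sys.executable, "-m", "pytest", "test_models.py", "-k", model_name]


def run_isolated(
    model_names: List[str],
    build_command: Callable[[str], List[str]] = pytest_command,
) -> Dict[str, int]:
    """Run each model's tests in a fresh process, one after another."""
    exit_codes = {}
    for name in model_names:
        # subprocess.run blocks until the child exits. Once it has exited,
        # the OS reclaims all of its memory, including state held by worker
        # processes' parent or by native libraries that Python's garbage
        # collector cannot see.
        result = subprocess.run(build_command(name), check=False)
        exit_codes[name] = result.returncode
    return exit_codes
```

Calling `run_isolated(MODEL_NAMES)` would return a mapping from model name to exit code, where any non-zero value marks a failing model. Because the runs are sequential, peak memory is bounded by the largest single model rather than by the sum of all models.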