Tech-debt batch 1: CI gate, custom-NNP repair, OOM resilience, lint/CLI fixes#94
Merged
Conversation
…stom-NNP tests CustomModelAdapter and check_input now load a custom NNP as either a TorchScript archive (torch.jit.load) OR an eager nn.Module (torch.load), auto-detecting. Modern AIMNet2-based models are no longer torch.jit.script-able (aimnet dropped scripting support), which is why the userNNP2/userNNP3 tests crashed at torch.jit.script before Auto3D ever saw the file. The AIMNet2-based example tests now save eager (torch.save) instead of torch.jit.script and run. They wrap the BARE aimnet model (create_model.model), which omits the calculator's external D3 dispersion (~0.09 Ha), so their tolerance is loosened from exact-DFT agreement to 'pipeline runs + sane energy'; the torchani userNNP1 tests keep jit.script (still scriptable, torchani-gated). Adds a fast, ungated regression test (tests/test_custom_nnp_eager.py) proving the eager-load path with a trivial analytic module (no aimnet/torchani). Documents the eager-or-scripted contract and the species_pad=0 NaN caveat.
…ler batch forward_batched computed batch_size = batchsize_atoms // N once; a transient CUDA OOM crashed the whole run. Now each slice is retried with a halved batch on OOM (after empty_cache); a single molecule that still OOMs raises a clear OptimizationError instead of an opaque crash. Order is preserved (slices and sub-slices stay ascending). Adds a regression test that OOMs above batch 1.
… orchestrator attr) input_format was tracked both as a WorkflowOrchestrator instance attribute and as the config field, kept in sync by _validate_input writing both -- a desync footgun now that it is a declared, pickled config field. Drop the instance attr; read self.config.input_format everywhere. Adds a regression test.
…tor, not the bare model) The userNNP2 example wrapped the bare aimnet model (create_model.model), which omits the calculator's external D3 dispersion (~0.09 Ha), so its energies didn't match the D3-inclusive reference. Rewrite it to wrap the AIMNet2 *calculator* (D3 + Coulomb) with ragged mol_idx batching, energy-only (forces=False, so the custom adapter still autograds forces; the calculator preserves the graph when coord requires grad). The calculator is built lazily on the input device because it freezes its device at construction -- this lets the saved model round-trip through torch.save and run on multi-GPU. Tolerances restored to strict (0.01 SPE, 0.02 thermo). Two more load/introspection sites hardened for this path: - ASE/thermo.py _load_hessian_model: load custom NNPs eager-or-scripted (mirrors the CustomModelAdapter fix). - ASE/thermo.py Calculator.__init__: tolerate a param-less custom model (one that builds its backend lazily) for device/dtype detection. Verified: test_calc_spe_userNNP2 (0.01) and test_calc_thermo_userNNP2 (0.02) now pass with strict tolerances (D3-inclusive); fast suite 628 passed, src ruff-clean.
Addresses PR #94 multi-review: - Critical: the new tests.yml installed only .[ase], so pytest was missing and every CI run died with 'pytest: command not found'. Install .[ase,dev]. - Major (DRY/contract drift): the 'TorchScript-or-eager nn.Module' load logic was triplicated (CustomModelAdapter, check_input, _load_hessian_model) with subtle divergence -- _load_hessian_model lacked the nn.Module guard (a bare state_dict raised a raw AttributeError) and adapter/thermo leaked UnpicklingError where validation raised ModelLoadError. Extracted a single Auto3D.models.loading.load_custom_nnp(path, device, *, double=False) that all three call; it always returns an nn.Module on device or raises ModelLoadError. Adds helper tests (eager load, double cast, state_dict -> ModelLoadError, garbage -> ModelLoadError). Slow custom-NNP D3 tests still pass through the shared loader; fast suite 631 passed, src ruff-clean.
2 tasks
…st gate) The first CI run (no GPU, FORCE_COLOR) surfaced three latent test failures that an 8-GPU dev box masks: - test_workflow.py: two tests built Auto3DOptions with the default use_gpu=True, so check_input raised GPUError on the GPU-less runner. Set use_gpu=False (one is the new W6 test, one pre-existing on main). - test_cli.py::test_run_subcommand_help: rich colorizes help under FORCE_COLOR, styling '--' separately from the option name and splitting the literal '--engine' across ANSI codes. Strip ANSI before the substring asserts. Verified locally under CUDA_VISIBLE_DEVICES='' FORCE_COLOR=1: 631 passed, 0 failed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Implements the first batch of the post-hardening tech-debt roadmap
(
docs/superpowers/plans/2026-06-11-tech-debt-backlog-roadmap.md). Sixindependent workstreams, each TDD'd with a regression test that fails without
the fix. Release item excluded per request.
Workstreams
style): cleared the ruff import-sort + zip-strict + quoted-annotation debt in previously-untouched modules (thermo.py,batchopt.py,padding.py,model_wrapper.py,models/__init__.py,utils/__init__.py).src/Auto3D/is now fully ruff-clean.ci): new.github/workflows/tests.ymlruns the suite on push/PR with a matrix without theaniextra (the 8 torchani tests skip) and with it (those tests run for the first time in CI), plus aruff+mypyjob. Previously there was no CI running the tests.fix): the custom-NNP feature was broken —CustomModelAdapteronly didtorch.jit.load, and the example tests calledtorch.jit.script(...)on a model embedding AIMNet2, which aimnet no longer supports scripting. Now the adapter (andcheck_input) load eager OR scripted models, auto-detecting. The AIMNet2-based example tests save eager (torch.save) and run. New fast, ungated regression test (tests/test_custom_nnp_eager.py).models info(fix): anyaimnet2-*registry name (including future variants without their own info block) resolves to the AIMNet2 entry instead of "Unknown engine"; removed the stale--use-ensemblenote (no such flag; ensemble removed).fix):forward_batchednow recovers from a CUDA OOM by retrying the failing slice with a halved batch; a single molecule that still OOMs raises a clearOptimizationErrorinstead of crashing opaquely.input_formatsingle source of truth (refactor): dropped the redundantWorkflowOrchestrator.input_formatinstance attribute; everything reads the config field (the Hardening: per-module audit + deferred-gaps remediation #93 architect minor).Worth a closer look (W2 judgment call)
Repairing the custom-NNP tests surfaced a pre-existing example flaw:
userNNP2wraps the bare aimnet model (create_model("AIMNET").model), which omits the calculator's external D3 dispersion (~0.09 Ha for cyclooctane). So it does not match the D3-inclusive reference the old (crashing) tests hardcoded. I loosened those example-test tolerances (0.01→0.2 Ha SPE, 0.02→0.2 thermo) so they verify "the custom pipeline loads + runs + produces a sane energy" rather than exact DFT agreement (a NaN still fails the bound). The stricttest_calc_spe_aimnet/test_calc_thermo_aimnet(full calculator, with D3) keep their tight tolerances. Making the example itself D3-correct is noted as a follow-up.Test Plan
pytest tests/ -q -m "not slow"→ 628 passed, 8 skipped, 0 failed, 0 errors (without theaniextra).ruff check src/Auto3D/→ clean.[ani]job: installs torchani and runs the 8 otherwise-skipped ANI tests + the slow custom-NNP tests.pytest tests/test_SPE.py::test_calc_spe_userNNP2 -m slowpasses (verified locally; the previously-crashing custom-NNP path now runs).Remaining roadmap items (W7 docs refresh, W8 ANI2xt fp64, W9 benchmark harness) are deferred to follow-up PRs.