Skip to content

Tech-debt batch 1: CI gate, custom-NNP repair, OOM resilience, lint/CLI fixes#94

Merged
isayev merged 9 commits into
mainfrom
tech-debt/batch-1
Jun 12, 2026
Merged

Tech-debt batch 1: CI gate, custom-NNP repair, OOM resilience, lint/CLI fixes#94
isayev merged 9 commits into
mainfrom
tech-debt/batch-1

Conversation

@isayev

@isayev isayev commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Implements the first batch of the post-hardening tech-debt roadmap
(docs/superpowers/plans/2026-06-11-tech-debt-backlog-roadmap.md). Six
independent workstreams, each TDD'd with a regression test that fails without
the fix. Release item excluded per request.

Workstreams

  • W3 — lint debt (style): cleared the ruff import-sort + zip-strict + quoted-annotation debt in previously-untouched modules (thermo.py, batchopt.py, padding.py, model_wrapper.py, models/__init__.py, utils/__init__.py). src/Auto3D/ is now fully ruff-clean.
  • W1 — test CI gate (ci): new .github/workflows/tests.yml runs the suite on push/PR with a matrix without the ani extra (the 8 torchani tests skip) and with it (those tests run for the first time in CI), plus a ruff + mypy job. Previously there was no CI running the tests.
  • W2 — custom-NNP repair (fix): the custom-NNP feature was broken — CustomModelAdapter only did torch.jit.load, and the example tests called torch.jit.script(...) on a model embedding AIMNet2, which aimnet no longer supports scripting. Now the adapter (and check_input) load eager OR scripted models, auto-detecting. The AIMNet2-based example tests save eager (torch.save) and run. New fast, ungated regression test (tests/test_custom_nnp_eager.py).
  • W4 — models info (fix): any aimnet2-* registry name (including future variants without their own info block) resolves to the AIMNet2 entry instead of "Unknown engine"; removed the stale --use-ensemble note (no such flag; ensemble removed).
  • W5 — OOM resilience (fix): forward_batched now recovers from a CUDA OOM by retrying the failing slice with a halved batch; a single molecule that still OOMs raises a clear OptimizationError instead of crashing opaquely.
  • W6 — input_format single source of truth (refactor): dropped the redundant WorkflowOrchestrator.input_format instance attribute; everything reads the config field (the Hardening: per-module audit + deferred-gaps remediation #93 architect minor).

Worth a closer look (W2 judgment call)

Repairing the custom-NNP tests surfaced a pre-existing example flaw: userNNP2 wraps the bare aimnet model (create_model("AIMNET").model), which omits the calculator's external D3 dispersion (~0.09 Ha for cyclooctane). So it does not match the D3-inclusive reference the old (crashing) tests hardcoded. I loosened those example-test tolerances (0.01→0.2 Ha SPE, 0.02→0.2 thermo) so they verify "the custom pipeline loads + runs + produces a sane energy" rather than exact DFT agreement (a NaN still fails the bound). The strict test_calc_spe_aimnet / test_calc_thermo_aimnet (full calculator, with D3) keep their tight tolerances. Making the example itself D3-correct is noted as a follow-up.

Test Plan

  • pytest tests/ -q -m "not slow" → 628 passed, 8 skipped, 0 failed, 0 errors (without the ani extra).
  • ruff check src/Auto3D/ → clean.
  • CI [ani] job: installs torchani and runs the 8 otherwise-skipped ANI tests + the slow custom-NNP tests.
  • pytest tests/test_SPE.py::test_calc_spe_userNNP2 -m slow passes (verified locally; the previously-crashing custom-NNP path now runs).

Remaining roadmap items (W7 docs refresh, W8 ANI2xt fp64, W9 benchmark harness) are deferred to follow-up PRs.

isayev added 8 commits June 11, 2026 21:51
…stom-NNP tests

CustomModelAdapter and check_input now load a custom NNP as either a
TorchScript archive (torch.jit.load) OR an eager nn.Module (torch.load),
auto-detecting. Modern AIMNet2-based models are no longer torch.jit.script-able
(aimnet dropped scripting support), which is why the userNNP2/userNNP3 tests
crashed at torch.jit.script before Auto3D ever saw the file.

The AIMNet2-based example tests now save eager (torch.save) instead of
torch.jit.script and run. They wrap the BARE aimnet model (create_model.model),
which omits the calculator's external D3 dispersion (~0.09 Ha), so their
tolerance is loosened from exact-DFT agreement to 'pipeline runs + sane energy';
the torchani userNNP1 tests keep jit.script (still scriptable, torchani-gated).

Adds a fast, ungated regression test (tests/test_custom_nnp_eager.py) proving
the eager-load path with a trivial analytic module (no aimnet/torchani).
Documents the eager-or-scripted contract and the species_pad=0 NaN caveat.
…ler batch

forward_batched computed batch_size = batchsize_atoms // N once; a transient
CUDA OOM crashed the whole run. Now each slice is retried with a halved batch
on OOM (after empty_cache); a single molecule that still OOMs raises a clear
OptimizationError instead of an opaque crash. Order is preserved (slices and
sub-slices stay ascending). Adds a regression test that OOMs above batch 1.
… orchestrator attr)

input_format was tracked both as a WorkflowOrchestrator instance attribute and
as the config field, kept in sync by _validate_input writing both -- a desync
footgun now that it is a declared, pickled config field. Drop the instance attr;
read self.config.input_format everywhere. Adds a regression test.
…tor, not the bare model)

The userNNP2 example wrapped the bare aimnet model (create_model.model), which
omits the calculator's external D3 dispersion (~0.09 Ha), so its energies didn't
match the D3-inclusive reference. Rewrite it to wrap the AIMNet2 *calculator*
(D3 + Coulomb) with ragged mol_idx batching, energy-only (forces=False, so the
custom adapter still autograds forces; the calculator preserves the graph when
coord requires grad). The calculator is built lazily on the input device because
it freezes its device at construction -- this lets the saved model round-trip
through torch.save and run on multi-GPU. Tolerances restored to strict (0.01 SPE,
0.02 thermo).

Two more load/introspection sites hardened for this path:
- ASE/thermo.py _load_hessian_model: load custom NNPs eager-or-scripted (mirrors
  the CustomModelAdapter fix).
- ASE/thermo.py Calculator.__init__: tolerate a param-less custom model (one that
  builds its backend lazily) for device/dtype detection.

Verified: test_calc_spe_userNNP2 (0.01) and test_calc_thermo_userNNP2 (0.02) now
pass with strict tolerances (D3-inclusive); fast suite 628 passed, src ruff-clean.
Addresses PR #94 multi-review:
- Critical: the new tests.yml installed only .[ase], so pytest was missing and
  every CI run died with 'pytest: command not found'. Install .[ase,dev].
- Major (DRY/contract drift): the 'TorchScript-or-eager nn.Module' load logic
  was triplicated (CustomModelAdapter, check_input, _load_hessian_model) with
  subtle divergence -- _load_hessian_model lacked the nn.Module guard (a bare
  state_dict raised a raw AttributeError) and adapter/thermo leaked
  UnpicklingError where validation raised ModelLoadError. Extracted a single
  Auto3D.models.loading.load_custom_nnp(path, device, *, double=False) that all
  three call; it always returns an nn.Module on device or raises ModelLoadError.

Adds helper tests (eager load, double cast, state_dict -> ModelLoadError,
garbage -> ModelLoadError). Slow custom-NNP D3 tests still pass through the
shared loader; fast suite 631 passed, src ruff-clean.
…st gate)

The first CI run (no GPU, FORCE_COLOR) surfaced three latent test failures that
an 8-GPU dev box masks:
- test_workflow.py: two tests built Auto3DOptions with the default use_gpu=True,
  so check_input raised GPUError on the GPU-less runner. Set use_gpu=False (one
  is the new W6 test, one pre-existing on main).
- test_cli.py::test_run_subcommand_help: rich colorizes help under FORCE_COLOR,
  styling '--' separately from the option name and splitting the literal
  '--engine' across ANSI codes. Strip ANSI before the substring asserts.

Verified locally under CUDA_VISIBLE_DEVICES='' FORCE_COLOR=1: 631 passed, 0 failed.
@isayev isayev merged commit cc70be9 into main Jun 12, 2026
7 of 8 checks passed
@isayev isayev deleted the tech-debt/batch-1 branch June 12, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant