Skip to content

feat: add Parsl-based parallel build orchestration#17

Open
maxscheurer wants to merge 10 commits into
mainfrom
feature/issue-16-parsl-build-orchestration
Open

feat: add Parsl-based parallel build orchestration#17
maxscheurer wants to merge 10 commits into
mainfrom
feature/issue-16-parsl-build-orchestration

Conversation

@maxscheurer

Copy link
Copy Markdown
Collaborator

Closes #16 (PR 1 of 4)

Add Parsl as a Python-native workflow manager for parallel build orchestration on SLURM GPU nodes. Extends mdfactory build to accept CSV files directly and dispatch builds in parallel via Parsl's @python_app + HighThroughputExecutor.

Implementation plan posted as a comment below.

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Implementation Plan

Problem

Build orchestration requires Nextflow (build.nf) to parallelize mdfactory build on SLURM GPU nodes. This is the only non-Python tool in the critical build path. We need a Python-native replacement using Parsl that integrates directly into the existing mdfactory build command.

Approach

Use @python_app wrapping run_build_from_dict() — pass a plain dict (from BuildInput.model_dump()), let the Parsl worker do all heavy imports (OpenMM, MDAnalysis, OpenFF) internally. No serialization of scientific objects across processes. No @bash_app / shell escaping needed.

Extend mdfactory build to handle CSV input directly — absorb the prepare→build two-step into one command when running parallel builds. Three input modes:

Input Behavior
Single YAML Build one system locally (current behavior, unchanged)
CSV file Create BuildInput models + per-hash dirs, dispatch all builds (parallel with --config, sequential without)
Summary YAML or directory of YAMLs Dispatch existing prepared builds

Deliverables

New module: mdfactory/orchestration/

File Purpose
__init__.py Public API: ExecutorConfig, SlurmExecutorConfig, build_systems, build_systems_dry_run
config.py Pydantic models for executor configuration → parsl.Config objects
apps.py @python_app definition: build_system_app(build_input_dict) wrapping run_build_from_dict
build.py build_systems() — loads Parsl, submits parallel builds, returns futures. build_systems_dry_run() — prints what would run without submitting

New test files

File Coverage
mdfactory/tests/test_orchestration_config.py Config model construction, validation, to_parsl_config() with mocked Parsl
mdfactory/tests/test_orchestration_build.py Build dispatch with mocked Parsl, dry-run output, CSV/YAML/directory input handling, error paths

Modified files

File Change
pyproject.toml Add parsl = ["parsl>=2024.1.1"] to [project.optional-dependencies]
mdfactory/cli.py Extend build command: accept CSV input, add --config (executor YAML) and --dry-run flags. Reuse prepare.py internals for CSV→BuildInput conversion

Design Details

config.py — Executor configuration:

class ExecutorConfig(BaseModel):
    """Base executor configuration."""
    provider: Literal["local", "slurm"] = "local"
    worker_init: str = ""         # module loads, conda/pixi activation
    working_directory: Path | None = None

class SlurmExecutorConfig(ExecutorConfig):
    """SLURM-specific executor configuration."""
    provider: Literal["slurm"] = "slurm"
    account: str                  # required for SLURM
    partition: str = "gpu"
    walltime: str = "2:00:00"
    nodes_per_block: int = 1
    cores_per_node: int = 12
    gpus_per_node: int = 1
    scheduler_options: str = ""   # extra #SBATCH lines

    def to_parsl_config(self) -> "parsl.Config":
        """Build a Parsl Config with HighThroughputExecutor + SlurmProvider."""
        ...

Follows the SlurmConfig dataclass pattern from analysis/submit.py.
Config can be loaded from a YAML file for reuse across campaigns.

apps.py — Parsl app:

@python_app
def build_system_app(build_input_dict: dict):
    """Run a single system build inside a Parsl worker.

    All heavy imports (OpenMM, MDAnalysis, OpenFF) happen inside the
    worker process — only the plain dict crosses the serialization boundary.
    """
    from mdfactory.workflows import run_build_from_dict
    return run_build_from_dict(build_input_dict)

build.py — Orchestration:

  • build_systems(build_inputs, config, *, wait=True) — accepts list of BuildInput or dicts, calls .model_dump() on each, submits via build_system_app, optionally waits for all futures
  • build_systems_dry_run(build_inputs, config) — prints each system's hash + simulation type + config summary without loading Parsl
  • Conditional import parsl inside function bodies (same pattern as submitit in submit.py)

CLI changes (cli.py):

The existing build command currently accepts a single YAML path. Extend it to:

mdfactory build <input>                     # single YAML → local build (unchanged)
mdfactory build <input.csv>                 # CSV → sequential local builds
mdfactory build <input.csv> --config e.yaml # CSV → parallel Parsl builds
mdfactory build <input.csv> --dry-run       # CSV → print what would run
mdfactory build <summary.yaml> --config ... # summary YAML → parallel builds

Input type detection by file extension (.csv vs .yaml/.yml). For CSV input, reuse df_to_build_input_models() from prepare.py and directory creation logic from prepare_build.

Testing Approach

test_orchestration_config.py:

  • ExecutorConfig and SlurmExecutorConfig construction with defaults
  • Validation: missing account for SLURM raises error
  • to_parsl_config() with Parsl mocked via monkeypatch.setitem(sys.modules, "parsl", ...) — verify correct provider/executor types
  • Config round-trip: write YAML → load → compare
  • worker_init string propagation into config

test_orchestration_build.py:

  • Mock Parsl with DummyDataFlowKernel capturing submitted app calls
  • build_systems() submits correct number of tasks
  • Each submitted task receives the correct build_input_dict
  • build_systems_dry_run() returns descriptions without loading Parsl
  • CSV input: creates directories + dispatches builds
  • Error handling: missing files, invalid BuildInput, failed futures
  • Import guard: mdfactory.orchestration importable without parsl installed

Acceptance Criteria

  • pip install 'mdfactory[parsl]' installs parsl
  • Importing mdfactory without parsl installed does not error
  • ExecutorConfig / SlurmExecutorConfig produce valid Parsl Config objects for local and SLURM providers
  • build_systems() submits parallel builds via @python_app wrapping run_build_from_dict
  • build_systems_dry_run() prints resolved build descriptions without submitting
  • mdfactory build input.csv --config executor.yaml reads CSV, creates directories, dispatches parallel builds
  • mdfactory build input.csv --dry-run prints what would be built without executing
  • mdfactory build single.yaml still works unchanged (backward compatible)
  • All new code has NumPy-style docstrings, passes ruff check + ruff format
  • Tests pass in CI without parsl installed (mocked)

Risks

  1. Worker environmentmodule load + pixi activation may behave differently inside Parsl workers. First HPC test will reveal this; worker_init config provides the escape hatch.
  2. Parsl version stability — Config API occasionally changes between major versions. Pin >=2024.1.1.
  3. Shared filesystem assumption — design assumes build outputs land on shared FS (standard for HPC). No file staging.
  4. prepare-build overlap — the CSV handling in build partially duplicates prepare-build. We should refactor shared logic into prepare.py helpers but keep prepare-build as a standalone command for users who want the two-step flow.

Plan created by mach6

- New mdfactory/orchestration/ module (config, apps, build)
- ExecutorConfig/SlurmExecutorConfig with SLURM-native field names
- build_systems() dispatches parallel builds via @python_app
- build_systems_dry_run() previews without loading Parsl
- Rich live progress display with activity log
- Ctrl+C triggers parsl.clear() to cancel SLURM jobs
- Extended CLI: mdfactory build accepts CSV/YAML with --config/--dry-run
- Added parsl to optional-dependencies and pixi dev env
- 19 tests covering config, build dispatch, and error handling
@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

Implementation complete

New module mdfactory/orchestration/:

  • config.pyExecutorConfig and SlurmExecutorConfig (Pydantic models with SLURM-native field names: gres, mem, qos, constraint, max_blocks)
  • apps.py@python_app wrapping run_build_from_dict (decoration at runtime, not import time)
  • build.pybuild_systems() for parallel dispatch with Rich live progress TUI, build_systems_dry_run() for preview
  • __init__.py — public API exports

CLI extension (cli.py):

  • mdfactory build now accepts CSV, single YAML, or summary YAML input
  • --config flag for parallel Parsl builds (local or SLURM)
  • --dry-run flag to preview without executing

Tests (19 passing):

  • test_orchestration_config.py — config defaults, YAML load/roundtrip, to_parsl_config(), scheduler options assembly
  • test_orchestration_build.py — dry-run, submission, error handling, no-wait mode, import guard

Infrastructure:

  • parsl>=2024.1.1 added to [project.optional-dependencies] and pixi dev env
  • Rich live progress display with activity log (scales to any number of builds)
  • Ctrl+C handling: parsl.clear() cancels SLURM jobs
  • min_blocks=0 ensures idle SLURM jobs are released

Commit: e9925d5


Progress tracked by mach6

- _shutdown_parsl() extracts SLURM job IDs from DFK and runs scancel
- Ctrl+C now reliably cancels SLURM jobs instead of relying on parsl.clear()
- Error messages no longer truncated in activity log
- Increased activity log to 12 entries
When species.resname is not 'SOL' (e.g. 'WAT'), Interchange would
generate atom types with the wrong prefix (WAT_0 instead of SOL_0),
causing a KeyError in OpenMM's GromacsTopFile parser.

Force molecule.name='SOL' before Interchange export so atom types are
always consistent (SOL_0, SOL_1, SOL_2). Also invalidate stale
SOL_params.itp when the ITP is regenerated.
@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

Bug fix: Force molecule.name = "SOL" for water SMIRNOFF parametrization to prevent atom type naming inconsistency.

Problem: When species.resname is "WAT" (or anything other than "SOL"), Interchange generated atom types with that prefix (WAT_0, WAT_1, WAT_2) in the params file, while the molecule ITP expected SOL_0 types. This caused KeyError: 'SOL_0' when OpenMM tried to load the GROMACS topology.

Fix: Override molecule.name = "SOL" before Interchange export so the cached water parameters are always internally consistent. Also delete stale SOL_params.itp when the ITP is regenerated to prevent cache inconsistencies.

Commit: 93b8491


Progress tracked by mach6

@maxscheurer maxscheurer marked this pull request as ready for review June 8, 2026 06:35
@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Code Review

Critical

No critical findings.

Important

Finding 1 — Parsl DFK leaks on exceptions after parsl.load() (confidence: 92)
mdfactory/orchestration/build.pybuild_systems() calls parsl.load() at the top but only calls _shutdown_parsl() at the bottom with no try/finally. If any exception occurs during the submission loop (e.g., ValidationError from a malformed BuildInput dict) or during _wait_with_progress (anything other than KeyboardInterrupt), the DFK stays loaded, SLURM jobs keep running, and the next call to build_systems() crashes with Parsl's "DFK already loaded" RuntimeError. Fix: wrap the post-parsl.load() body in try/finally: _shutdown_parsl().

Finding 2 — Top-level parsl imports break the optional-dependency pattern (confidence: 90)
orchestration/apps.py, orchestration/build.py, and orchestration/config.py all import parsl at module level. Since orchestration/__init__.py re-exports from these modules, any import from mdfactory.orchestration triggers ImportError when parsl isn't installed — even for code paths that never use Parsl (like --dry-run). The PR description states these follow the conditional-import submitit pattern from analysis/submit.py, but the actual implementation doesn't. Fix: move all parsl imports inside the functions that use them (to_parsl_config(), get_build_app(), build_systems()).

Finding 3 — _shutdown_parsl() swallows all cleanup errors at DEBUG level (confidence: 88)
mdfactory/orchestration/build.py — If parsl.clear() raises (e.g., timeout waiting for workers), the DFK may remain in an undefined state, scancel is never called (SLURM jobs keep running), and the user sees nothing because the error is logged at logger.debug. The "non-fatal" comment is only accurate if the DFK was actually cleared. Fix: log at WARNING/ERROR when parsl.clear() fails, and attempt scancel unconditionally in its own try/except block.

Finding 4 — CLI build command extension is completely untested (confidence: 100)
mdfactory/cli.py — The PR's primary user-facing change (build_system, _build_from_csv, _build_from_yaml, _build_from_summary_yaml) has 0% test coverage. Routing bugs in the suffix dispatch, --dry-run without --config, --config with single YAML, unsupported extensions, and summary YAML error handling would all be invisible.

Finding 5 — _build_system_impl (Parsl worker function) is not tested (confidence: 95)
mdfactory/orchestration/apps.py — The actual function that runs inside Parsl workers is never exercised. Untested behaviors: stripping _-prefixed keys from the input dict, directory creation, os.chdir/restore in the finally block, and the return dict shape. This can be unit-tested by mocking run_build_from_dict and calling _build_system_impl directly.

Suggestions

Finding 6 — wait=False path loads Parsl with no cleanup contract (confidence: 85)
mdfactory/orchestration/build.py — When wait=False, parsl.load() is called but neither parsl.clear() nor _shutdown_parsl() is ever called. The caller receives raw AppFuture objects with no documented cleanup contract. Any subsequent call to build_systems() will crash. Consider removing wait=False (no current caller) or documenting the cleanup requirement.

Finding 7 — Inconsistent output directory for single-YAML builds (confidence: 85)
mdfactory/cli.pymdfactory build system.yaml --output /builds/ puts artifacts in /builds/, but adding --config executor.yaml moves them to /builds/{hash}/. This is a silent behavioral change. The CSV path is consistent (both modes use output/{hash}). Fix: align the sequential single-YAML path to also use output/{hash}, or document the difference.

Finding 8 — Test files fail collection without parsl installed (confidence: 92)
mdfactory/tests/test_orchestration_*.py — Both test files unconditionally import from mdfactory.orchestration.config which imports parsl at module level. Tests can't even be collected without parsl installed, violating the acceptance criterion "Tests pass in CI without parsl installed." Fix: add pytest.importorskip("parsl") or a skipif marker.

Finding 9 — SLURM cleanup and KeyboardInterrupt path never exercised (confidence: 90)
mdfactory/orchestration/build.py_get_slurm_job_ids, _scancel_jobs, and the KeyboardInterrupt branch in _wait_with_progress have zero effective test coverage. These are the most important paths for production use (users interrupting long SLURM runs).

Finding 10 — dict input path in build_systems / build_systems_dry_run untested (confidence: 90)
Both functions accept list[BuildInput | dict] but every test passes mock objects. The dict branch (creating BuildInput(**inp), copying the dict, injecting _build_dir) is never exercised.

Finding 11 — ExecutorConfig.from_yaml error paths untested (confidence: 85)
Missing file, invalid YAML, unknown provider value — all produce raw Python tracebacks rather than user-friendly messages. No tests cover these failure modes.

Finding 12 — parametrize.py bug fix has no regression test (confidence: 85)
The SOL molecule name fix and stale params ITP deletion are untested. All existing tests use resname="SOL", so the fix never triggers. A test with resname="WAT" would catch regressions.

Finding 13 — _get_slurm_job_ids() bare except silently skips SLURM cancellation (confidence: 80)
mdfactory/orchestration/build.py — If Parsl's executor attribute structure changes, _get_slurm_job_ids silently returns [] and SLURM jobs are left running with zero user notification. Fix: log the exception at DEBUG level.

Strengths

  • Clean architecture: the orchestration/ module is well-structured with clear separation of concerns (config, apps, build logic)
  • Rich live progress TUI: the progress display with activity log is a great UX addition for long-running parallel builds
  • Ctrl+C handling: explicit scancel on interrupt shows attention to the HPC user experience
  • min_blocks=0: ensures idle SLURM allocations are released — important for shared cluster etiquette
  • Runtime decoration pattern: get_build_app() decorating at call time rather than import time is a sound design choice
  • Pydantic config models: SlurmExecutorConfig with SLURM-native field names makes config files intuitive for HPC users

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker


Reviewed by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Review Assessment

#17 (comment)

Classifications

Finding Classification Reasoning
1. DFK leaks on exceptions after parsl.load() genuine Confirmed: no try/finally around submission loop or _wait_with_progress. Any exception leaves DFK loaded and SLURM jobs running.
2. Top-level parsl imports break optional-dependency genuine Confirmed: apps.py, build.py, config.py all import parsl at module level. parsl is optional in pyproject.toml. ImportError for users without parsl extra.
3. _shutdown_parsl() swallows errors at DEBUG level genuine Confirmed: single except Exception around parsl.clear() + _scancel_jobs(). If clear() fails, scancel never runs and user sees nothing.
4. CLI build command extension untested genuine Confirmed: build_system(), _build_from_csv(), _build_from_yaml(), _build_from_summary_yaml() — all new code, zero test coverage.
5. _build_system_impl worker function untested genuine Confirmed: the most critical execution path (runs inside Parsl workers) is never exercised in tests.
6. wait=False path has no cleanup contract genuine (low) Confirmed: DFK leaks by design. No current caller, but the public API is broken-by-design.
7. Inconsistent output directory for single-YAML genuine Confirmed: without --config artifacts land in output/, with --config they go to output/{hash}/. Behavioral surprise.
8. Test files fail collection without parsl genuine Confirmed: module-level imports from orchestration.config which imports parsl at top. Tests can't be collected in base env.
9. SLURM cleanup/KeyboardInterrupt untested genuine Confirmed: _get_slurm_job_ids, _scancel_jobs, KeyboardInterrupt branch — all new code, zero coverage.
10. dict input path untested genuine Confirmed: isinstance(inp, dict) branch in both build_systems and build_systems_dry_run never exercised.
11. from_yaml error paths untested genuine (low) Confirmed: missing file, invalid YAML, unknown provider produce raw tracebacks. Low impact since errors do surface.
12. parametrize.py bug fix has no regression test genuine Confirmed: molecule.name = "SOL" fix never triggers in existing tests (all use resname="SOL" already).
13. _get_slurm_job_ids() bare except nitpick Best-effort cleanup introspecting internal Parsl API. Silent pass is defensible. Finding 3 covers the more serious version.

Action Plan

Must fix before merge:

  1. Wrap build_systems() post-parsl.load() body in try/finally: _shutdown_parsl() — DFK leak leaves SLURM jobs running on any exception
  2. Move parsl imports inside functions that use them (to_parsl_config(), get_build_app(), build_systems()) — parsl is optional, top-level imports break users without it
  3. Fix test collection — add pytest.importorskip("parsl") at top of both test files
  4. Raise log level in _shutdown_parsl() from DEBUG to WARNING and attempt scancel unconditionally

Should ship with this PR:

  1. Add CLI integration tests for build_system() dispatch paths (CSV, YAML, summary YAML, dry-run, unsupported extension)
  2. Add unit test for _build_system_impl (mock run_build_from_dict, verify dict stripping, chdir, return shape)
  3. Add tests for KeyboardInterrupt path and SLURM cleanup functions
  4. Add dict-input tests for build_systems() and build_systems_dry_run()
  5. Add regression test for SOL molecule name fix (resname="WAT" input)
  6. Normalize output directory: single-YAML + --config should use output/{hash}/ for consistency

Can follow up:

  1. Document or remove wait=False (no current caller, architectural concern)
  2. Add from_yaml error-path tests (low impact, errors do surface)

Assessment by mach6

- Move parsl imports inside functions (optional-dependency pattern)
- Wrap build_systems() in try/finally for DFK cleanup on exceptions
- Split _shutdown_parsl() into independent blocks, log at WARNING
- Normalize single-YAML output directory to output/{hash}/
- Add pytest.importorskip guard to orchestration test files
- Add tests: _build_system_impl, SLURM cleanup, KeyboardInterrupt,
  dict input path, CLI integration (9 tests), parametrize regression
- Add PLC0415 to ruff ignore (lazy imports intentional throughout)
@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

Review findings addressed (findings 1–10, 12)

Source fixes:

  • Finding 1 (DFK lifecycle): Wrapped build_systems() post-parsl.load() body in try/finally with a cleanup_needed flag — ensures _shutdown_parsl() runs on any exception while preserving wait=False semantics
  • Finding 2 (lazy imports): Moved all parsl imports inside the functions that use them (to_parsl_config(), get_build_app(), build_systems(), _shutdown_parsl()), following the submitit pattern from analysis/submit.py
  • Finding 3 (cleanup logging): Split _shutdown_parsl() into independent try/except blocks — parsl.clear() failures now log at WARNING level, and scancel runs unconditionally
  • Finding 6 (wait=False): Added docstring clarification that caller must call parsl.clear(), plus a logger.warning() on the code path
  • Finding 7 (output dir): Single-YAML sequential builds now use output/{hash}/ for consistency with CSV/Parsl paths

Test additions (18 new tests):

  • Finding 4: New test_orchestration_cli.py with 9 integration tests (CSV, YAML, summary YAML dispatch, dry-run, errors)
  • Finding 5: Tests for _build_system_impl (key stripping, chdir, cwd restore on failure)
  • Finding 8: Added pytest.importorskip("parsl") to both orchestration test files
  • Finding 9: Tests for _shutdown_parsl, _get_slurm_job_ids, KeyboardInterrupt handling
  • Finding 10: Tests for dict input path in build_systems and build_systems_dry_run
  • Finding 12: Regression test for SOL molecule name fix with resname="WAT"

Infrastructure:

  • Added PLC0415 to ruff ignore list (lazy imports are intentional throughout the codebase)

All 37 new/modified tests pass.

Commit: 85694e9


Progress tracked by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Code Review (Re-review after fixes)

Critical

No critical findings.

Important

Finding 1 — Double _shutdown_parsl() on KeyboardInterrupt emits spurious WARNING (confidence: 92)
mdfactory/orchestration/build.py — When Ctrl+C occurs during _wait_with_progress, the function's except KeyboardInterrupt calls _shutdown_parsl() and re-raises. The exception then propagates to build_systems's finally block, which calls _shutdown_parsl() again. The second call finds an already-cleared DFK and emits WARNING: Parsl shutdown failed — DFK may still be loaded. This confuses operators during an already-stressful interrupt. Fix: remove _shutdown_parsl() from inside _wait_with_progress's handler — the finally clause in build_systems already owns the lifecycle.

Finding 2 — _build_from_csv modifies filesystem even under --dry-run (confidence: 83)
mdfactory/cli.py — The directory creation and YAML file writing happen unconditionally, before the dry_run guard check. A user running mdfactory build input.csv --dry-run will have output/{hash}/ directories and {hash}.yaml files written to disk. This violates the --dry-run contract (no side effects) and is inconsistent with _build_from_yaml, which does not write files in dry-run mode. Fix: guard the filesystem writes with if not dry_run:.

Finding 3 — zip(dirs, hashes) in _build_from_summary_yaml silently truncates on length mismatch (confidence: 85)
mdfactory/cli.py — If the summary YAML has system_directory and hash lists of different lengths (corrupted or manually edited), zip silently truncates to the shorter list. Systems are silently dropped with no warning. Fix: assert equal lengths before zipping.

Finding 4 — Empty/null executor config YAML crashes with opaque AttributeError (confidence: 88)
mdfactory/orchestration/config.pyyaml.safe_load returns None for an empty file. data.get("provider", "local") then raises AttributeError: 'NoneType' object has no attribute 'get' with no indication the file is empty/invalid. Fix: guard with if not isinstance(data, dict): raise ValueError(...).

Finding 5 — scancel timeout silently absorbed at DEBUG level (confidence: 85)
mdfactory/orchestration/build.py — On an overloaded SLURM head node, scancel can hang >10s. subprocess.TimeoutExpired lands in the bare except Exception and is logged only at DEBUG. The user gets no indication that GPU jobs are still running. Fix: catch TimeoutExpired explicitly and log at WARNING with the job IDs for manual cancellation.

Finding 6 — _build_from_summary_yaml() happy paths completely untested (confidence: 95)
mdfactory/tests/test_orchestration_cli.py — Only the two sys.exit(1) guard paths are tested. The dry-run, sequential, and parallel execution branches are unexercised. This is the path used when mdfactory build is given prepare-build output — a primary production workflow.

Suggestions

Finding 7 — Empty single-build YAML produces opaque TypeError (confidence: 80)
mdfactory/cli.py — An empty YAML file returns None from safe_load. The code falls into the single-build branch and calls BuildInput(**None)TypeError. Fix: add an isinstance(data, dict) guard after loading.

Finding 8 — _build_from_csv() CSV validation error path untested (confidence: 90)
When df_models_from_input_csv() returns errors, the function logs them and exits. This path is never exercised by any test.

Finding 9 — _wait_with_progress() polling loop never actually iterated (confidence: 82)
All tests use mock futures whose .done() returns truthy MagicMock on first call, so the loop exits immediately. Multi-iteration polling, time.sleep, running counts, and _get_block_status() are never exercised.

Finding 10 — test_build_systems_no_wait doesn't assert non-cleanup contract (confidence: 85)
The test verifies futures are returned but never asserts parsl.clear() was NOT called by build_systems(). The cleanup_needed = False guard could break silently.

Finding 11 — build_system() top-level routing logic not tested end-to-end (confidence: 80)
All CLI tests call _build_from_csv() and _build_from_yaml() directly, bypassing the .suffix-based routing in build_system(). A regression in the switch would not be caught.

Strengths

  • Clean DFK lifecycletry/finally with cleanup_needed flag is a sound pattern (minor double-call issue aside)
  • Lazy import pattern — consistently applied across all 4 new modules, matching the existing submitit pattern
  • Good test coverage overall — 37 tests covering config, dispatch, dry-run, error handling, dict inputs, cleanup
  • Regression test for SOL fix — exercising both resname="SOL" and resname="WAT" paths
  • Rich progress TUI — scales to large campaigns with activity log and SLURM block status
  • Pydantic config with SLURM-native field names — intuitive YAML config files for HPC users
  • All acceptance criteria met — completeness check passed all requirements

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker


Reviewed by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Review Assessment

#17 (comment)

Classifications

Finding Classification Reasoning
1. Double _shutdown_parsl() on KeyboardInterrupt genuine Verified: _wait_with_progress calls _shutdown_parsl() in its KI handler, then build_systems's finally calls it again. Second call emits spurious WARNING about dead DFK.
2. _build_from_csv modifies filesystem under --dry-run genuine Verified: directory creation + YAML writing happens unconditionally before the if dry_run: check. Violates dry-run contract.
3. zip(dirs, hashes) silent truncation genuine Verified: no length validation before zip. Silent data loss if summary YAML is corrupted or manually edited.
4. Empty executor config YAML → AttributeError genuine Verified: yaml.safe_load returns None for empty file, None.get(...) raises AttributeError with no helpful context.
5. scancel timeout silently absorbed at DEBUG nitpick Best-effort cleanup by design. Logging at WARNING would be slightly better but doesn't affect correctness.
6. _build_from_summary_yaml() happy paths untested genuine Verified: only two sys.exit guards tested; dry-run, sequential, and parallel execution branches all unexercised.
7. Empty single-build YAML → TypeError genuine Verified: BuildInput(**None) → TypeError. Same pattern as finding 4 but in different code path.
8. CSV validation error path untested genuine Verified: no test passes a malformed CSV. The error-logging loop and sys.exit are uncovered.
9. _wait_with_progress() polling loop never iterated genuine Verified: MagicMock().done() is truthy → loop breaks on first iteration. Multi-poll, sleep, running-count logic untested.
10. test_build_systems_no_wait missing non-cleanup assertion genuine Verified: parsl.clear is not mocked or asserted. The cleanup_needed = False invariant is not tested.
11. build_system() routing not tested end-to-end genuine Verified: only .txt error path tested; .csv and .yaml routing bypass the public entry point.

Action Plan

Must fix (correctness bugs):

  1. Move directory/YAML creation inside if not dry_run: in _build_from_csv (finding 2)
  2. Guard _build_from_yaml against data is None after yaml.safe_load (finding 7)
  3. Guard ExecutorConfig.from_yaml against data = None from yaml.safe_load (finding 4)
  4. Remove _shutdown_parsl() from _wait_with_progress's KeyboardInterrupt handler — let build_systems's finally own it (finding 1)
  5. Add len(dirs) != len(hashes) validation in _build_from_summary_yaml (finding 3)

Should ship (test gaps for new code):

  1. Add happy-path tests for _build_from_summary_yaml — dry-run, sequential, parallel (finding 6)
  2. Add test for malformed CSV → sys.exit in _build_from_csv (finding 8)
  3. Add multi-iteration polling test for _wait_with_progress (finding 9)
  4. Assert parsl.clear.assert_not_called() in test_build_systems_no_wait (finding 10)
  5. Add smoke tests calling build_system() directly with .csv and .yaml inputs (finding 11)

Optional:

  1. Upgrade TimeoutExpired logging from DEBUG to WARNING in _scancel_jobs (finding 5)

Assessment by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

Re-review findings addressed (findings 1–11)

Source fixes:

  • Finding 1 (double shutdown): Removed _shutdown_parsl() from _wait_with_progress's KeyboardInterrupt handler — build_systems's finally block now solely owns DFK cleanup
  • Finding 2 (dry-run side effects): Moved directory/YAML creation after the dry_run check in _build_from_csv — no filesystem mutations in dry-run mode
  • Finding 3 (zip truncation): Added len(dirs) != len(hashes) validation in _build_from_summary_yaml with clear error message
  • Finding 4 (empty config YAML): Guarded ExecutorConfig.from_yaml against None from yaml.safe_load with descriptive ValueError
  • Finding 5 (scancel timeout): Explicit subprocess.TimeoutExpired handler in _scancel_jobs now logs at WARNING with manual cancellation instructions
  • Finding 7 (empty build YAML): Added isinstance(data, dict) guard in _build_from_yaml with clear exit message

Test additions (10 new tests, 46 total):

  • Finding 6: Happy-path tests for _build_from_summary_yaml (dry-run, sequential, parallel)
  • Finding 3: Test for mismatched list lengths in summary YAML
  • Finding 7: Test for empty YAML file
  • Finding 8: Test for malformed CSV causing sys.exit
  • Finding 9: Multi-iteration polling test for _wait_with_progress (future returns False then True)
  • Finding 10: Assert parsl.clear is NOT called in wait=False mode
  • Finding 11: End-to-end routing smoke tests calling build_system() directly with .csv and .yaml

All 46 orchestration tests pass. Ruff check + format clean.

Commit: 9053796


Progress tracked by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

API simplification: unified build_systems() with dry_run parameter

Removed build_systems_dry_run() — the standalone function is gone. Dry-run is now a parameter on build_systems():

# Before (two functions)
build_systems_dry_run(models, config, output_dir=output)
build_systems(models, config, output_dir=output)

# After (one function)
build_systems(models, config, output_dir=output, dry_run=True)
build_systems(models, config, output_dir=output)

How it works: build_systems() resolves all inputs upfront, then if dry_run=True it logs the plan and returns description dicts before importing or loading Parsl. No Parsl dependency needed for dry-run.

CLI simplified: Each helper (_build_from_csv, _build_from_yaml, _build_from_summary_yaml) now has one dispatch branch (if config is not None or dry_run:) instead of two separate branches. Net -42 lines.

All 46 tests pass.

Commit: 91dff07


Progress tracked by mach6

@maxscheurer

Copy link
Copy Markdown
Collaborator Author

Progress Update

Summary YAML generation + dry-run API simplification

Summary YAML from CSV builds:
mdfactory build input.csv now generates output/{csv_stem}.yaml with the same format as prepare-build:

n_systems: 10
input: /path/to/input.csv
output: /path/to/output
hash: [hash1, hash2, ...]
simulation_type: [mixedbox, bilayer, ...]
system_directory: [/path/to/output/hash1, ...]
date: 2026-06-08 ...

This means mdfactory build input.csv is now a full replacement for the prepare-build + build two-step workflow — it prepares directories, writes per-system YAMLs, generates the summary, and builds.

Dry-run API simplified:

  • Removed standalone build_systems_dry_run() function
  • build_systems() now accepts dry_run=True — resolves inputs, logs the plan, and returns descriptions before touching Parsl
  • CLI helpers simplified from two dispatch branches to one (if config is not None or dry_run:)
  • Net -42 lines across the two commits

48 tests pass.

Commits: 91dff07, 1b2f3be


Progress tracked by mach6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate Parsl as Python-native workflow manager for build/simulate orchestration

1 participant