feat: add Parsl-based parallel build orchestration by maxscheurer · Pull Request #17 · emdgroup/mdfactory

maxscheurer · 2026-06-03T11:04:45Z

Closes #16 (PR 1 of 4)

Add Parsl as a Python-native workflow manager for parallel build orchestration on SLURM GPU nodes. Extends mdfactory build to accept CSV files directly and dispatch builds in parallel via Parsl's @python_app + HighThroughputExecutor.

Implementation plan posted as a comment below.

maxscheurer · 2026-06-03T11:05:41Z

Implementation Plan

Problem

Build orchestration requires Nextflow (build.nf) to parallelize mdfactory build on SLURM GPU nodes. This is the only non-Python tool in the critical build path. We need a Python-native replacement using Parsl that integrates directly into the existing mdfactory build command.

Approach

Use @python_app wrapping run_build_from_dict() — pass a plain dict (from BuildInput.model_dump()), let the Parsl worker do all heavy imports (OpenMM, MDAnalysis, OpenFF) internally. No serialization of scientific objects across processes. No @bash_app / shell escaping needed.

Extend mdfactory build to handle CSV input directly — absorb the prepare→build two-step into one command when running parallel builds. Three input modes:

Input	Behavior
Single YAML	Build one system locally (current behavior, unchanged)
CSV file	Create `BuildInput` models + per-hash dirs, dispatch all builds (parallel with `--config`, sequential without)
Summary YAML or directory of YAMLs	Dispatch existing prepared builds

Deliverables

New module: `mdfactory/orchestration/`

File	Purpose
`__init__.py`	Public API: `ExecutorConfig`, `SlurmExecutorConfig`, `build_systems`, `build_systems_dry_run`
`config.py`	Pydantic models for executor configuration → `parsl.Config` objects
`apps.py`	`@python_app` definition: `build_system_app(build_input_dict)` wrapping `run_build_from_dict`
`build.py`	`build_systems()` — loads Parsl, submits parallel builds, returns futures. `build_systems_dry_run()` — prints what would run without submitting

New test files

File	Coverage
`mdfactory/tests/test_orchestration_config.py`	Config model construction, validation, `to_parsl_config()` with mocked Parsl
`mdfactory/tests/test_orchestration_build.py`	Build dispatch with mocked Parsl, dry-run output, CSV/YAML/directory input handling, error paths

Modified files

File	Change
`pyproject.toml`	Add `parsl = ["parsl>=2024.1.1"]` to `[project.optional-dependencies]`
`mdfactory/cli.py`	Extend `build` command: accept CSV input, add `--config` (executor YAML) and `--dry-run` flags. Reuse `prepare.py` internals for CSV→BuildInput conversion

Design Details

config.py — Executor configuration:

class ExecutorConfig(BaseModel):
    """Base executor configuration."""
    provider: Literal["local", "slurm"] = "local"
    worker_init: str = ""         # module loads, conda/pixi activation
    working_directory: Path | None = None

class SlurmExecutorConfig(ExecutorConfig):
    """SLURM-specific executor configuration."""
    provider: Literal["slurm"] = "slurm"
    account: str                  # required for SLURM
    partition: str = "gpu"
    walltime: str = "2:00:00"
    nodes_per_block: int = 1
    cores_per_node: int = 12
    gpus_per_node: int = 1
    scheduler_options: str = ""   # extra #SBATCH lines

    def to_parsl_config(self) -> "parsl.Config":
        """Build a Parsl Config with HighThroughputExecutor + SlurmProvider."""
        ...

Follows the SlurmConfig dataclass pattern from analysis/submit.py.
Config can be loaded from a YAML file for reuse across campaigns.

apps.py — Parsl app:

@python_app
def build_system_app(build_input_dict: dict):
    """Run a single system build inside a Parsl worker.

    All heavy imports (OpenMM, MDAnalysis, OpenFF) happen inside the
    worker process — only the plain dict crosses the serialization boundary.
    """
    from mdfactory.workflows import run_build_from_dict
    return run_build_from_dict(build_input_dict)

build.py — Orchestration:

build_systems(build_inputs, config, *, wait=True) — accepts list of BuildInput or dicts, calls .model_dump() on each, submits via build_system_app, optionally waits for all futures
build_systems_dry_run(build_inputs, config) — prints each system's hash + simulation type + config summary without loading Parsl
Conditional import parsl inside function bodies (same pattern as submitit in submit.py)

CLI changes (cli.py):

The existing build command currently accepts a single YAML path. Extend it to:

mdfactory build <input>                     # single YAML → local build (unchanged)
mdfactory build <input.csv>                 # CSV → sequential local builds
mdfactory build <input.csv> --config e.yaml # CSV → parallel Parsl builds
mdfactory build <input.csv> --dry-run       # CSV → print what would run
mdfactory build <summary.yaml> --config ... # summary YAML → parallel builds

Input type detection by file extension (.csv vs .yaml/.yml). For CSV input, reuse df_to_build_input_models() from prepare.py and directory creation logic from prepare_build.

Testing Approach

test_orchestration_config.py:

ExecutorConfig and SlurmExecutorConfig construction with defaults
Validation: missing account for SLURM raises error
to_parsl_config() with Parsl mocked via monkeypatch.setitem(sys.modules, "parsl", ...) — verify correct provider/executor types
Config round-trip: write YAML → load → compare
worker_init string propagation into config

test_orchestration_build.py:

Mock Parsl with DummyDataFlowKernel capturing submitted app calls
build_systems() submits correct number of tasks
Each submitted task receives the correct build_input_dict
build_systems_dry_run() returns descriptions without loading Parsl
CSV input: creates directories + dispatches builds
Error handling: missing files, invalid BuildInput, failed futures
Import guard: mdfactory.orchestration importable without parsl installed

Acceptance Criteria

Risks

Worker environment — module load + pixi activation may behave differently inside Parsl workers. First HPC test will reveal this; worker_init config provides the escape hatch.
Parsl version stability — Config API occasionally changes between major versions. Pin >=2024.1.1.
Shared filesystem assumption — design assumes build outputs land on shared FS (standard for HPC). No file staging.
prepare-build overlap — the CSV handling in build partially duplicates prepare-build. We should refactor shared logic into prepare.py helpers but keep prepare-build as a standalone command for users who want the two-step flow.

Plan created by mach6

- New mdfactory/orchestration/ module (config, apps, build) - ExecutorConfig/SlurmExecutorConfig with SLURM-native field names - build_systems() dispatches parallel builds via @python_app - build_systems_dry_run() previews without loading Parsl - Rich live progress display with activity log - Ctrl+C triggers parsl.clear() to cancel SLURM jobs - Extended CLI: mdfactory build accepts CSV/YAML with --config/--dry-run - Added parsl to optional-dependencies and pixi dev env - 19 tests covering config, build dispatch, and error handling

maxscheurer · 2026-06-03T13:18:57Z

Progress Update

Implementation complete

New module mdfactory/orchestration/:

config.py — ExecutorConfig and SlurmExecutorConfig (Pydantic models with SLURM-native field names: gres, mem, qos, constraint, max_blocks)
apps.py — @python_app wrapping run_build_from_dict (decoration at runtime, not import time)
build.py — build_systems() for parallel dispatch with Rich live progress TUI, build_systems_dry_run() for preview
__init__.py — public API exports

CLI extension (cli.py):

mdfactory build now accepts CSV, single YAML, or summary YAML input
--config flag for parallel Parsl builds (local or SLURM)
--dry-run flag to preview without executing

Tests (19 passing):

test_orchestration_config.py — config defaults, YAML load/roundtrip, to_parsl_config(), scheduler options assembly
test_orchestration_build.py — dry-run, submission, error handling, no-wait mode, import guard

Infrastructure:

parsl>=2024.1.1 added to [project.optional-dependencies] and pixi dev env
Rich live progress display with activity log (scales to any number of builds)
Ctrl+C handling: parsl.clear() cancels SLURM jobs
min_blocks=0 ensures idle SLURM jobs are released

Commit: e9925d5

Progress tracked by mach6

- _shutdown_parsl() extracts SLURM job IDs from DFK and runs scancel - Ctrl+C now reliably cancels SLURM jobs instead of relying on parsl.clear() - Error messages no longer truncated in activity log - Increased activity log to 12 entries

When species.resname is not 'SOL' (e.g. 'WAT'), Interchange would generate atom types with the wrong prefix (WAT_0 instead of SOL_0), causing a KeyError in OpenMM's GromacsTopFile parser. Force molecule.name='SOL' before Interchange export so atom types are always consistent (SOL_0, SOL_1, SOL_2). Also invalidate stale SOL_params.itp when the ITP is regenerated.

maxscheurer · 2026-06-04T16:06:35Z

Progress Update

Bug fix: Force molecule.name = "SOL" for water SMIRNOFF parametrization to prevent atom type naming inconsistency.

Problem: When species.resname is "WAT" (or anything other than "SOL"), Interchange generated atom types with that prefix (WAT_0, WAT_1, WAT_2) in the params file, while the molecule ITP expected SOL_0 types. This caused KeyError: 'SOL_0' when OpenMM tried to load the GROMACS topology.

Fix: Override molecule.name = "SOL" before Interchange export so the cached water parameters are always internally consistent. Also delete stale SOL_params.itp when the ITP is regenerated to prevent cache inconsistencies.

Commit: 93b8491

Progress tracked by mach6

maxscheurer · 2026-06-08T06:43:18Z

Code Review

Critical

No critical findings.

Important

Finding 1 — Parsl DFK leaks on exceptions after parsl.load() (confidence: 92)
mdfactory/orchestration/build.py — build_systems() calls parsl.load() at the top but only calls _shutdown_parsl() at the bottom with no try/finally. If any exception occurs during the submission loop (e.g., ValidationError from a malformed BuildInput dict) or during _wait_with_progress (anything other than KeyboardInterrupt), the DFK stays loaded, SLURM jobs keep running, and the next call to build_systems() crashes with Parsl's "DFK already loaded" RuntimeError. Fix: wrap the post-parsl.load() body in try/finally: _shutdown_parsl().

Finding 2 — Top-level parsl imports break the optional-dependency pattern (confidence: 90)
orchestration/apps.py, orchestration/build.py, and orchestration/config.py all import parsl at module level. Since orchestration/__init__.py re-exports from these modules, any import from mdfactory.orchestration triggers ImportError when parsl isn't installed — even for code paths that never use Parsl (like --dry-run). The PR description states these follow the conditional-import submitit pattern from analysis/submit.py, but the actual implementation doesn't. Fix: move all parsl imports inside the functions that use them (to_parsl_config(), get_build_app(), build_systems()).

Finding 3 — _shutdown_parsl() swallows all cleanup errors at DEBUG level (confidence: 88)
mdfactory/orchestration/build.py — If parsl.clear() raises (e.g., timeout waiting for workers), the DFK may remain in an undefined state, scancel is never called (SLURM jobs keep running), and the user sees nothing because the error is logged at logger.debug. The "non-fatal" comment is only accurate if the DFK was actually cleared. Fix: log at WARNING/ERROR when parsl.clear() fails, and attempt scancel unconditionally in its own try/except block.

Finding 4 — CLI build command extension is completely untested (confidence: 100)
mdfactory/cli.py — The PR's primary user-facing change (build_system, _build_from_csv, _build_from_yaml, _build_from_summary_yaml) has 0% test coverage. Routing bugs in the suffix dispatch, --dry-run without --config, --config with single YAML, unsupported extensions, and summary YAML error handling would all be invisible.

Finding 5 — _build_system_impl (Parsl worker function) is not tested (confidence: 95)
mdfactory/orchestration/apps.py — The actual function that runs inside Parsl workers is never exercised. Untested behaviors: stripping _-prefixed keys from the input dict, directory creation, os.chdir/restore in the finally block, and the return dict shape. This can be unit-tested by mocking run_build_from_dict and calling _build_system_impl directly.

Suggestions

Finding 6 — wait=False path loads Parsl with no cleanup contract (confidence: 85)
mdfactory/orchestration/build.py — When wait=False, parsl.load() is called but neither parsl.clear() nor _shutdown_parsl() is ever called. The caller receives raw AppFuture objects with no documented cleanup contract. Any subsequent call to build_systems() will crash. Consider removing wait=False (no current caller) or documenting the cleanup requirement.

Finding 7 — Inconsistent output directory for single-YAML builds (confidence: 85)
mdfactory/cli.py — mdfactory build system.yaml --output /builds/ puts artifacts in /builds/, but adding --config executor.yaml moves them to /builds/{hash}/. This is a silent behavioral change. The CSV path is consistent (both modes use output/{hash}). Fix: align the sequential single-YAML path to also use output/{hash}, or document the difference.

Finding 8 — Test files fail collection without parsl installed (confidence: 92)
mdfactory/tests/test_orchestration_*.py — Both test files unconditionally import from mdfactory.orchestration.config which imports parsl at module level. Tests can't even be collected without parsl installed, violating the acceptance criterion "Tests pass in CI without parsl installed." Fix: add pytest.importorskip("parsl") or a skipif marker.

Finding 9 — SLURM cleanup and KeyboardInterrupt path never exercised (confidence: 90)
mdfactory/orchestration/build.py — _get_slurm_job_ids, _scancel_jobs, and the KeyboardInterrupt branch in _wait_with_progress have zero effective test coverage. These are the most important paths for production use (users interrupting long SLURM runs).

Finding 10 — dict input path in build_systems / build_systems_dry_run untested (confidence: 90)
Both functions accept list[BuildInput | dict] but every test passes mock objects. The dict branch (creating BuildInput(**inp), copying the dict, injecting _build_dir) is never exercised.

Finding 11 — ExecutorConfig.from_yaml error paths untested (confidence: 85)
Missing file, invalid YAML, unknown provider value — all produce raw Python tracebacks rather than user-friendly messages. No tests cover these failure modes.

Finding 12 — parametrize.py bug fix has no regression test (confidence: 85)
The SOL molecule name fix and stale params ITP deletion are untested. All existing tests use resname="SOL", so the fix never triggers. A test with resname="WAT" would catch regressions.

Finding 13 — _get_slurm_job_ids() bare except silently skips SLURM cancellation (confidence: 80)
mdfactory/orchestration/build.py — If Parsl's executor attribute structure changes, _get_slurm_job_ids silently returns [] and SLURM jobs are left running with zero user notification. Fix: log the exception at DEBUG level.

Strengths

Clean architecture: the orchestration/ module is well-structured with clear separation of concerns (config, apps, build logic)
Rich live progress TUI: the progress display with activity log is a great UX addition for long-running parallel builds
Ctrl+C handling: explicit scancel on interrupt shows attention to the HPC user experience
min_blocks=0: ensures idle SLURM allocations are released — important for shared cluster etiquette
Runtime decoration pattern: get_build_app() decorating at call time rather than import time is a sound design choice
Pydantic config models: SlurmExecutorConfig with SLURM-native field names makes config files intuitive for HPC users

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker

Reviewed by mach6

maxscheurer · 2026-06-08T06:46:46Z

Review Assessment

#17 (comment)

Classifications

Finding	Classification	Reasoning
1. DFK leaks on exceptions after `parsl.load()`	genuine	Confirmed: no `try/finally` around submission loop or `_wait_with_progress`. Any exception leaves DFK loaded and SLURM jobs running.
2. Top-level parsl imports break optional-dependency	genuine	Confirmed: `apps.py`, `build.py`, `config.py` all import parsl at module level. parsl is optional in pyproject.toml. ImportError for users without parsl extra.
3. `_shutdown_parsl()` swallows errors at DEBUG level	genuine	Confirmed: single `except Exception` around `parsl.clear()` + `_scancel_jobs()`. If `clear()` fails, scancel never runs and user sees nothing.
4. CLI build command extension untested	genuine	Confirmed: `build_system()`, `_build_from_csv()`, `_build_from_yaml()`, `_build_from_summary_yaml()` — all new code, zero test coverage.
5. `_build_system_impl` worker function untested	genuine	Confirmed: the most critical execution path (runs inside Parsl workers) is never exercised in tests.
6. `wait=False` path has no cleanup contract	genuine (low)	Confirmed: DFK leaks by design. No current caller, but the public API is broken-by-design.
7. Inconsistent output directory for single-YAML	genuine	Confirmed: without `--config` artifacts land in `output/`, with `--config` they go to `output/{hash}/`. Behavioral surprise.
8. Test files fail collection without parsl	genuine	Confirmed: module-level imports from `orchestration.config` which imports parsl at top. Tests can't be collected in base env.
9. SLURM cleanup/KeyboardInterrupt untested	genuine	Confirmed: `_get_slurm_job_ids`, `_scancel_jobs`, KeyboardInterrupt branch — all new code, zero coverage.
10. dict input path untested	genuine	Confirmed: `isinstance(inp, dict)` branch in both `build_systems` and `build_systems_dry_run` never exercised.
11. `from_yaml` error paths untested	genuine (low)	Confirmed: missing file, invalid YAML, unknown provider produce raw tracebacks. Low impact since errors do surface.
12. parametrize.py bug fix has no regression test	genuine	Confirmed: `molecule.name = "SOL"` fix never triggers in existing tests (all use `resname="SOL"` already).
13. `_get_slurm_job_ids()` bare except	nitpick	Best-effort cleanup introspecting internal Parsl API. Silent pass is defensible. Finding 3 covers the more serious version.

Action Plan

Must fix before merge:

Wrap build_systems() post-parsl.load() body in try/finally: _shutdown_parsl() — DFK leak leaves SLURM jobs running on any exception
Move parsl imports inside functions that use them (to_parsl_config(), get_build_app(), build_systems()) — parsl is optional, top-level imports break users without it
Fix test collection — add pytest.importorskip("parsl") at top of both test files
Raise log level in _shutdown_parsl() from DEBUG to WARNING and attempt scancel unconditionally

Should ship with this PR:

Add CLI integration tests for build_system() dispatch paths (CSV, YAML, summary YAML, dry-run, unsupported extension)
Add unit test for _build_system_impl (mock run_build_from_dict, verify dict stripping, chdir, return shape)
Add tests for KeyboardInterrupt path and SLURM cleanup functions
Add dict-input tests for build_systems() and build_systems_dry_run()
Add regression test for SOL molecule name fix (resname="WAT" input)
Normalize output directory: single-YAML + --config should use output/{hash}/ for consistency

Can follow up:

Document or remove wait=False (no current caller, architectural concern)
Add from_yaml error-path tests (low impact, errors do surface)

Assessment by mach6

- Move parsl imports inside functions (optional-dependency pattern) - Wrap build_systems() in try/finally for DFK cleanup on exceptions - Split _shutdown_parsl() into independent blocks, log at WARNING - Normalize single-YAML output directory to output/{hash}/ - Add pytest.importorskip guard to orchestration test files - Add tests: _build_system_impl, SLURM cleanup, KeyboardInterrupt, dict input path, CLI integration (9 tests), parametrize regression - Add PLC0415 to ruff ignore (lazy imports intentional throughout)

maxscheurer · 2026-06-08T07:07:49Z

Progress Update

Review findings addressed (findings 1–10, 12)

Source fixes:

Finding 1 (DFK lifecycle): Wrapped build_systems() post-parsl.load() body in try/finally with a cleanup_needed flag — ensures _shutdown_parsl() runs on any exception while preserving wait=False semantics
Finding 2 (lazy imports): Moved all parsl imports inside the functions that use them (to_parsl_config(), get_build_app(), build_systems(), _shutdown_parsl()), following the submitit pattern from analysis/submit.py
Finding 3 (cleanup logging): Split _shutdown_parsl() into independent try/except blocks — parsl.clear() failures now log at WARNING level, and scancel runs unconditionally
Finding 6 (wait=False): Added docstring clarification that caller must call parsl.clear(), plus a logger.warning() on the code path
Finding 7 (output dir): Single-YAML sequential builds now use output/{hash}/ for consistency with CSV/Parsl paths

Test additions (18 new tests):

Finding 4: New test_orchestration_cli.py with 9 integration tests (CSV, YAML, summary YAML dispatch, dry-run, errors)
Finding 5: Tests for _build_system_impl (key stripping, chdir, cwd restore on failure)
Finding 8: Added pytest.importorskip("parsl") to both orchestration test files
Finding 9: Tests for _shutdown_parsl, _get_slurm_job_ids, KeyboardInterrupt handling
Finding 10: Tests for dict input path in build_systems and build_systems_dry_run
Finding 12: Regression test for SOL molecule name fix with resname="WAT"

Infrastructure:

Added PLC0415 to ruff ignore list (lazy imports are intentional throughout the codebase)

All 37 new/modified tests pass.

Commit: 85694e9

Progress tracked by mach6

maxscheurer · 2026-06-08T07:15:16Z

Code Review (Re-review after fixes)

Critical

No critical findings.

Important

Finding 1 — Double _shutdown_parsl() on KeyboardInterrupt emits spurious WARNING (confidence: 92)
mdfactory/orchestration/build.py — When Ctrl+C occurs during _wait_with_progress, the function's except KeyboardInterrupt calls _shutdown_parsl() and re-raises. The exception then propagates to build_systems's finally block, which calls _shutdown_parsl() again. The second call finds an already-cleared DFK and emits WARNING: Parsl shutdown failed — DFK may still be loaded. This confuses operators during an already-stressful interrupt. Fix: remove _shutdown_parsl() from inside _wait_with_progress's handler — the finally clause in build_systems already owns the lifecycle.

Finding 2 — _build_from_csv modifies filesystem even under --dry-run (confidence: 83)
mdfactory/cli.py — The directory creation and YAML file writing happen unconditionally, before the dry_run guard check. A user running mdfactory build input.csv --dry-run will have output/{hash}/ directories and {hash}.yaml files written to disk. This violates the --dry-run contract (no side effects) and is inconsistent with _build_from_yaml, which does not write files in dry-run mode. Fix: guard the filesystem writes with if not dry_run:.

Finding 3 — zip(dirs, hashes) in _build_from_summary_yaml silently truncates on length mismatch (confidence: 85)
mdfactory/cli.py — If the summary YAML has system_directory and hash lists of different lengths (corrupted or manually edited), zip silently truncates to the shorter list. Systems are silently dropped with no warning. Fix: assert equal lengths before zipping.

Finding 4 — Empty/null executor config YAML crashes with opaque AttributeError (confidence: 88)
mdfactory/orchestration/config.py — yaml.safe_load returns None for an empty file. data.get("provider", "local") then raises AttributeError: 'NoneType' object has no attribute 'get' with no indication the file is empty/invalid. Fix: guard with if not isinstance(data, dict): raise ValueError(...).

Finding 5 — scancel timeout silently absorbed at DEBUG level (confidence: 85)
mdfactory/orchestration/build.py — On an overloaded SLURM head node, scancel can hang >10s. subprocess.TimeoutExpired lands in the bare except Exception and is logged only at DEBUG. The user gets no indication that GPU jobs are still running. Fix: catch TimeoutExpired explicitly and log at WARNING with the job IDs for manual cancellation.

Finding 6 — _build_from_summary_yaml() happy paths completely untested (confidence: 95)
mdfactory/tests/test_orchestration_cli.py — Only the two sys.exit(1) guard paths are tested. The dry-run, sequential, and parallel execution branches are unexercised. This is the path used when mdfactory build is given prepare-build output — a primary production workflow.

Suggestions

Finding 7 — Empty single-build YAML produces opaque TypeError (confidence: 80)
mdfactory/cli.py — An empty YAML file returns None from safe_load. The code falls into the single-build branch and calls BuildInput(**None) → TypeError. Fix: add an isinstance(data, dict) guard after loading.

Finding 8 — _build_from_csv() CSV validation error path untested (confidence: 90)
When df_models_from_input_csv() returns errors, the function logs them and exits. This path is never exercised by any test.

Finding 9 — _wait_with_progress() polling loop never actually iterated (confidence: 82)
All tests use mock futures whose .done() returns truthy MagicMock on first call, so the loop exits immediately. Multi-iteration polling, time.sleep, running counts, and _get_block_status() are never exercised.

Finding 10 — test_build_systems_no_wait doesn't assert non-cleanup contract (confidence: 85)
The test verifies futures are returned but never asserts parsl.clear() was NOT called by build_systems(). The cleanup_needed = False guard could break silently.

Finding 11 — build_system() top-level routing logic not tested end-to-end (confidence: 80)
All CLI tests call _build_from_csv() and _build_from_yaml() directly, bypassing the .suffix-based routing in build_system(). A regression in the switch would not be caught.

Strengths

Clean DFK lifecycle — try/finally with cleanup_needed flag is a sound pattern (minor double-call issue aside)
Lazy import pattern — consistently applied across all 4 new modules, matching the existing submitit pattern
Good test coverage overall — 37 tests covering config, dispatch, dry-run, error handling, dict inputs, cleanup
Regression test for SOL fix — exercising both resname="SOL" and resname="WAT" paths
Rich progress TUI — scales to large campaigns with activity log and SLURM block status
Pydantic config with SLURM-native field names — intuitive YAML config files for HPC users
All acceptance criteria met — completeness check passed all requirements

Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker

Reviewed by mach6

maxscheurer · 2026-06-08T07:18:43Z

Review Assessment

#17 (comment)

Classifications

Finding	Classification	Reasoning
1. Double `_shutdown_parsl()` on KeyboardInterrupt	genuine	Verified: `_wait_with_progress` calls `_shutdown_parsl()` in its KI handler, then `build_systems`'s `finally` calls it again. Second call emits spurious WARNING about dead DFK.
2. `_build_from_csv` modifies filesystem under `--dry-run`	genuine	Verified: directory creation + YAML writing happens unconditionally before the `if dry_run:` check. Violates dry-run contract.
3. `zip(dirs, hashes)` silent truncation	genuine	Verified: no length validation before `zip`. Silent data loss if summary YAML is corrupted or manually edited.
4. Empty executor config YAML → AttributeError	genuine	Verified: `yaml.safe_load` returns None for empty file, `None.get(...)` raises AttributeError with no helpful context.
5. scancel timeout silently absorbed at DEBUG	nitpick	Best-effort cleanup by design. Logging at WARNING would be slightly better but doesn't affect correctness.
6. `_build_from_summary_yaml()` happy paths untested	genuine	Verified: only two `sys.exit` guards tested; dry-run, sequential, and parallel execution branches all unexercised.
7. Empty single-build YAML → TypeError	genuine	Verified: `BuildInput(**None)` → TypeError. Same pattern as finding 4 but in different code path.
8. CSV validation error path untested	genuine	Verified: no test passes a malformed CSV. The error-logging loop and sys.exit are uncovered.
9. `_wait_with_progress()` polling loop never iterated	genuine	Verified: MagicMock().done() is truthy → loop breaks on first iteration. Multi-poll, sleep, running-count logic untested.
10. `test_build_systems_no_wait` missing non-cleanup assertion	genuine	Verified: `parsl.clear` is not mocked or asserted. The `cleanup_needed = False` invariant is not tested.
11. `build_system()` routing not tested end-to-end	genuine	Verified: only `.txt` error path tested; `.csv` and `.yaml` routing bypass the public entry point.

Action Plan

Must fix (correctness bugs):

Move directory/YAML creation inside if not dry_run: in _build_from_csv (finding 2)
Guard _build_from_yaml against data is None after yaml.safe_load (finding 7)
Guard ExecutorConfig.from_yaml against data = None from yaml.safe_load (finding 4)
Remove _shutdown_parsl() from _wait_with_progress's KeyboardInterrupt handler — let build_systems's finally own it (finding 1)
Add len(dirs) != len(hashes) validation in _build_from_summary_yaml (finding 3)

Should ship (test gaps for new code):

Add happy-path tests for _build_from_summary_yaml — dry-run, sequential, parallel (finding 6)
Add test for malformed CSV → sys.exit in _build_from_csv (finding 8)
Add multi-iteration polling test for _wait_with_progress (finding 9)
Assert parsl.clear.assert_not_called() in test_build_systems_no_wait (finding 10)
Add smoke tests calling build_system() directly with .csv and .yaml inputs (finding 11)

Optional:

Upgrade TimeoutExpired logging from DEBUG to WARNING in _scancel_jobs (finding 5)

Assessment by mach6

maxscheurer · 2026-06-08T07:27:59Z

Progress Update

Re-review findings addressed (findings 1–11)

Source fixes:

Finding 1 (double shutdown): Removed _shutdown_parsl() from _wait_with_progress's KeyboardInterrupt handler — build_systems's finally block now solely owns DFK cleanup
Finding 2 (dry-run side effects): Moved directory/YAML creation after the dry_run check in _build_from_csv — no filesystem mutations in dry-run mode
Finding 3 (zip truncation): Added len(dirs) != len(hashes) validation in _build_from_summary_yaml with clear error message
Finding 4 (empty config YAML): Guarded ExecutorConfig.from_yaml against None from yaml.safe_load with descriptive ValueError
Finding 5 (scancel timeout): Explicit subprocess.TimeoutExpired handler in _scancel_jobs now logs at WARNING with manual cancellation instructions
Finding 7 (empty build YAML): Added isinstance(data, dict) guard in _build_from_yaml with clear exit message

Test additions (10 new tests, 46 total):

Finding 6: Happy-path tests for _build_from_summary_yaml (dry-run, sequential, parallel)
Finding 3: Test for mismatched list lengths in summary YAML
Finding 7: Test for empty YAML file
Finding 8: Test for malformed CSV causing sys.exit
Finding 9: Multi-iteration polling test for _wait_with_progress (future returns False then True)
Finding 10: Assert parsl.clear is NOT called in wait=False mode
Finding 11: End-to-end routing smoke tests calling build_system() directly with .csv and .yaml

All 46 orchestration tests pass. Ruff check + format clean.

Commit: 9053796

Progress tracked by mach6

…_dry_run

maxscheurer · 2026-06-08T07:46:47Z

Progress Update

API simplification: unified `build_systems()` with `dry_run` parameter

Removed build_systems_dry_run() — the standalone function is gone. Dry-run is now a parameter on build_systems():

# Before (two functions)
build_systems_dry_run(models, config, output_dir=output)
build_systems(models, config, output_dir=output)

# After (one function)
build_systems(models, config, output_dir=output, dry_run=True)
build_systems(models, config, output_dir=output)

How it works: build_systems() resolves all inputs upfront, then if dry_run=True it logs the plan and returns description dicts before importing or loading Parsl. No Parsl dependency needed for dry-run.

CLI simplified: Each helper (_build_from_csv, _build_from_yaml, _build_from_summary_yaml) now has one dispatch branch (if config is not None or dry_run:) instead of two separate branches. Net -42 lines.

All 46 tests pass.

Commit: 91dff07

Progress tracked by mach6

…uild output)

maxscheurer · 2026-06-08T09:24:07Z

Progress Update

Summary YAML generation + dry-run API simplification

Summary YAML from CSV builds:
mdfactory build input.csv now generates output/{csv_stem}.yaml with the same format as prepare-build:

n_systems: 10
input: /path/to/input.csv
output: /path/to/output
hash: [hash1, hash2, ...]
simulation_type: [mixedbox, bilayer, ...]
system_directory: [/path/to/output/hash1, ...]
date: 2026-06-08 ...

This means mdfactory build input.csv is now a full replacement for the prepare-build + build two-step workflow — it prepares directories, writes per-system YAMLs, generates the summary, and builds.

Dry-run API simplified:

Removed standalone build_systems_dry_run() function
build_systems() now accepts dry_run=True — resolves inputs, logs the plan, and returns descriptions before touching Parsl
CLI helpers simplified from two dispatch branches to one (if config is not None or dry_run:)
Net -42 lines across the two commits

48 tests pass.

Commits: 91dff07, 1b2f3be

Progress tracked by mach6

chore: open PR for issue 16 (PR 1 — Parsl build orchestration)

39b881e

maxscheurer added 4 commits June 3, 2026 15:23

feat: show SLURM job status (running/pending) in build TUI

2e29ce9

style: add icon to SLURM status line for consistent indent

3f12d42

maxscheurer marked this pull request as ready for review June 8, 2026 06:35

fix: address re-review findings — input validation, dry-run guard, tests

9053796

refactor: collapse dry-run into build_systems(), remove build_systems…

91dff07

…_dry_run

feat: generate summary YAML when building from CSV (matches prepare-b…

1b2f3be

…uild output)

gregorweiss mentioned this pull request Jun 11, 2026

Extract BaseSlurmConfig to unify SLURM resource config across submitit and Parsl backends #20

Open

maxscheurer mentioned this pull request Jun 11, 2026

HPC Infrastructure Mega-Issue #21

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Parsl-based parallel build orchestration#17

feat: add Parsl-based parallel build orchestration#17
maxscheurer wants to merge 10 commits into
mainfrom
feature/issue-16-parsl-build-orchestration

maxscheurer commented Jun 3, 2026

Uh oh!

maxscheurer commented Jun 3, 2026

Uh oh!

maxscheurer commented Jun 3, 2026

Uh oh!

maxscheurer commented Jun 4, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

maxscheurer commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maxscheurer commented Jun 3, 2026

Uh oh!

maxscheurer commented Jun 3, 2026

Implementation Plan

Problem

Approach

Deliverables

New module: mdfactory/orchestration/

New test files

Modified files

Design Details

Testing Approach

Acceptance Criteria

Risks

Uh oh!

maxscheurer commented Jun 3, 2026

Progress Update

Implementation complete

Uh oh!

maxscheurer commented Jun 4, 2026

Progress Update

Uh oh!

maxscheurer commented Jun 8, 2026

Code Review

Critical

Important

Suggestions

Strengths

Uh oh!

maxscheurer commented Jun 8, 2026

Review Assessment

Classifications

Action Plan

Uh oh!

maxscheurer commented Jun 8, 2026

Progress Update

Review findings addressed (findings 1–10, 12)

Uh oh!

maxscheurer commented Jun 8, 2026

Code Review (Re-review after fixes)

Critical

Important

Suggestions

Strengths

Uh oh!

maxscheurer commented Jun 8, 2026

Review Assessment

Classifications

Action Plan

Uh oh!

maxscheurer commented Jun 8, 2026

Progress Update

Re-review findings addressed (findings 1–11)

Uh oh!

maxscheurer commented Jun 8, 2026

Progress Update

API simplification: unified build_systems() with dry_run parameter

Uh oh!

maxscheurer commented Jun 8, 2026

Progress Update

Summary YAML generation + dry-run API simplification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

New module: `mdfactory/orchestration/`

API simplification: unified `build_systems()` with `dry_run` parameter