feat: add slurm cluster autodiscovery by gregorweiss · Pull Request #11 · emdgroup/mdfactory

gregorweiss · 2026-05-18T19:05:30Z

Adds mdfactory/performance/cluster.py — queries sinfo and sacctmgr to discover cluster resources automatically. This is the foundation for all planned HPC features (benchmarking, node packing, GPU sharing).

What it does

discover_cluster() → returns partitions, node specs (CPUs, memory, GPUs), accounts, and QOS policies as frozen dataclasses
select_partition(cluster, needs_gpu=True, min_cpus=64) → picks the best partition for given requirements
Returns None gracefully on non-SLURM machines
Results cached per session (topology doesn't change mid-run)
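To make the behaviour listed above concrete, here is a minimal sketch of discovery, caching, and partition selection. It is not the code in this PR: the `Partition` record, the `parse_sinfo` helper, the `sinfo -h -o "%P|%c|%G"` query, and the "smallest partition that fits" policy are all simplifying assumptions made for illustration.

```python
from __future__ import annotations

import re
import shutil
import subprocess
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Partition:
    """Simplified, hypothetical partition record (not the PR's actual model)."""

    name: str
    cpus_per_node: int
    gpus_per_node: int = 0


@dataclass(frozen=True)
class ClusterInfo:
    partitions: tuple[Partition, ...]


def parse_sinfo(output: str) -> ClusterInfo:
    """Parse lines shaped like 'gpu|64|gpu:a100:4' (partition|CPUs|GRES)."""
    partitions = []
    for line in output.strip().splitlines():
        name, cpus, gres = line.split("|")
        # Accept 'gpu:4' as well as 'gpu:a100:4'; anything else counts as no GPU.
        match = re.search(r"gpu:(?:[^:,(]+:)?(\d+)", gres)
        gpus = int(match.group(1)) if match else 0
        # sinfo marks the default partition with a trailing '*'.
        partitions.append(Partition(name.rstrip("*"), int(cpus.rstrip("+")), gpus))
    return ClusterInfo(tuple(partitions))


@lru_cache(maxsize=1)  # topology is assumed stable for the session
def discover_cluster() -> ClusterInfo | None:
    """Return cluster info, or None when SLURM is unavailable."""
    if shutil.which("sinfo") is None:
        return None
    try:
        result = subprocess.run(
            ["sinfo", "-h", "-o", "%P|%c|%G"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_sinfo(result.stdout) if result.returncode == 0 else None


def select_partition(cluster, *, needs_gpu=False, min_cpus=1):
    """Pick the smallest partition meeting the requirements (assumed policy)."""
    candidates = [
        p for p in cluster.partitions
        if p.cpus_per_node >= min_cpus and (p.gpus_per_node > 0) == needs_gpu
    ]
    return min(candidates, key=lambda p: p.cpus_per_node, default=None)
```

With this sketch, `select_partition(cluster, needs_gpu=True, min_cpus=64)` returns a GPU partition with at least 64 CPUs per node, or `None` if no partition qualifies.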

Files

mdfactory/performance/__init__.py — new package
mdfactory/performance/cluster.py — autodiscovery implementation
mdfactory/tests/test_cluster.py — unit tests with mocked SLURM output

Test plan

pytest mdfactory/tests/test_cluster.py -v (mocked, no cluster needed)
Verify discover_cluster() returns None on laptop
Verify populated ClusterInfo on a real SLURM node
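As a rough illustration of the mocked approach in the test plan, the sketch below patches `subprocess.run` so that no SLURM installation is needed. The `list_partitions` function is a hypothetical stand-in for the code under test, not a function from this PR.

```python
from __future__ import annotations

import subprocess
from unittest import mock


def list_partitions() -> list[str] | None:
    """Hypothetical code under test: partition names from `sinfo -h -o %P`."""
    try:
        result = subprocess.run(
            ["sinfo", "-h", "-o", "%P"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # sinfo missing or unresponsive: treat as "no SLURM here"
    if result.returncode != 0:
        return None
    # The default partition carries a trailing '*', which is stripped here.
    return [name.rstrip("*") for name in result.stdout.split()]


def test_parses_mocked_sinfo_output():
    fake = subprocess.CompletedProcess(
        args=["sinfo"], returncode=0, stdout="cpu*\ngpu\n", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake):
        assert list_partitions() == ["cpu", "gpu"]


def test_returns_none_without_slurm():
    # Simulates a laptop where the sinfo binary does not exist.
    with mock.patch("subprocess.run", side_effect=FileNotFoundError("sinfo")):
        assert list_partitions() is None
```

Both tests run under pytest on any machine, since the patched `subprocess.run` never spawns a process.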

CLAassistant · 2026-05-18T19:05:38Z

All committers have signed the CLA.

* Add SlurmConfig.from_cluster() classmethod for autodiscovery
* Use autodiscovery fallback when --account not provided in SLURM commands
* Add tests for SlurmConfig.from_cluster() autodiscovery
* Add tests for mdfactory config cluster CLI command

maxscheurer · 2026-06-02T08:26:56Z

+@dataclass(frozen=True)
+class NodeType:
+    """Hardware specification of a node type within a partition.
+
+    Parameters
+    ----------
+    cpus : int
+        Number of CPU cores per node.
+    memory_mb : int
+        Memory in megabytes per node.
+    gpu_specs : tuple of (int, str)
+        GPU specifications as (count, type) tuples. Empty if CPU-only.
+        Multiple entries represent different GPU types on the same node.
+    features : tuple of str
+        SLURM feature/constraint tags on this node type (immutable).
+    count : int
+        Number of nodes with this exact configuration.
+    """
+
+    cpus: int
+    memory_mb: int
+    gpu_specs: tuple[tuple[int, str], ...] = field(default_factory=tuple)
+    features: tuple[str, ...] = field(default_factory=tuple)
+    count: int = 1


We use pydantic everywhere else in mdfactory, so it would be great to use it throughout here as well.

maxscheurer · 2026-06-02T08:28:06Z

+    default_account: str | None = None
+
+
+def _run_command(cmd: list[str], *, timeout: int = 30) -> str | None:


Check if we have this utility function somewhere else already.

Good catch! I've refactored this into a generalized run_command() utility in mdfactory/utils/utilities.py.

The cluster autodiscovery now uses this instead of its local _run_command(). I also identified several other subprocess calls that could benefit from this utility (in cli.py, sync_config.py, and bilayer/artifacts/__init__.py), but didn't want to expand the scope of this PR beyond cluster autodiscovery.

I've created issue #19 to track migrating those locations separately - keeping this PR focused while ensuring the broader refactoring doesn't get lost.
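For readers following the thread, a generalized helper of this kind might look roughly like the sketch below. It keeps the signature shown in the diff above, but the body is an assumption and not the actual contents of mdfactory/utils/utilities.py. It covers the three edge cases named in the commit history: timeout, missing binary, and nonzero exit.

```python
from __future__ import annotations

import subprocess


def run_command(cmd: list[str], *, timeout: int = 30) -> str | None:
    """Run cmd and return its stdout, or None on any failure (assumed contract)."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout,
            check=False,          # inspect the return code ourselves
        )
    except FileNotFoundError:
        return None  # binary is not installed, e.g. sinfo on a laptop
    except subprocess.TimeoutExpired:
        return None  # command did not finish within the timeout
    except OSError:
        return None  # other launch failures, e.g. permission denied
    if result.returncode != 0:
        return None  # command ran but reported an error
    return result.stdout
```

Returning `None` instead of raising keeps callers simple when SLURM is absent, at the cost of hiding the failure reason; a variant that logs `result.stderr` would be a reasonable alternative.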

gregorweiss · 2026-06-11T11:24:45Z

The PR #11 scope from #20 is complete on this branch: BaseSlurmConfig extracted to performance/slurm_config.py, SlurmConfig migrated to Pydantic, normalize_slurm_time moved, from_yaml() added, config.ini split into PARTITION_CPU/PARTITION_GPU/DEFAULT_QOS, tests updated. Re-export shim in analysis/submit.py keeps existing callers intact.
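One way to picture how the split ini keys and the precedence rules could interact is sketched below. The tier order (explicit argument, then configured setting, then autodiscovered value), the `[slurm]` section name, and the `resolve_partition` helper are assumptions for illustration; only the key names PARTITION_CPU, PARTITION_GPU, and DEFAULT_QOS come from this PR.

```python
from __future__ import annotations

import configparser

# Hypothetical config.ini contents; the [slurm] section name is an assumption.
CONFIG_TEXT = """
[slurm]
PARTITION_CPU = cpu-long
PARTITION_GPU =
DEFAULT_QOS = normal
"""


def resolve_partition(
    explicit: str | None,
    settings: configparser.ConfigParser,
    discovered: str | None,
    *,
    needs_gpu: bool,
) -> str | None:
    """Return the first non-empty value: explicit, configured, then discovered."""
    key = "PARTITION_GPU" if needs_gpu else "PARTITION_CPU"
    configured = settings.get("slurm", key, fallback="").strip()
    for candidate in (explicit, configured, discovered):
        if candidate:
            return candidate
    return None


settings = configparser.ConfigParser()
settings.read_string(CONFIG_TEXT)

# An explicit value wins over everything else.
print(resolve_partition("debug", settings, "auto-cpu", needs_gpu=False))
# Without an explicit value, the configured CPU partition is used.
print(resolve_partition(None, settings, "auto-cpu", needs_gpu=False))
# PARTITION_GPU is empty, so the autodiscovered value is the fallback.
print(resolve_partition(None, settings, "auto-gpu", needs_gpu=True))
```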

gregorweiss added 6 commits May 18, 2026 18:25

Add SLURM cluster autodiscovery module with unit tests

18f9fe9

Fix partition state: report 'up' if any node is schedulable

0a437c8

Query real SLURM default account via sacctmgr show user

dd94804

Parse default_time from sinfo %L field separately from max_time

9530570

Use tuple for NodeType.features to enforce full immutability

4e62324

Add _run_command edge case tests (timeout, missing binary, nonzero exit)

fe75847

gregorweiss self-assigned this May 18, 2026

gregorweiss added the enhancement New feature or request label May 18, 2026

gregorweiss linked an issue May 18, 2026 that may be closed by this pull request

Implement SLURM cluster autodiscovery #7

Open

gregorweiss added 3 commits May 20, 2026 20:43

Fix PLW2901 lint errors and reformat test file

925488d

fix new NodeType signature in tests

c4e9376

gregorweiss force-pushed the worktree-slurm-cluster-autodiscovery branch from 1d6b0ad to c4e9376 Compare June 1, 2026 12:30

gregorweiss changed the title ~~WIP feat: add slurm cluster autodiscovery~~ feat: add slurm cluster autodiscovery Jun 1, 2026

Slurm settings taking precedence

b974209

maxscheurer reviewed Jun 2, 2026

Convert dataclasses to Pydantic models for consistency

784381a

gregorweiss mentioned this pull request Jun 10, 2026

Refactor subprocess calls to use run_command utility #19

Open

3 tasks

Refactor subprocess execution into generalized run_command utility

c0b6d34

gregorweiss force-pushed the worktree-slurm-cluster-autodiscovery branch from 8e5eef0 to c0b6d34 Compare June 10, 2026 07:47

gregorweiss requested a review from maxscheurer June 10, 2026 08:31

gregorweiss mentioned this pull request Jun 11, 2026

Extract BaseSlurmConfig to unify SLURM resource config across submitit and Parsl backends #20

Open

gregorweiss added 7 commits June 11, 2026 10:53

add BaseSlurmConfig/SlurmConfig/normalize_slurm_time to performance package

6e0a1b5

migrate submit.py: replace @dataclass SlurmConfig with re-export shim from performance

533204b

split slurm ini keys to PARTITION_CPU/PARTITION_GPU/DEFAULT_QOS, add settings properties

078d69f

add test_slurm_config: BaseSlurmConfig 3-tier precedence, SlurmConfig model, from_yaml

42c7fee

update test_submit: Pydantic SlurmConfig, min_cpus partition selection, slurm_partition_cpu

a088281

fix from_cluster() return type to Self, path: Path|str on from_yaml, tighten docstring

69e780d

fix cli.py: partition=None sentinel, forward min_cpus/min_mem_gb to from_cluster()

6bca00d

gregorweiss added 2 commits June 11, 2026 12:43

add no_slurm_settings fixture; apply to isolation-sensitive from_cluster tests

4a4a065

extend no_slurm_settings to positive-path autodiscovery tests

c1e6265

pre-commit fixes

2864fa1

maxscheurer mentioned this pull request Jun 11, 2026

HPC Infrastructure Mega-Issue #21

Open

9 tasks

gregorweiss changed the base branch from main to develop June 11, 2026 13:47

gregorweiss wants to merge 22 commits into develop from worktree-slurm-cluster-autodiscovery
