feat: add slurm cluster autodiscovery#11

Open
gregorweiss wants to merge 22 commits into develop from worktree-slurm-cluster-autodiscovery

Conversation

@gregorweiss gregorweiss commented May 18, 2026

Collaborator

Adds mdfactory/performance/cluster.py — queries sinfo and sacctmgr to discover cluster resources automatically. This is the foundation for all planned HPC features (benchmarking, node packing, GPU sharing).

What it does

  • discover_cluster() → returns partitions, node specs (CPUs, memory, GPUs), accounts, and QOS policies as frozen dataclasses
  • select_partition(cluster, needs_gpu=True, min_cpus=64) → picks the best partition for given requirements
  • Returns None gracefully on non-SLURM machines
  • Results cached per session (topology doesn't change mid-run)

Files

  • mdfactory/performance/__init__.py — new package
  • mdfactory/performance/cluster.py — autodiscovery implementation
  • mdfactory/tests/test_cluster.py — unit tests with mocked SLURM output

Test plan

  • pytest mdfactory/tests/test_cluster.py -v (mocked, no cluster needed)
  • Verify discover_cluster() returns None on laptop
  • Verify populated ClusterInfo on a real SLURM node

@gregorweiss gregorweiss self-assigned this May 18, 2026
@gregorweiss gregorweiss added the enhancement New feature or request label May 18, 2026
CLAassistant commented May 18, 2026


CLA assistant check
All committers have signed the CLA.

@gregorweiss gregorweiss linked an issue May 18, 2026 that may be closed by this pull request
* Add SlurmConfig.from_cluster() classmethod for autodiscovery

* Use autodiscovery fallback when --account not provided in SLURM commands

* Add tests for SlurmConfig.from_cluster() autodiscovery

* Add tests for mdfactory config cluster CLI command
@gregorweiss gregorweiss force-pushed the worktree-slurm-cluster-autodiscovery branch from 1d6b0ad to c4e9376 Compare June 1, 2026 12:30
@gregorweiss gregorweiss changed the title WIP feat: add slurm cluster autodiscovery feat: add slurm cluster autodiscovery Jun 1, 2026
Comment thread mdfactory/performance/cluster.py Outdated
Comment on lines +32 to +55
@dataclass(frozen=True)
class NodeType:
    """Hardware specification of a node type within a partition.

    Parameters
    ----------
    cpus : int
        Number of CPU cores per node.
    memory_mb : int
        Memory in megabytes per node.
    gpu_specs : tuple of (int, str)
        GPU specifications as (count, type) tuples. Empty if CPU-only.
        Multiple entries represent different GPU types on the same node.
    features : tuple of str
        SLURM feature/constraint tags on this node type (immutable).
    count : int
        Number of nodes with this exact configuration.
    """

    cpus: int
    memory_mb: int
    gpu_specs: tuple[tuple[int, str], ...] = field(default_factory=tuple)
    features: tuple[str, ...] = field(default_factory=tuple)
    count: int = 1

Collaborator

We use pydantic everywhere else in mdfactory, so would be great to use that throughout.

Collaborator Author

Done.

Comment thread mdfactory/performance/cluster.py Outdated
    default_account: str | None = None


def _run_command(cmd: list[str], *, timeout: int = 30) -> str | None:

Collaborator

Check if we have this utility function somewhere else already.

Collaborator Author

Good catch! I've refactored this into a generalized run_command() utility in mdfactory/utils/utilities.py.

The cluster autodiscovery now uses this instead of its local _run_command(). I also identified several other subprocess calls that could benefit from this utility (in cli.py, sync_config.py, and bilayer/artifacts/__init__.py), but didn't want to expand the scope of this PR beyond cluster autodiscovery.

I've created issue #19 to track migrating those locations separately - keeping this PR focused while ensuring the broader refactoring doesn't get lost.
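
For context, a generalized helper of this kind might look roughly like the sketch below. It reuses the signature quoted in the review thread (`cmd: list[str], *, timeout: int = 30) -> str | None`), but the body is an assumption: the actual `run_command()` in mdfactory/utils/utilities.py may handle errors, logging, or return values differently.

```python
from __future__ import annotations

import subprocess
import sys


def run_command(cmd: list[str], *, timeout: int = 30) -> str | None:
    """Run a command and return its stdout, or None on any failure.

    Assumed behaviour: a missing binary, a timeout, or a non-zero exit
    status all map to None so that callers can fall back gracefully.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout,
            check=False,          # inspect the return code ourselves
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers FileNotFoundError and PermissionError
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# Usage: a command that is guaranteed to exist, and one that is not.
ok_output = run_command([sys.executable, "-c", "print('ok')"])
missing_output = run_command(["definitely-not-a-real-binary-xyz"])
```

Returning `None` rather than raising keeps call sites such as the `sinfo` and `sacctmgr` queries short, at the cost of hiding why a command failed; a variant that logs stderr would be a reasonable alternative.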

@gregorweiss

Collaborator Author

The PR #11 scope from #20 is complete on this branch: BaseSlurmConfig extracted to performance/slurm_config.py, SlurmConfig migrated to Pydantic, normalize_slurm_time moved, from_yaml() added, config.ini split into PARTITION_CPU/PARTITION_GPU/DEFAULT_QOS, tests updated. Re-export shim in analysis/submit.py keeps existing callers intact.

@maxscheurer maxscheurer mentioned this pull request Jun 11, 2026
9 tasks
@gregorweiss gregorweiss changed the base branch from main to develop June 11, 2026 13:47

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement SLURM cluster autodiscovery

3 participants