feat: add slurm cluster autodiscovery #11
* Add SlurmConfig.from_cluster() classmethod for autodiscovery
* Use autodiscovery fallback when --account not provided in SLURM commands
* Add tests for SlurmConfig.from_cluster() autodiscovery
* Add tests for mdfactory config cluster CLI command
1d6b0ad to c4e9376
@dataclass(frozen=True)
class NodeType:
    """Hardware specification of a node type within a partition.

    Parameters
    ----------
    cpus : int
        Number of CPU cores per node.
    memory_mb : int
        Memory in megabytes per node.
    gpu_specs : tuple of (int, str)
        GPU specifications as (count, type) tuples. Empty if CPU-only.
        Multiple entries represent different GPU types on the same node.
    features : tuple of str
        SLURM feature/constraint tags on this node type (immutable).
    count : int
        Number of nodes with this exact configuration.
    """

    cpus: int
    memory_mb: int
    gpu_specs: tuple[tuple[int, str], ...] = field(default_factory=tuple)
    features: tuple[str, ...] = field(default_factory=tuple)
    count: int = 1
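To illustrate how the frozen dataclass in this hunk behaves, the sketch below re-declares `NodeType` with the same fields and exercises it. The example values (64 cores, A100 GPUs) and the `total_gpus` helper are assumptions made for illustration and are not part of the PR.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class NodeType:
    """Re-declaration of the dataclass from the diff, for a standalone example."""

    cpus: int
    memory_mb: int
    gpu_specs: tuple[tuple[int, str], ...] = field(default_factory=tuple)
    features: tuple[str, ...] = field(default_factory=tuple)
    count: int = 1


def total_gpus(node: NodeType) -> int:
    """Hypothetical helper: GPUs per node, summed over all GPU types."""
    return sum(n for n, _ in node.gpu_specs)


# Illustrative values only; they do not describe any real cluster.
gpu_node = NodeType(cpus=64, memory_mb=512_000, gpu_specs=((4, "a100"),), count=8)
cpu_node = NodeType(cpus=128, memory_mb=256_000)

# Frozen dataclasses reject attribute assignment after construction.
try:
    gpu_node.cpus = 32
    mutated = True
except FrozenInstanceError:
    mutated = False

# Every field is an int or a tuple, so instances are hashable and
# equal configurations collapse to one entry in a set.
unique_types = {gpu_node, cpu_node, NodeType(cpus=128, memory_mb=256_000)}
```

Because the fields are tuples rather than lists, equal node configurations compare and hash identically, which is what makes a per-configuration `count` meaningful.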
We use pydantic everywhere else in mdfactory, so it would be great to use it throughout here as well.
    default_account: str | None = None


def _run_command(cmd: list[str], *, timeout: int = 30) -> str | None:
Check if we have this utility function somewhere else already.
Good catch! I've refactored this into a generalized run_command() utility in mdfactory/utils/utilities.py.
The cluster autodiscovery now uses this instead of its local _run_command(). I also identified several other subprocess calls that could benefit from this utility (in cli.py, sync_config.py, and bilayer/artifacts/__init__.py), but didn't want to expand the scope of this PR beyond cluster autodiscovery.
I've created issue #19 to track migrating those locations separately - keeping this PR focused while ensuring the broader refactoring doesn't get lost.
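For readers following along, a generalized helper of this kind might look roughly like the sketch below. It is only an assumption of the shape, based on the `_run_command` signature shown in the diff. The actual `run_command()` in mdfactory/utils/utilities.py is not shown in this thread and may differ in name, arguments, and error handling.

```python
from __future__ import annotations

import subprocess
import sys


def run_command(cmd: list[str], *, timeout: int = 30) -> str | None:
    """Run cmd and return its stripped stdout, or None on any failure.

    Hypothetical sketch, not the actual mdfactory implementation.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # a non-zero exit status raises CalledProcessError
        )
    except (OSError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # OSError covers a missing executable (FileNotFoundError)
        # as well as permission problems.
        return None
    return result.stdout.strip()


# Usage: a successful command, a missing executable, and a failing command.
ok = run_command([sys.executable, "-c", "print('hello')"])
missing = run_command(["definitely-not-a-real-command-xyz"])
failed = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
```

Returning `None` instead of raising keeps callers such as cluster autodiscovery simple on machines where `sinfo` or `sacctmgr` is not installed, at the cost of hiding the reason for the failure.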
8e5eef0 to c0b6d34
… from performance
…settings properties
… model, from_yaml
…n, slurm_partition_cpu
…tighten docstring
The PR #11 scope from #20 is complete on this branch:
Adds mdfactory/performance/cluster.py, which queries sinfo and sacctmgr to discover cluster resources automatically. This is the foundation for all planned HPC features (benchmarking, node packing, GPU sharing).

What it does
* discover_cluster() → returns partitions, node specs (CPUs, memory, GPUs), accounts, and QOS policies as frozen dataclasses
* select_partition(cluster, needs_gpu=True, min_cpus=64) → picks the best partition for given requirements
* Returns None gracefully on non-SLURM machines

Files
* mdfactory/performance/__init__.py: new package
* mdfactory/performance/cluster.py: autodiscovery implementation
* mdfactory/tests/test_cluster.py: unit tests with mocked SLURM output

Test plan
* pytest mdfactory/tests/test_cluster.py -v (mocked, no cluster needed)
* discover_cluster() returns None on laptop
* Returns ClusterInfo on a real SLURM node
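The discovery and selection steps described above can be sketched against mocked `sinfo` output, in the spirit of the mocked unit tests. Everything here is an assumption for illustration: the `NodeRow` record, the `sinfo -h -o "%P|%c|%m|%G|%D"` format string, and the "most nodes wins" selection rule are not taken from the PR, whose `discover_cluster()` and `select_partition()` may work differently.

```python
from __future__ import annotations

from dataclasses import dataclass

# Mocked output of: sinfo -h -o "%P|%c|%m|%G|%D"
# Columns: partition, CPUs per node, memory in MB, GRES, node count.
# Real output may append '+' to numbers when nodes differ; this sketch
# assumes uniform values and simple GRES strings.
MOCK_SINFO = """\
cpu*|128|256000|(null)|40
gpu|64|512000|gpu:a100:4|8
gpu|32|128000|gpu:v100:2(S:0-1)|4
"""


@dataclass(frozen=True)
class NodeRow:
    """Hypothetical flat record; the PR's own classes may be shaped differently."""

    partition: str
    cpus: int
    memory_mb: int
    gpus: int
    count: int


def parse_gpu_count(gres: str) -> int:
    """Count GPUs in a GRES string such as 'gpu:a100:4' or 'gpu:2'."""
    total = 0
    for entry in gres.split(","):
        entry = entry.split("(")[0]  # drop suffixes like '(S:0-1)'
        parts = entry.split(":")
        if parts[0] == "gpu" and parts[-1].isdigit():
            total += int(parts[-1])
    return total


def parse_sinfo(text: str) -> list[NodeRow]:
    """Turn pipe-separated sinfo lines into NodeRow records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        partition, cpus, mem, gres, count = line.split("|")
        rows.append(
            NodeRow(
                partition=partition.rstrip("*"),  # '*' marks the default partition
                cpus=int(cpus),
                memory_mb=int(mem),
                gpus=parse_gpu_count(gres),
                count=int(count),
            )
        )
    return rows


def select_partition(
    rows: list[NodeRow], *, needs_gpu: bool = False, min_cpus: int = 1
) -> str | None:
    """Return the partition of the matching node type with the most nodes.

    The ranking rule is an assumption; returns None when nothing matches.
    """
    candidates = [
        r for r in rows if r.cpus >= min_cpus and (r.gpus > 0 or not needs_gpu)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.count).partition


rows = parse_sinfo(MOCK_SINFO)
gpu_choice = select_partition(rows, needs_gpu=True, min_cpus=64)
```

Parsing from a string rather than calling `sinfo` directly is what allows this kind of logic to be tested on a laptop with no cluster available.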