Diffusion-based system emulation framework. Trains a conditional diffusion model to generate executable workloads that reproduce hardware performance traces of target applications.
- Requirements
- Installation
- 1. Target Machine Setup
- 2. Data Collection
- 3. Training
- 4. RL Fine-tuning
- 5. Inference Server & Client
## Requirements

- Python 3.10
- CUDA 12.x
- uv (install with `curl -LsSf https://astral.sh/uv/install.sh | sh`)
## Installation

```bash
uv sync
```

This creates a `.venv` in the repo root, installs all pinned dependencies from `uv.lock`, and installs the `mimesys` package in editable mode. To run any command inside the environment, prefix it with `uv run`.
To activate the venv directly (optional):

```bash
source .venv/bin/activate
```

## 1. Target Machine Setup

Data collection requires a cluster of machines accessible over SSH. Each node runs the benchmark and ships results back to the controller. The scripts in `worker_scripts/` automate this setup.
Fill in `worker_scripts/config.py` with your credentials and the list of worker hostnames:

```python
USERNAME = "your_username"
PRIVATE_KEY_PATH = "~/.ssh/id_rsa"
HOSTNAMES = ["worker-01.example.com", "worker-02.example.com", ...]
```

Run the following command to install all required dependencies for executing synthesized workloads and collecting profiling metrics:
**Command**

```bash
cd worker_scripts
uv run python install_remote_dependencies.py
```

## 2. Data Collection

### How collection works on each machine
During each active-learning round, the controller:
- Writes execution plans (HDF5 files) to per-node directories and transfers them over SSH.
- Each node runs `worker_scripts/collect_mimesys_metrics.sh`, which invokes the benchmark with the assigned plans and collects hardware performance counters.
- Results are zipped, copied back to the controller via `scp`, then parsed and filtered before being added to the training dataset.
Training data is collected via novelty-guided active learning (`collection/collect_training_data.py`). The loop iteratively profiles stressor compositions on remote machines and prioritizes those likely to produce underexplored resource-usage patterns.
**Command**

```bash
cd mimesys/collection
uv run python collect_training_data.py --rounds 20
```

`--rounds` sets the number of active-learning rounds after the initial sweep (round 0). Output is written to the path configured in `OUTPUT_PATH` at the top of the script.
### Overview

#### Round 0 — Initialization
- A one-hot sweep (`initial_candidates`) covers each stressor in isolation across varying thread counts and weight scales. This anchors the dataset with ground-truth single-stressor behavior at the extremes of the metric space.
#### Rounds 1+ — Active Learning
- Each round proposes a batch of stressor compositions using a hybrid of two complementary strategies (novelty vs. convex-hull interpolation, controlled by `HULL_FPS_RATIO`):
  - **RF novelty + FPS selection** — trains a Random Forest surrogate on the current dataset. A large pool of candidates is generated by mixing mutations of existing actions with fresh random compositions. Each candidate is scored by:
    - **Rarity**: negative log-density under a KDE fit to the observed metric distribution (low density = underrepresented region).
    - **Uncertainty**: log-determinant of the RF per-tree prediction covariance (high variance = low-confidence region).

    The combined novelty score modulates a greedy selection in predicted metric space, ensuring the final batch is both novel and mutually diverse.
  - **Convex-hull interpolation** — builds a convex hull of the observed metric space across selected metric subspaces. Grid cells inside the hull are sorted by occupancy (empty first), and new compositions are synthesized by interpolating the actions of their nearest neighbors, targeting gaps in the covered metric space.
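The two novelty signals can be sketched in NumPy as follows. This is an approximation of the idea, not the repo's code: the real implementation uses a fitted KDE and a Random Forest surrogate, whereas here the KDE is a hand-rolled Gaussian kernel estimate and `tree_preds` stands in for any ensemble of per-tree predictions; function names and the bandwidth are assumptions.

```python
# Numpy-only sketch of the rarity and uncertainty scores described
# above. Bandwidth, shapes, and names are illustrative.
import numpy as np

def rarity(candidates: np.ndarray, observed: np.ndarray, bandwidth: float = 0.5):
    """Negative log-density of each candidate under a Gaussian KDE
    fit to the observed metric vectors (higher = rarer)."""
    d2 = ((candidates[:, None, :] - observed[None, :, :]) ** 2).sum(-1)
    dens = np.exp(-d2 / (2 * bandwidth**2)).mean(axis=1) + 1e-12
    return -np.log(dens)

def uncertainty(tree_preds: np.ndarray):
    """Log-determinant of the ensemble prediction covariance per
    candidate; tree_preds: (n_trees, n_candidates, n_metrics)."""
    centered = tree_preds - tree_preds.mean(axis=0, keepdims=True)
    n_trees = tree_preds.shape[0]
    covs = np.einsum("tcm,tcn->cmn", centered, centered) / (n_trees - 1)
    eye = 1e-6 * np.eye(covs.shape[-1])      # regularize for stability
    return np.linalg.slogdet(covs + eye)[1]

rng = np.random.default_rng(0)
observed = rng.normal(size=(200, 3))
# One candidate inside the dense region, one far outside it:
candidates = np.vstack([np.zeros((1, 3)), np.full((1, 3), 5.0)])
scores = rarity(candidates, observed)
print(scores[1] > scores[0])  # the far-out candidate is rarer
```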
After profiling, each round filters out high-variance measurements (per-metric variance exceeding 10% of the current observed range) before adding them to the dataset, keeping only stable, reproducible traces.
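The stability filter can be sketched as below. The exact threshold semantics (variance vs. standard deviation, per-repeat layout) are not specified in the text, so this is an assumed reading:

```python
# Sketch of the post-round filter: drop any measurement whose
# per-metric variance across repeats exceeds 10% of that metric's
# currently observed range. Shapes and semantics are assumptions.
import numpy as np

def filter_stable(repeats: np.ndarray, observed_range: np.ndarray, frac: float = 0.10):
    """repeats: (n_samples, n_repeats, n_metrics) raw measurements.
    Keep samples whose every metric variance <= frac * range."""
    var = repeats.var(axis=1)                      # (n_samples, n_metrics)
    keep = (var <= frac * observed_range).all(axis=1)
    return repeats[keep].mean(axis=1), keep        # stable means + mask

rng = np.random.default_rng(1)
stable = rng.normal(0.0, 0.01, size=(1, 5, 3))    # tight repeats
noisy = rng.normal(0.0, 10.0, size=(1, 5, 3))     # wildly varying repeats
repeats = np.concatenate([stable, noisy])
means, keep = filter_stable(repeats, observed_range=np.ones(3))
print(keep)
```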
## 3. Training

All training commands are run from `mimesys/training/`.

### Supervised pretraining

```bash
cd mimesys/training
CUDA_VISIBLE_DEVICES=0 uv run python trainer.py +exps=pretrain
```

### Multi-GPU

```bash
CUDA_VISIBLE_DEVICES=0,1 uv run python trainer.py +exps=pretrain \
    train.trainer.devices=2 \
    train.trainer.strategy=ddp
```

### Resume from checkpoint

```bash
uv run python trainer.py +exps=pretrain \
    train.trainer.ckpt_path=/path/to/diffusion-epoch=999.ckpt
```

### Training config example
```yaml
# mimesys/conf/exps/concat_aug10.yaml (train + model sections)
train:
  trainer:
    trainer_model_name: MimesysTrainer
    max_epochs: 2000
    devices: 1
    precision: 16-mixed          # inherited from base
    check_val_every_n_epoch: 500
    ckpt_path: ""
  optim:
    lr: 1e-4
  callbacks:
    checkpoint:
      dirpath: diffusion/concat_aug10
      every_n_epochs: 500
      monitor: epoch
      save_top_k: -1
      save_last: true
  run_train: true
  run_test: false
  use_rl: false
  prev_state_lambda: 0.0
model:
  unet:
    input_dim: 20                # num_stressor_types × num_parameters
  context:
    input_dim: 25                # trace feature dimension
  action_dim: 260
  num_heads: 4
  num_layers: 6
  hidden_dim: 256
  dropout: 0.1
  encoder_type: concat
  diffusion:
    n_timesteps: 25
    cfg_args:
      cfg_drop_prob: 0.1
      cfg_guide_w: 3
log:
  project_name: mimesys
  run_name: diffusion/concat_aug10
```

Training metrics are logged to Weights & Biases under `log.project_name` / `log.run_name`.
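The `cfg_*` fields are classifier-free guidance settings: the context is dropped during training with probability `cfg_drop_prob`, and at sampling time the conditional and unconditional denoiser outputs are blended with weight `cfg_guide_w`. The standard combination rule is sketched below; whether this repo uses exactly this convention is an assumption.

```python
# Standard classifier-free guidance blend (assumed convention):
#   eps = eps_uncond + w * (eps_cond - eps_uncond)
# w = 0 ignores the condition, w = 1 is pure conditional,
# w > 1 (e.g. cfg_guide_w: 3) extrapolates toward the condition.
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, guide_w: float = 3.0):
    """Blend unconditional and conditional noise predictions."""
    return eps_uncond + guide_w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)
eps_c = np.ones(4)
print(cfg_combine(eps_u, eps_c, guide_w=3.0))  # → [3. 3. 3. 3.]
```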
## 4. RL Fine-tuning

RL fine-tuning uses DDPO with a profiling-based reward. Start from a pretrained supervised checkpoint.

**Command**

```bash
cd mimesys/training
CUDA_VISIBLE_DEVICES=0 uv run python trainer.py +exps=rl_finetuning
```

### RL config example
```yaml
# mimesys/conf/exps/rl_finetuning.yaml (train section)
train:
  trainer:
    trainer_model_name: MimesysTrainer
    max_epochs: 10000
    devices: 1
    check_val_every_n_epoch: 10
    ckpt_path: /path/to/pretrained/diffusion-epoch=999.ckpt  # start from supervised ckpt
  optim:
    lr: 3e-7                     # keep low to avoid catastrophic forgetting
  use_rl: true
  prev_state_lambda: 0.0
  ddpo:
    num_inner_epochs: 1
    num_batches_per_episode: 14
    reward_type: profiling       # live benchmark reward via SSH
    io_reward_weight: 1.0
    kl_coef: 0.05                # KL penalty to pretrained distribution
  callbacks:
    checkpoint:
      dirpath: diffusion/mimesys_pretrain
      every_n_epochs: 10
      monitor: epoch
      save_last: true
  async_validation: false
log:
  project_name: mimesys
  run_name: diffusion/mimesys_pretrain
```

The `profiler` section must be set (same SSH credentials as data collection). Remote machines run the benchmark and return the reward signal each episode.
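The `kl_coef` penalty plausibly enters the objective as a generic "reward minus weighted KL" shaping, sketched below. This is an assumption about the estimator: the repo's exact DDPO loss may apply the penalty differently, and `shaped_reward` is a hypothetical helper.

```python
# Generic KL-regularized reward shaping (assumed form):
#   r' = r - kl_coef * (log p_current - log p_pretrained)
# The per-sample log-prob difference is a standard one-sample KL
# estimate; subtracting it discourages drift from the supervised
# checkpoint that the fine-tune started from.
import numpy as np

def shaped_reward(profiling_reward: np.ndarray,
                  logp_current: np.ndarray,
                  logp_pretrained: np.ndarray,
                  kl_coef: float = 0.05):
    """Penalize reward by the divergence from the pretrained policy."""
    kl = logp_current - logp_pretrained
    return profiling_reward - kl_coef * kl

r = shaped_reward(np.array([1.0]), np.array([0.0]), np.array([-2.0]))
print(r)  # → [0.9]
```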
## 5. Inference Server & Client

### Server

```bash
uv run python -m mimesys.inference \
    --ckpt /path/to/diffusion-epoch=999.ckpt

# Custom port and experiment config
uv run python -m mimesys.inference \
    --ckpt /path/to/last.ckpt \
    --exp pretrain \
    --port 8000

# With remote profiling endpoint enabled
uv run python -m mimesys.inference \
    --ckpt /path/to/last.ckpt \
    --enable_profiling

# Choose devices
uv run python -m mimesys.inference \
    --ckpt /path/to/last.ckpt \
    --device cpu  # or cuda
```

| Flag | Default | Description |
|---|---|---|
| `--ckpt` | required | Path to `.ckpt` checkpoint |
| `--exp` | `pretrain` | Hydra experiment config name |
| `--port` | `8000` | HTTP port |
| `--host` | `0.0.0.0` | Bind address |
| `--enable_profiling` | off | Enable `POST /profile` (requires CloudLab SSH) |
| `--device` | auto | Force `cuda` or `cpu` |
### Client

Generate an execution plan (HDF5) from a time-series resource usage trace file. The trace file uses the HPCPerfStats format.

```bash
uv run python -m mimesys.inference.client generate-from-file \
    --file /path/to/stats-workload.txt \
    --method diffusion \
    --output execution_plan_series.h5
```

Using the generated `.h5` file, you can run a synthetic workload on a target machine:
```bash
# From the `fleetbench` directory on the target machine:
# copy the h5 file to the machine first, then place it in the execution plans directory
cp execution_plan_series.h5 fleetbench/mimesys/execution_plans/

# Run the synthetic workload
MIMESYS_ITERS=1 MIMESYS_SLEEP=0 \
ACTION_PROFILING_CACHE_DIR=${HOME_PATH}/fleetbench \
ACTION_LIST_PATH=${HOME_PATH}/fleetbench/fleetbench/mimesys/mimesys_actions.txt \
TACC_STATS_DIR=${HOME_PATH}/HPCPerfStats/monitor/src \
sudo bazel run --config=clang --config=opt fleetbench/mimesys:mimesys_benchmark -- --benchmark_filter="BM_Mimesys"
```

To generate a workload and profile it in a single command:
```bash
uv run python -m mimesys.inference.client profile-from-file \
    --file /path/to/stats-workload.txt \
    --method diffusion \
    --output metrics.png
```