Reproducible ML benchmark suite in Rust — proving edge-native training is viable, one algorithm at a time.
Part of the torchforge-rs ecosystem.
The long-term target of the torchforge-rs ecosystem is Federated Deep Reinforcement Learning (FDRL) at the edge: a fleet of constrained devices, each running a local DRL agent learning from its own physical environment, sharing only gradients with a coordinator. No cloud. No Python.
torchforge-bench is the proof layer for that target. The v1.x paper goal — the first reproducible FDRL benchmark suite in Rust for edge hardware — is only credible if the v0.x single-device numbers are rigorous first. Every methodology decision made here (hardware documentation, seed discipline, apples-to-apples comparison against CleanRL) is made with the federated case in mind: the same standards will apply across a fleet of devices.
This crate is v0.x infrastructure. FDRL is the v1.x target. The claim is not yet earned — the single-device benchmark has to land first.
Two facts are simultaneously true today:
- No published benchmark demonstrates a high-profile ML algorithm reproduced in Rust outperforming its Python reference implementation
- The absence of such a benchmark is a major reason practitioners do not take Rust ML seriously
torchforge-bench exists to close this gap — not with claims, but with reproducible numbers on documented hardware.
The reference target is CleanRL — the most reproducible Python RL benchmark suite, with single-file implementations and published results. We reproduce CleanRL algorithms in Rust, measure against the same environments on the same hardware, and publish everything, including results where Rust is slower.
- Reproducibility over performance — a benchmark that cannot be reproduced is worthless
- Documented hardware — every published result names the exact machine it was run on
- Apples-to-apples — same environment, same hyperparameters, same metric definitions as the Python reference
- No cherry-picking — we benchmark what we build, not what makes us look good
- Honest about unknowns — if a result is surprising, we investigate before claiming
v0.0.1 — Pre-alpha. No benchmarks published yet. Three prerequisite research items block v0.1.0.
The repository structure, CI, governance documents, baseline infrastructure (baselines/ via uv), and results schema are complete. Before algorithm implementation begins:
- [RESEARCH] PyO3 FFI overhead on `env.step()` must be measured and documented
- [RESEARCH] Neural network backend prototype (`burn` + `ndarray` vs `candle`) must be completed
- [RESEARCH] CleanRL DQN Python baseline must be run on target hardware and results stored
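The FFI-overhead item is essentially a micro-benchmark of one boundary crossing. A minimal sketch of the harness shape, using only `std::time::Instant` — in the real measurement the closure would wrap a PyO3 call into Gymnasium's `env.step()`; the stand-in workload, warm-up count, and iteration count here are illustrative assumptions, not the project's harness:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Mean per-call latency of `f` over `n` calls, in nanoseconds.
/// In the real measurement, `f` would wrap the PyO3 call into
/// Gymnasium's `env.step()`; here a stand-in closure shows the shape.
fn mean_call_ns<F: FnMut()>(mut f: F, n: u32) -> f64 {
    for _ in 0..1_000 {
        f(); // warm-up so one-time costs don't skew the mean
    }
    let start = Instant::now();
    for _ in 0..n {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(n)
}

fn main() {
    // Stand-in workload (a cheap PRNG step); the FFI boundary replaces this.
    let mut state = 1u64;
    let per_call = mean_call_ns(
        || state = black_box(state.wrapping_mul(6364136223846793005).wrapping_add(1)),
        1_000_000,
    );
    println!("mean per-call latency: {per_call:.1} ns");
}
```

In practice a harness such as criterion would also handle outlier analysis; the point of the sketch is that the number to publish is per-call overhead after warm-up, not total loop time.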
See ARCHITECTURE.md for rationale and TODO.md for the full roadmap.
| Version | Goal |
|---|---|
| Pre-v0.1.0 | FFI overhead measurement, NN backend prototype, Python baseline (hard blockers) |
| v0.1.0 | DQN on CartPole-v1 — results published against CleanRL, including if slower |
| v0.2.0 | PPO on CartPole-v1 |
| v0.3.0 | Edge hardware benchmarks (ARM, Raspberry Pi 5 candidate) |
| v0.4.0 | SAC on continuous control |
| v1.0.0 | First stable benchmark, externally reviewed methodology |
| v1.x | First reproducible FDRL benchmark suite in Rust for edge hardware — arXiv target |
The v1.x FDRL benchmark is the north star. Every methodology decision made at v0.x is made with the requirement that it holds under the federated case: same hardware documentation, same seed discipline, same publication standards — applied across a fleet of devices rather than one.
| Algorithm | Reference | Environment | Status |
|---|---|---|---|
| DQN | CleanRL dqn.py | CartPole-v1 | 🔲 Prerequisite research |
| PPO | CleanRL ppo.py | CartPole-v1 | 🔲 Blocked on v0.1.0 |
| SAC | CleanRL sac_continuous_action.py | TBD | 🔲 Blocked on v0.3.0 |
Python baselines are managed via uv with a committed lockfile — reproducible installs, no loose pip.
```shell
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Run the DQN CartPole baseline
cd baselines/
uv run python dqn_cartpole.py

# Results are written to ../results/baselines/
```

Rust benchmarks run via the standard `cargo bench` interface once v0.1.0 is implemented:

```shell
cargo bench --bench dqn_cartpole
```

Every published result includes: exact hardware, OS, Rust version, Python/PyTorch version, seeds (minimum 5, mean ± std), wall-clock time (total and training-only), and peak memory. No result is published without the full methodology table. See ARCHITECTURE.md for the complete specification.
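The seed aggregation rule is simple to state. A sketch of the `mean ± std` reporting form — the per-seed returns below are hypothetical, and the choice of sample standard deviation (n − 1 denominator) is an assumption, since the methodology table does not pin down the estimator:

```rust
/// Aggregate per-seed results into the published `mean ± std` form.
/// Uses sample standard deviation (n - 1 denominator) — an assumption here.
fn mean_std(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

fn main() {
    // Hypothetical episodic returns from 5 seeds (the stated minimum).
    let returns = [500.0, 487.5, 492.0, 500.0, 495.5];
    let (mean, std) = mean_std(&returns);
    println!("CartPole-v1 return: {mean:.1} ± {std:.1}");
    // prints "CartPole-v1 return: 495.0 ± 5.4"
}
```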
See CONTRIBUTING.md for the full guide — prerequisites (Rust, Python, uv), branching model, PR process, and the result reproducibility policy.
The most valuable contributions right now are:
- Running the CleanRL DQN baseline on your hardware and documenting results — this directly unblocks v0.1.0
- Running the NN backend prototypes (`burn` + `ndarray` vs `candle`) and reporting compile time, binary size, and autodiff correctness
- Measuring PyO3 FFI overhead on CartPole-v1 `env.step()`
- Challenging assumptions in ARCHITECTURE.md
Open an issue before submitting a PR.
Please read our Code of Conduct before participating.
Benchmark result disputes are handled via the methodology_challenge issue template, not as security issues — see SECURITY.md for what does qualify.
Apache-2.0. See LICENSE.
CleanRL baseline scripts in baselines/ are MIT licensed — see baselines/README.md for attribution.
Part of the torchforge-rs ecosystem — also see torchforge-data and torchforge-viz.