Skip to content

torchforge-rs/torchforge-bench

torchforge-bench

Reproducible ML benchmark suite in Rust — proving edge-native training is viable, one algorithm at a time.

Part of the torchforge-rs ecosystem.

Crates.io Docs.rs License: Apache-2.0 Rust CI


The Destination

The long-term target of the torchforge-rs ecosystem is Federated Deep Reinforcement Learning (FDRL) at the edge: a fleet of constrained devices, each running a local DRL agent learning from its own physical environment, sharing only gradients with a coordinator. No cloud. No Python.

torchforge-bench is the proof layer for that target. The v1.x paper goal — the first reproducible FDRL benchmark suite in Rust for edge hardware — is only credible if the v0.x single-device numbers are rigorous first. Every methodology decision made here (hardware documentation, seed discipline, apples-to-apples comparison against CleanRL) is made with the federated case in mind: the same standards will apply across a fleet of devices.

This crate is v0.x infrastructure. FDRL is the v1.x target. The claim is not yet earned — the single-device benchmark has to land first.


Why

Two facts are simultaneously true today:

  1. No published benchmark demonstrates a high-profile ML algorithm reproduced in Rust outperforming its Python reference implementation
  2. The absence of such a benchmark is the primary reason practitioners do not take Rust ML seriously

torchforge-bench exists to close this gap — not with claims, but with reproducible numbers on documented hardware.

The reference target is CleanRL — the most reproducible Python RL benchmark suite, with single-file implementations and published results. We reproduce CleanRL algorithms in Rust, measure against the same environments on the same hardware, and publish everything, including results where Rust is slower.


Design Principles

  1. Reproducibility over performance — a benchmark that cannot be reproduced is worthless
  2. Documented hardware — every published result names the exact machine it was run on
  3. Apples-to-apples — same environment, same hyperparameters, same metric definitions as the Python reference
  4. No cherry-picking — we benchmark what we build, not what makes us look good
  5. Honest about unknowns — if a result is surprising, we investigate before claiming

Status

v0.0.1 — Pre-alpha. No benchmarks published yet. Three prerequisite research items block v0.1.0.

The repository structure, CI, governance documents, baseline infrastructure (baselines/ via uv), and results schema are complete. Before algorithm implementation begins:

  1. [RESEARCH] — PyO3 FFI overhead on env.step() must be measured and documented
  2. [RESEARCH] — Neural network backend prototype (burn+ndarray vs candle) must be completed
  3. [RESEARCH] — CleanRL DQN Python baseline must be run on target hardware and results stored

See ARCHITECTURE.md for rationale and TODO.md for the full roadmap.


Roadmap

Version Goal
Pre-v0.1.0 FFI overhead measurement, NN backend prototype, Python baseline (hard blockers)
v0.1.0 DQN on CartPole-v1 — results published against CleanRL, including if slower
v0.2.0 PPO on CartPole-v1
v0.3.0 Edge hardware benchmarks (ARM, Raspberry Pi 5 candidate)
v0.4.0 SAC on continuous control
v1.0.0 First stable benchmark, externally reviewed methodology
v1.x First reproducible FDRL benchmark suite in Rust for edge hardware — arXiv target

The v1.x FDRL benchmark is the north star. Every methodology decision made at v0.x is made with the requirement that it holds under the federated case: same hardware documentation, same seed discipline, same publication standards — applied across a fleet of devices rather than one.


Algorithm Targets

Algorithm Reference Environment Status
DQN CleanRL dqn.py CartPole-v1 🔲 Pre-requisite research
PPO CleanRL ppo.py CartPole-v1 🔲 Blocked on v0.1.0
SAC CleanRL sac_continuous_action.py TBD 🔲 Blocked on v0.3.0

Reproducing Baselines

Python baselines are managed via uv with a committed lockfile — reproducible installs, no loose pip.

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Run the DQN CartPole baseline
cd baselines/
uv run python dqn_cartpole.py

# Results are written to ../results/baselines/

Rust benchmarks run via the standard cargo bench interface once v0.1.0 is implemented:

cargo bench --bench dqn_cartpole

Every published result includes: exact hardware, OS, Rust version, Python/PyTorch version, seeds (minimum 5, mean ± std), wall-clock time (total and training-only), and peak memory. No result is published without the full methodology table. See ARCHITECTURE.md for the complete specification.


Contributing

See CONTRIBUTING.md for the full guide — prerequisites (Rust, Python, uv), branching model, PR process, and the result reproducibility policy.

The most valuable contributions right now are:

  • Running the CleanRL DQN baseline on your hardware and documenting results — this directly unblocks v0.1.0
  • Running the NN backend prototypes (burn+ndarray vs candle) and reporting compile time, binary size, and autodiff correctness
  • Measuring PyO3 FFI overhead on CartPole-v1 env.step()
  • Challenging assumptions in ARCHITECTURE.md

Open an issue before submitting a PR.

Please read our Code of Conduct before participating. Benchmark result disputes are handled via the methodology_challenge issue template, not as security issues — see SECURITY.md for what does qualify.


License

Apache-2.0. See LICENSE. CleanRL baseline scripts in baselines/ are MIT licensed — see baselines/README.md for attribution.

Part of the torchforge-rs ecosystem — also see torchforge-data and torchforge-viz.

About

Reproducible ML benchmark suite in Rust

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors