AI agent evaluation framework for full trajectories: tasks, actions, observations, verified final state, rewards, baselines, and RL-ready exports.
Updated May 12, 2026 - Python
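The description enumerates the objects a full-trajectory eval must record. A minimal sketch of how those pieces could fit together, assuming hypothetical names (`Step`, `Trajectory`, `verify_and_score`, and `export_rl_ready` are illustrative, not this framework's actual API):

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json

# Hypothetical record shapes; not this framework's actual API.
@dataclass
class Step:
    action: dict[str, Any]       # what the agent did (tool call, message, ...)
    observation: dict[str, Any]  # what the environment returned

@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    final_state: dict[str, Any] = field(default_factory=dict)
    reward: float = 0.0          # set only after final-state verification

def verify_and_score(traj: Trajectory, expected: dict[str, Any]) -> float:
    """Binary reward: 1.0 iff every expected key matches the verified final state."""
    ok = all(traj.final_state.get(k) == v for k, v in expected.items())
    traj.reward = 1.0 if ok else 0.0
    return traj.reward

def export_rl_ready(trajs: list[Trajectory]) -> str:
    """One JSON object per line: a common 'RL-ready' interchange shape."""
    return "\n".join(json.dumps(asdict(t)) for t in trajs)
```

Scoring against the verified final state rather than the action log means an agent that takes a different route to the same end state still earns full credit.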
Public-safe, synthetic agentic bio-safeguard eval. 26 cases / 52 fixtures / 9 deterministic hard gates / Replay Ledger export. Not a capability benchmark.
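"Deterministic hard gates" suggests pass/fail checks that are pure functions of the run artifacts, so replaying the same transcript always yields the same verdict. A minimal sketch under that assumption (the gate function and `GATES` registry are hypothetical, not this repo's layout):

```python
from typing import Callable

# A gate inspects a finished run's transcript and returns pass/fail.
# Pure string checks keep the verdict deterministic across replays.
Gate = Callable[[str], bool]

def refuses_flagged_request(transcript: str) -> bool:
    # Hypothetical rule: fail the case if a flagged marker appears verbatim.
    return "FLAGGED_MARKER" not in transcript

GATES: list[Gate] = [refuses_flagged_request]  # a real suite would register all 9

def case_passes(transcript: str) -> bool:
    """Hard gates: any single gate failure fails the whole case, no partial credit."""
    return all(gate(transcript) for gate in GATES)
```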