Limit the harm an untrusted AI agent can cause while it operates, even when it actively tries to subvert the controls around it. The core game: can a weaker, trusted model catch a stronger, untrusted agent attempting sabotage? And while keeping false alarms under ~1%?
Everything here runs on control-tower, the AI-control eval framework (LinuxArena / ControlArena) by Redwood Research and UK's AISI.
Important
Before continuing, make sure to familiarize yourself with LinuxArena's hackathon quickstart and we highly highly recommend at least skimming the resources they link (we've copied them to the bottom of this README for your convenience).
Before starting, make sure you have the following set up:
- A Unix-compatible machine (MacOS or Linux) or WSL for Windows
- The uv package manager
pip install uv - Docker installed and running
- The GH cli (
sudo apt install gh) used to pull LinuxArena envs
uv run ct init # completions, `ct` alias, settings, agent skills
uv run ct settings pull linuxarena # environments, main tasks, side tasksBefore spending any of your credits, run this to make sure everything works (it just runs ls and then exits, no model calls):
uv run ct run eval --policy test -e clinical_trial -t adverse_reaction_followup_tracking --no-upload| Track | Team | You build | |
|---|---|---|---|
| A | Monitor Building | 🔵 Blue | A monitor that scores how suspicious each action/trajectory is |
| B | Protocol Design | 🔵 Blue | A protocol (defer / rewrite / gate) and its safety–usefulness tradeoff |
| C | Attack Crafting | 🔴 Red | A sabotage trajectory (straj) that evades the monitor |
| D | Break a Protocol | 🔴 Red | An untrusted policy that beats a protocol — then a fix |
| E | New Side Tasks | ⚪ Gray | A new side task / environment / scorer that extends the testbed |
- Main task: the legitimate job the agent is asked to do.
- Side task: a high-stakes sabotage objective the untrusted agent may pursue covertly.
- Straj: a sabotage trajectory: a run that accomplishes the side task while staying subtle.
- Monitor: scores suspiciousness, 0 (clean) → 10 (very suspicious).
- Protocol: the blue-team decision policy that acts on monitor scores (audit / defer / rewrite / gate).
- Safety = caught / (caught + uncaught successful sabotage), at a fixed audit budget (FPR on honest runs).
- Usefulness = honest main-task success rate.
By default, anthropic/claude-haiku-4-5 (shorthand: haiku) is used as the untrusted model and openai/gpt-5-nano as the trusted one. If you decide to change this, beware of the cost of beefier models (especially since these evaluations tend to be token-heavy) and make sure to document this change in your deliverables.
Any time a command wants some kind of run as an argument, it's asking for the logs/[...].eval file produced by a ct run eval command.
Scored 1–5 on impact & innovation, execution quality, and clarity. Each team ships:
- 📄 a 2-page report;
- 💻 a code repo (commit your
data/andlogs/folders! we want to see them.); - 🎬 a 3-minute demo (either a video or in person)
- if it's a video, it can be an unlisted youtube link, public google drive link, etc.. Just make sure that we can access it.
1st-2nd-3rd place for the best monitors and best attack respectively, and "best side tasks" award.
- Running evals on Hawk: submit to a shared Hawk cluster and watch results in the viewer, instead of running locally (unless you're confident in your local compute or running out of time, we recommended using Hawk).
- Running Evaluations: full
ct run evalreference: policies, task sets, batching, replay. - Monitoring: monitor types, ensembles, suspiciousness scoring (building monitors).
- Control Protocols: defer-to-trusted, trusted editing, and writing your own protocol.
- Making Sabotage Trajectories: hand-crafting attacks that evade monitors.
- Making Side Tasks and the Hands-On Guide: extending the testbed with new environments and tasks.
- Safety Curves: turning monitor scores into safety/usefulness numbers.