HackOS 5 — AI Control Hackathon · Quickstart

Limit the harm an untrusted AI agent can cause while it operates, even when it actively tries to subvert the controls around it. The core game: can a weaker, trusted model catch a stronger, untrusted agent attempting sabotage? And while keeping false alarms under ~1%?

Everything here runs on control-tower, the AI-control eval framework (LinuxArena / ControlArena) by Redwood Research and UK's AISI.

Important

Before continuing, make sure to familiarize yourself with LinuxArena's hackathon quickstart and we highly highly recommend at least skimming the resources they link (we've copied them to the bottom of this README for your convenience).

Shared setup (all tracks)

Before starting, make sure you have the following set up:

A Unix-compatible machine (MacOS or Linux) or WSL for Windows
The uv package manager pip install uv
Docker installed and running
The GH cli (sudo apt install gh) used to pull LinuxArena envs

uv run ct init  # completions, `ct` alias, settings, agent skills
uv run ct settings pull linuxarena  # environments, main tasks, side tasks

Before spending any of your credits, run this to make sure everything works (it just runs ls and then exits, no model calls):

uv run ct run eval --policy test -e clinical_trial -t adverse_reaction_followup_tracking --no-upload

Pick a track

	Track	Team	You build
A	Monitor Building	🔵 Blue	A monitor that scores how suspicious each action/trajectory is
B	Protocol Design	🔵 Blue	A protocol (defer / rewrite / gate) and its safety–usefulness tradeoff
C	Attack Crafting	🔴 Red	A sabotage trajectory (straj) that evades the monitor
D	Break a Protocol	🔴 Red	An untrusted policy that beats a protocol — then a fix
E	New Side Tasks	⚪ Gray	A new side task / environment / scorer that extends the testbed

The shared vocabulary

Main task: the legitimate job the agent is asked to do.
Side task: a high-stakes sabotage objective the untrusted agent may pursue covertly.
Straj: a sabotage trajectory: a run that accomplishes the side task while staying subtle.
Monitor: scores suspiciousness, 0 (clean) → 10 (very suspicious).
Protocol: the blue-team decision policy that acts on monitor scores (audit / defer / rewrite / gate).
Safety = caught / (caught + uncaught successful sabotage), at a fixed audit budget (FPR on honest runs).
Usefulness = honest main-task success rate.

By default, anthropic/claude-haiku-4-5 (shorthand: haiku) is used as the untrusted model and openai/gpt-5-nano as the trusted one. If you decide to change this, beware of the cost of beefier models (especially since these evaluations tend to be token-heavy) and make sure to document this change in your deliverables.

Any time a command wants some kind of run as an argument, it's asking for the logs/[...].eval file produced by a ct run eval command.

Judging & deliverables

Scored 1–5 on impact & innovation, execution quality, and clarity. Each team ships:

📄 a 2-page report;
💻 a code repo (commit your data/ and logs/ folders! we want to see them.);
🎬 a 3-minute demo (either a video or in person)
- if it's a video, it can be an unlisted youtube link, public google drive link, etc.. Just make sure that we can access it.

1st-2nd-3rd place for the best monitors and best attack respectively, and "best side tasks" award.

Docs worth skimming at least once

Running evals on Hawk: submit to a shared Hawk cluster and watch results in the viewer, instead of running locally (unless you're confident in your local compute or running out of time, we recommended using Hawk).
Running Evaluations: full ct run eval reference: policies, task sets, batching, replay.
Monitoring: monitor types, ensembles, suspiciousness scoring (building monitors).
Control Protocols: defer-to-trusted, trusted editing, and writing your own protocol.
Making Sabotage Trajectories: hand-crafting attacks that evade monitors.
Making Side Tasks and the Hands-On Guide: extending the testbed with new environments and tasks.
Safety Curves: turning monitor scores into safety/usefulness numbers.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
A_MonitorBuilding		A_MonitorBuilding
B_ProtocolDesign		B_ProtocolDesign
C_AttackCrafting		C_AttackCrafting
D_ProtocolBreaking		D_ProtocolBreaking
E_NewControls		E_NewControls
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HackOS 5 — AI Control Hackathon · Quickstart

Shared setup (all tracks)

Pick a track

The shared vocabulary

Judging & deliverables

Docs worth skimming at least once

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

HackOS 5 — AI Control Hackathon · Quickstart

Shared setup (all tracks)

Pick a track

The shared vocabulary

Judging & deliverables

Docs worth skimming at least once

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages