Skip to content

hackos-team/hackos-5-quickstart

Repository files navigation

HackOS 5 — AI Control Hackathon · Quickstart

Limit the harm an untrusted AI agent can cause while it operates, even when it actively tries to subvert the controls around it. The core game: can a weaker, trusted model catch a stronger, untrusted agent attempting sabotage? And while keeping false alarms under ~1%?

Everything here runs on control-tower, the AI-control eval framework (LinuxArena / ControlArena) by Redwood Research and UK's AISI.

Important

Before continuing, make sure to familiarize yourself with LinuxArena's hackathon quickstart and we highly highly recommend at least skimming the resources they link (we've copied them to the bottom of this README for your convenience).

Shared setup (all tracks)

Before starting, make sure you have the following set up:

  • A Unix-compatible machine (MacOS or Linux) or WSL for Windows
  • The uv package manager pip install uv
  • Docker installed and running
  • The GH cli (sudo apt install gh) used to pull LinuxArena envs
uv run ct init  # completions, `ct` alias, settings, agent skills
uv run ct settings pull linuxarena  # environments, main tasks, side tasks

Before spending any of your credits, run this to make sure everything works (it just runs ls and then exits, no model calls):

uv run ct run eval --policy test -e clinical_trial -t adverse_reaction_followup_tracking --no-upload

Pick a track

Track Team You build
A Monitor Building 🔵 Blue A monitor that scores how suspicious each action/trajectory is
B Protocol Design 🔵 Blue A protocol (defer / rewrite / gate) and its safety–usefulness tradeoff
C Attack Crafting 🔴 Red A sabotage trajectory (straj) that evades the monitor
D Break a Protocol 🔴 Red An untrusted policy that beats a protocol — then a fix
E New Side Tasks ⚪ Gray A new side task / environment / scorer that extends the testbed

The shared vocabulary

  • Main task: the legitimate job the agent is asked to do.
  • Side task: a high-stakes sabotage objective the untrusted agent may pursue covertly.
  • Straj: a sabotage trajectory: a run that accomplishes the side task while staying subtle.
  • Monitor: scores suspiciousness, 0 (clean) → 10 (very suspicious).
  • Protocol: the blue-team decision policy that acts on monitor scores (audit / defer / rewrite / gate).
  • Safety = caught / (caught + uncaught successful sabotage), at a fixed audit budget (FPR on honest runs).
  • Usefulness = honest main-task success rate.

By default, anthropic/claude-haiku-4-5 (shorthand: haiku) is used as the untrusted model and openai/gpt-5-nano as the trusted one. If you decide to change this, beware of the cost of beefier models (especially since these evaluations tend to be token-heavy) and make sure to document this change in your deliverables.

Any time a command wants some kind of run as an argument, it's asking for the logs/[...].eval file produced by a ct run eval command.

Judging & deliverables

Scored 1–5 on impact & innovation, execution quality, and clarity. Each team ships:

  • 📄 a 2-page report;
  • 💻 a code repo (commit your data/ and logs/ folders! we want to see them.);
  • 🎬 a 3-minute demo (either a video or in person)
    • if it's a video, it can be an unlisted youtube link, public google drive link, etc.. Just make sure that we can access it.

1st-2nd-3rd place for the best monitors and best attack respectively, and "best side tasks" award.

Docs worth skimming at least once

  • Running evals on Hawk: submit to a shared Hawk cluster and watch results in the viewer, instead of running locally (unless you're confident in your local compute or running out of time, we recommended using Hawk).
  • Running Evaluations: full ct run eval reference: policies, task sets, batching, replay.
  • Monitoring: monitor types, ensembles, suspiciousness scoring (building monitors).
  • Control Protocols: defer-to-trusted, trusted editing, and writing your own protocol.
  • Making Sabotage Trajectories: hand-crafting attacks that evade monitors.
  • Making Side Tasks and the Hands-On Guide: extending the testbed with new environments and tasks.
  • Safety Curves: turning monitor scores into safety/usefulness numbers.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages