A self-directed course I'm working through to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in agent-system-level interpretability, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.
The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.
A note on authorship. The notebook scaffolds were LLM-generated to my specification: the structure, the verified API calls, and the synthetic warm-ups. The load-bearing work — the # TODO exercises, the real-model experiments, the results, and the write-ups — is mine, done by hand as I work through the course. The Progress section below tracks exactly what has been completed versus what is still scaffold, and each Reproductions rung gets a RESULTS.md experiment log as I finish it.
Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.
| Part | Directory | What it is |
|---|---|---|
| 1 — Techniques | notebook-syllabus/ | 12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, model editing — on toy, inspectable tasks. |
| 2 — Reproductions | Reproductions/ | 7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing. |
| Capstone target | projects/tact-reproduction/ | The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data. |
The full paper list, with public sources, is in reading-list.md.
Scaffolds exist for everything below; a ticked box means I have worked through it myself — exercises filled, real runs executed, results committed. Until a box is ticked, treat that notebook as a plan, not a result.
Part 1 — techniques (00–02 are committed with executed outputs; 03–11 ship unexecuted)
- 00 Geometry and the residual stream
- 01 TransformerLens: hooks, the cache, and the experimental loop
- 02 DLA, logit lens, tuned lens
- 03 Linear probes: decoding vs causality
- 04 Patching: activation, attribution, path
- 05 Sparse autoencoders
- 06 Steering and representation engineering
- 07 Circuit discovery and validation
- 08 Knowledge localization and model editing
- 09 How to reimplement a paper and design a mini-project
- 10 From big ideas to small ~1B-model experiments
- 11 Induction heads: a known-circuit case study
Part 2 — reproductions (each rung gets a RESULTS.md experiment log as it is worked)
- 01 Agent trace evals — real trajectories collected, judge labels validated
- 02 Linear probes on agent states — layerwise AUROC on real Qwen3 hidden states
- 03 Activation steering — on-target and off-target effects measured
- 04 Sparse autoencoders — probe-vs-feature comparison run
- 05 Long-conversation monitoring — per-turn drift traces on a real conversation
- 06 TACT capstone — label_step and make_steering_hook implemented, end to end on real trajectories
- 07 Circuit tracing — optional breadth, needs GPU/Colab
Milestones
- GPT-2 from scratch — attention/MLP/LayerNorm by hand, logits verified against gpt2-small
- Capstone write-up posted — results, controls, what broke; a clean negative result counts
| Area | Technique | Where |
|---|---|---|
| Probing | linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes | syllabus 03, Reproductions 02 |
| Steering / RepE | contrast-vector steering via forward hooks, side-effect analysis | syllabus 06, Reproductions 03 |
| Sparse autoencoders | dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope) | syllabus 05, Reproductions 04 |
| Activation patching | path & attribution patching, causal interventions | syllabus 04 |
| Circuit discovery | ACDC-style pruning, induction-head case study | syllabus 07, 11 |
| Readout methods | logit lens, tuned lens, direct logit attribution | syllabus 02 |
| Knowledge localisation | ROME-style factual editing | syllabus 08 |
| Attribution graphs | transcoders + circuit tracing (Anthropic 2025) | Reproductions 07 |
| Black-box agent evals | Inspect / Petri, LLM-as-judge, scenario design | Reproductions 01 |
| Trajectory monitoring | per-turn projection, representation drift over long context | Reproductions 05 |
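
Several rows above (contrast-vector steering, probe AUROC, per-turn projection) share one piece of geometry: pick a direction in activation space, project states onto it, and score the result. The sketch below illustrates that idea on synthetic vectors only. It is not the course's or the TACT paper's implementation: every function name is hypothetical, the "activations" are Gaussian noise with one shifted dimension, and the difference of class means stands in for whatever axis a real experiment would fit.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrast_direction(pos, neg):
    """Unit-norm difference of class means (a simple contrast vector)."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(x, direction):
    """Scalar projection of one state onto the direction."""
    return sum(a * b for a, b in zip(x, direction))


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def synthetic_states(rng, n, dim, shift, noise=1.0):
    """Assumption: noise vectors whose dimension 0 is offset by `shift`."""
    return [
        [rng.gauss(0.0, noise) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
drifted = synthetic_states(rng, 200, 8, shift=3.0)
on_task = synthetic_states(rng, 200, 8, shift=0.0)

# Fit the axis, then score how well projection alone separates the classes.
# The score is in-sample here; a real probe would be evaluated on held-out data.
axis = contrast_direction(drifted, on_task)
probe_auroc = auroc(
    [project(x, axis) for x in drifted],
    [project(x, axis) for x in on_task],
)

# A toy 10-turn trajectory whose drift grows linearly from 0 to 3.
trajectory = [
    synthetic_states(rng, 1, 8, shift=3.0 * t / 9, noise=0.1)[0]
    for t in range(10)
]
trace = [project(x, axis) for x in trajectory]

print(f"probe AUROC: {probe_auroc:.3f}")
print("per-turn projection:", [round(v, 2) for v in trace])
```

On real hidden states the same three steps apply, but the direction would be fitted on one split and evaluated on another, and the per-turn trace would be read from a chosen layer's residual stream rather than generated.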
One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).
$ setup.sh
$ start.sh

start.sh serves JupyterLab at a fixed address, http://localhost:8888/lab?token=interp, so a tab left open from a previous session reconnects cleanly instead of hanging at [*]. Open any notebook and select the Interpretability kernel.
A few notebooks pull heavier or GPU-oriented tools installed separately — circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), inspect-petri (rung 01). Each notebook says so above the relevant cell.
If a Part 1 notebook misbehaves on transformers 4.57.x, pin transformers==4.46.3 (the version those notebooks were originally verified against) in a separate environment.
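
The version window described in this section, [4.51, 5.0), can be checked from inside a notebook before anything heavy is imported. This is a minimal sketch, not part of setup.sh or requirements.txt: the helper names are hypothetical, and the parser only reads the leading numeric components, so a pre-release such as 5.0.0rc1 is treated as 5.0.0 and rejected.

```python
import re
from importlib import metadata


def parse_version(text):
    """Leading numeric release components, e.g. '4.57.6' -> (4, 57, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_course_range(text, low=(4, 51), high=(5, 0)):
    """True when low <= version < high (hypothetical helper for this README)."""
    return low <= parse_version(text) < high


# Look up the installed version without importing the package itself.
try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed in this environment")
elif in_course_range(installed):
    print(f"transformers {installed} is inside [4.51, 5.0)")
else:
    print(f"transformers {installed} is outside [4.51, 5.0)")
```

By this check the 4.46.3 fallback pin is out of range, which is consistent with keeping it in a separate environment.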
# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q

notebook-syllabus/ Part 1 — 12 white-box technique notebooks + shared helpers
Reproductions/ Part 2 — 7 reproduction rungs (each: notebook, README, RESULTS.md log, tests)
projects/tact-reproduction/ capstone target — synthetic TACT mechanism + tested axis math
reading-list.md every paper, with public sources
requirements.txt the shared environment