tbtommyb/interpretability

Interpretability & Evals — a self-directed course

A self-directed course I'm working through to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in agent-system-level interpretability, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.

The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.

A note on authorship. The notebook scaffolds were LLM-generated to my specification: the structure, the verified API calls, and the synthetic warm-ups. The load-bearing work — the # TODO exercises, the real-model experiments, the results, and the write-ups — is mine, done by hand as I work through the course. The Progress section below tracks exactly what has been completed versus what is still scaffold, and each Reproductions rung gets a RESULTS.md experiment log as I finish it.

Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.

What's here

| Part | Directory | What it is |
| --- | --- | --- |
| 1: Techniques | notebook-syllabus/ | 12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, model editing, all on toy, inspectable tasks. |
| 2: Reproductions | Reproductions/ | 7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing. |
| Capstone target | projects/tact-reproduction/ | The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data. |

The full paper list, with public sources, is in reading-list.md.

Progress

Scaffolds exist for everything below; a ticked box means I have worked through it myself — exercises filled, real runs executed, results committed. Until a box is ticked, treat that notebook as a plan, not a result.

Part 1 — techniques (00–02 are committed with executed outputs; 03–11 ship unexecuted)

  • 00 Geometry and the residual stream
  • 01 TransformerLens: hooks, the cache, and the experimental loop
  • 02 DLA, logit lens, tuned lens
  • 03 Linear probes: decoding vs causality
  • 04 Patching: activation, attribution, path
  • 05 Sparse autoencoders
  • 06 Steering and representation engineering
  • 07 Circuit discovery and validation
  • 08 Knowledge localization and model editing
  • 09 How to reimplement a paper and design a mini-project
  • 10 From big ideas to small ~1B-model experiments
  • 11 Induction heads: a known-circuit case study

Part 2 — reproductions (each rung gets a RESULTS.md experiment log as it is worked)

  • 01 Agent trace evals — real trajectories collected, judge labels validated
  • 02 Linear probes on agent states — layerwise AUROC on real Qwen3 hidden states
  • 03 Activation steering — on-target and off-target effects measured
  • 04 Sparse autoencoders — probe-vs-feature comparison run
  • 05 Long-conversation monitoring — per-turn drift traces on a real conversation
  • 06 TACT capstone — label_step and make_steering_hook implemented, end to end on real trajectories
  • 07 Circuit tracing — optional breadth, needs GPU/Colab

Milestones

  • GPT-2 from scratch — attention/MLP/LayerNorm by hand, logits verified against gpt2-small
  • Capstone write-up posted — results, controls, what broke; a clean negative result counts

Techniques and papers covered

| Area | Technique | Where |
| --- | --- | --- |
| Probing | linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes | syllabus 03, Reproductions 02 |
| Steering / RepE | contrast-vector steering via forward hooks, side-effect analysis | syllabus 06, Reproductions 03 |
| Sparse autoencoders | dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope) | syllabus 05, Reproductions 04 |
| Activation patching | path & attribution patching, causal interventions | syllabus 04 |
| Circuit discovery | ACDC-style pruning, induction-head case study | syllabus 07, 11 |
| Readout methods | logit lens, tuned lens, direct logit attribution | syllabus 02 |
| Knowledge localisation | ROME-style factual editing | syllabus 08 |
| Attribution graphs | transcoders + circuit tracing (Anthropic 2025) | Reproductions 07 |
| Black-box agent evals | Inspect / Petri, LLM-as-judge, scenario design | Reproductions 01 |
| Trajectory monitoring | per-turn projection, representation drift over long context | Reproductions 05 |
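
To make the probing and trajectory-monitoring rows concrete, here is a minimal pure-Python sketch of the general idea: build a contrast axis from two labelled groups of hidden states, project each turn onto it, and score the projection with AUROC. It is not code from the notebooks. The function names (`contrast_axis`, `project`, `auroc`), the difference-of-means construction, and the 2-dimensional toy vectors are all assumptions chosen for illustration; real runs would use residual-stream activations from a model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrast_axis(positive: Sequence[Vector], negative: Sequence[Vector]) -> List[float]:
    """Unit-norm difference of class means (one simple way to get a contrast direction)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(state: Vector, axis: Vector) -> float:
    """Scalar projection of one hidden state onto the axis (a dot product)."""
    return sum(s * a for s, a in zip(state, axis))


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy "hidden states": two on-task and two drifted examples.
on_task = [[1.0, 0.0], [0.8, 0.2]]
drifted = [[0.0, 1.0], [0.2, 0.8]]
axis = contrast_axis(drifted, on_task)

# One synthetic conversation whose state moves from on-task toward drifted.
turns = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
trace = [project(h, axis) for h in turns]  # per-turn drift trace
labels = [0, 0, 1, 1]                      # 1 = turn labelled as drifted
score = auroc(trace, labels)
print(trace, score)
```

The pairwise AUROC here is quadratic in the number of examples, which is fine for a toy trace; on real data a library implementation would be the usual choice.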

Setup

One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).
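
The version window can also be checked at runtime. The sketch below is a hypothetical helper, not part of this repo: `parse_release` and `in_supported_range` are names I made up, and the parsing is deliberately crude (it reads only the leading numeric components of the version string, so a pre-release such as `5.0.0rc1` is treated as 5.0). It uses only the standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Leading numeric components of a version string; stops at the first non-numeric piece."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def in_supported_range(
    version_string: str,
    low: Tuple[int, ...] = (4, 51),
    high: Tuple[int, ...] = (5, 0),
) -> bool:
    """True if the version falls in the half-open range [low, high)."""
    return low <= parse_release(version_string) < high


try:
    installed = version("transformers")
    status = "ok" if in_supported_range(installed) else "outside [4.51, 5.0)"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed in this environment")
```

For anything stricter than a quick sanity check, a proper version parser (for example the `packaging` library) would handle pre-release and local version tags more carefully than this tuple comparison.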

$ setup.sh
$ start.sh

start.sh serves JupyterLab at a fixed address — http://localhost:8888/lab?token=interp — so a tab left open from a previous session reconnects cleanly instead of hanging at [*]. Open any notebook and select the Interpretability kernel.

A few notebooks pull heavier or GPU-oriented tools installed separately — circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), inspect-petri (rung 01). Each notebook says so above the relevant cell.

If a Part 1 notebook misbehaves on transformers 4.57.x, pin transformers==4.46.3 (the version those notebooks were originally verified against) in a separate environment.

Running the tests

# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q

Layout

notebook-syllabus/       Part 1 — 12 white-box technique notebooks + shared helpers
Reproductions/           Part 2 — 7 reproduction rungs (each: notebook, README, RESULTS.md log, tests)
projects/tact-reproduction/   capstone target — synthetic TACT mechanism + tested axis math
reading-list.md          every paper, with public sources
requirements.txt         the shared environment
