Interpretability & Evals — a self-directed course

A self-directed course I'm working through to learn the key AI interpretability techniques and work up to reproducing and extending papers. I'm interested in interpretability at the agent-system level, so I've set TACT (detecting and steering coding-agent drift in the residual stream; Sui et al., 2026) as my target.

The notebooks build up from foundational mathematics, through the key interpretability techniques (black-box and white-box), to a laptop-scale reproduction of the TACT paper.

A note on authorship. The notebook scaffolds were LLM-generated to my specification: the structure, the verified API calls, and the synthetic warm-ups. The load-bearing work — the # TODO exercises, the real-model experiments, the results, and the write-ups — is mine, done by hand as I work through the course. The Progress section below tracks exactly what has been completed versus what is still scaffold, and each Reproductions rung gets a RESULTS.md experiment log as I finish it.

Everything is sized for a single Apple-Silicon laptop: small models (gpt2-small, pythia-70m, Qwen3-1.7B), pretrained SAEs, and synthetic warm-ups before any real model is touched.

What's here

Part	Directory	What it is
1 — Techniques	`notebook-syllabus/`	12 worked notebooks on the white-box toolkit: probes, SAEs, steering, activation patching, circuit discovery, logit/tuned lens, model editing — on toy, inspectable tasks.
2 — Reproductions	`Reproductions/`	7 paper reproductions, each building one capability toward the capstone: agent evals, linear probes, steering, SAEs, long-conversation monitoring, the TACT capstone, and circuit tracing.
Capstone target	`projects/tact-reproduction/`	The TACT mechanism in miniature: synthetic drift geometry, unit-tested axis math, and the "Replacement Point" the reproductions teach you to fill with real data.

The full paper list, with public sources, is in reading-list.md.

Progress

Scaffolds exist for everything below; a ticked box means I have worked through it myself — exercises filled, real runs executed, results committed. Until a box is ticked, treat that notebook as a plan, not a result.

Part 1 — techniques (00–02 are committed with executed outputs; 03–11 ship unexecuted)

Part 2 — reproductions (each rung gets a RESULTS.md experiment log as it is worked)

01 Agent trace evals — real trajectories collected, judge labels validated
02 Linear probes on agent states — layerwise AUROC on real Qwen3 hidden states
03 Activation steering — on-target and off-target effects measured
04 Sparse autoencoders — probe-vs-feature comparison run
05 Long-conversation monitoring — per-turn drift traces on a real conversation
06 TACT capstone — label_step and make_steering_hook implemented, end to end on real trajectories
07 Circuit tracing — optional breadth, needs GPU/Colab

Milestones

GPT-2 from scratch — attention/MLP/LayerNorm by hand, logits verified against gpt2-small
Capstone write-up posted — results, controls, what broke; a clean negative result counts

Techniques and papers covered

Area	Technique	Where
Probing	linear & unsupervised (CCS) probes, AUROC, deception / eval-awareness probes	syllabus 03, Reproductions 02
Steering / RepE	contrast-vector steering via forward hooks, side-effect analysis	syllabus 06, Reproductions 03
Sparse autoencoders	dictionary learning, TopK, pretrained SAEs (SAELens / Gemma Scope)	syllabus 05, Reproductions 04
Activation patching	path & attribution patching, causal interventions	syllabus 04
Circuit discovery	ACDC-style pruning, induction-head case study	syllabus 07, 11
Readout methods	logit lens, tuned lens, direct logit attribution	syllabus 02
Knowledge localisation	ROME-style factual editing	syllabus 08
Attribution graphs	transcoders + circuit tracing (Anthropic 2025)	Reproductions 07
Black-box agent evals	Inspect / Petri, LLM-as-judge, scenario design	Reproductions 01
Trajectory monitoring	per-turn projection, representation drift over long context	Reproductions 05
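
As a flavour of what the probing and steering rows involve, here is a minimal sketch (not taken from the notebooks) of a difference-of-means contrast direction scored by AUROC. It runs on synthetic Gaussian "activations" rather than real hidden states, and every function name in it (`contrast_direction`, `project`, `auroc`, `make_synthetic`) is hypothetical and chosen only for illustration.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrast_direction(pos, neg):
    """Difference of class means: one simple way to get a contrast vector."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def project(rows, direction):
    """Dot product of each row with the direction (one scalar score per row)."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the O(n*m) pairwise definition,
    which is fine for small illustrative samples.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def make_synthetic(n, dim, shift, rng):
    """Assumed toy data: unit Gaussians, with the class signal on dimension 0."""
    return [
        [rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_pos = make_synthetic(200, 8, 2.0, rng)
train_neg = make_synthetic(200, 8, 0.0, rng)
test_pos = make_synthetic(200, 8, 2.0, rng)
test_neg = make_synthetic(200, 8, 0.0, rng)

# Fit the direction on the training split, score it on the held-out split.
direction = contrast_direction(train_pos, train_neg)
score = auroc(project(test_pos, direction), project(test_neg, direction))
print(f"held-out AUROC: {score:.3f}")
```

On real models the rows would be hidden states captured at one layer, and repeating the fit per layer gives a layerwise AUROC curve. The same direction, scaled and added back into the residual stream, is the starting point for contrast-vector steering.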

Setup

One shared environment runs the whole course. The only subtlety is the transformers version: TransformerLens (used in several notebooks) needs transformers < 5 (it relies on TRANSFORMERS_CACHE, removed in 5.0), while Qwen3 needs transformers >= 4.51. The range [4.51, 5.0) satisfies both, and requirements.txt pins it (verified with transformers 4.57.6 + transformer-lens 2.15.4 + sae-lens 6.5.3 on Python 3.12 / mps).
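
The version constraint above can be checked with a small standalone script. This is a sketch under stated assumptions rather than part of `setup.sh`: the helper names `parse_version` and `in_supported_range` are hypothetical, and the parser only reads the leading integer of each dotted component.

```python
import re

LOWER = (4, 51)  # Qwen3 needs transformers >= 4.51
UPPER = (5, 0)   # TransformerLens needs transformers < 5


def parse_version(version: str) -> tuple:
    """Turn '4.57.6' or '4.57.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def in_supported_range(version: str) -> bool:
    """True when the version lies in the half-open range [4.51, 5.0)."""
    return LOWER <= parse_version(version) < UPPER


for candidate in ["4.46.3", "4.51.0", "4.57.6", "5.0.0"]:
    print(candidate, in_supported_range(candidate))
```

To test the active environment, pass `importlib.metadata.version("transformers")` to `in_supported_range`. Note that 4.46.3, the fallback pin mentioned below for Part 1, falls outside this range by design, since it predates Qwen3 support.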

$ setup.sh
$ start.sh

start.sh serves JupyterLab at a fixed address — http://localhost:8888/lab?token=interp — so a tab left open from a previous session reconnects cleanly instead of hanging at [*]. Open any notebook and select the Interpretability kernel.

A few notebooks pull heavier or GPU-oriented tools installed separately — circuit-tracer (rung 07, GPU/Colab), mini-swe-agent (capstone trajectory collection), inspect-petri (rung 01). Each notebook says so above the relevant cell.

If a Part 1 notebook misbehaves on transformers 4.57.x, pin transformers==4.46.3 (the version those notebooks were originally verified against) in a separate environment.

Running the tests

# from a rung directory, e.g. Reproductions/02-linear-probes-on-agent-states
python -m pytest tests -q

Layout

notebook-syllabus/       Part 1 — 12 white-box technique notebooks + shared helpers
Reproductions/           Part 2 — 7 reproduction rungs (each: notebook, README, RESULTS.md log, tests)
projects/tact-reproduction/   capstone target — synthetic TACT mechanism + tested axis math
reading-list.md          every paper, with public sources
requirements.txt         the shared environment
