diff --git a/CLAUDE.md b/CLAUDE.md index 264a81f..1d2544b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -208,3 +208,4 @@ Build the site after changes: `cd site && pnpm run build` (should produce 7 page - Pi provider: `toolpath-pi` reads Pi session JSONL from `~/.pi/agent/sessions/`. Sessions use a tree (id/parentId) in a single file, and may link to a parent file via `parentSession` in the header. The tree is preserved as a DAG in the derived `Path`. - Codex provider: `toolpath-codex` reads Codex CLI rollout files from `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. Sessions are date-bucketed (not project-keyed). File-change fidelity is excellent — Codex's `patch_apply_end` events carry either the unified diff (for updates) or the full file content (for adds), so the derived `Path` gets a real `raw` perspective on every file artifact. See `docs/agents/formats/codex.md` for the full format reference. - opencode provider: `toolpath-opencode` reads a SQLite database at `~/.local/share/opencode/opencode.db` (opened read-only). Each session's messages and 12 typed part variants (text, reasoning, tool, step-start/-finish, snapshot, patch, file, agent, subtask, retry, compaction) land as one step per message with tool invocations attached. File diffs come from a sibling bare git repo at `snapshot//[]/` via `git2` tree↔tree diffs — opencode respects the user's `.gitignore`, so changes under gitignored paths fall back to tool-input-derived structural changes with no `raw` perspective. Project id is the SHA of the repo's first root commit. See `docs/agents/formats/opencode.md` for the full format reference. +- Format references for the agent on-disk formats we derive from live at `docs/agents/formats/`. The Claude Code format (`~/.claude/projects/…` JSONL) gets the deepest treatment — twelve focused docs at `docs/agents/formats/claude-code/` covering envelope, entry types, tools, session chains, compaction, writing-compatible JSONL, a linear walkthrough, and a version-keyed changelog. Sibling single-file references: `codex.md`, `gemini.md`, `opencode.md`. Keep them in sync with their derive crates when fields or behaviors change. diff --git a/README.md b/README.md index 9894add..7c5d318 100644 --- a/README.md +++ b/README.md @@ -206,6 +206,8 @@ let md_string = render(&doc, &RenderOptions::default()); - [CHANGELOG.md](CHANGELOG.md) -- Release history - [schema/toolpath.schema.json](schema/toolpath.schema.json) -- JSON Schema - [examples/](examples/) -- 11 example documents covering steps, paths, and graphs +- [docs/agents/formats/](docs/agents/formats/README.md) -- Reference for the on-disk + formats emitted by agents we derive from (Claude Code today; more as they land) ## Requirements diff --git a/crates/toolpath-claude/README.md b/crates/toolpath-claude/README.md index 5cfdc2f..e98eff7 100644 --- a/crates/toolpath-claude/README.md +++ b/crates/toolpath-claude/README.md @@ -16,6 +16,12 @@ Reads Claude Code conversation data from `~/.claude/projects/` and provides: - **Derivation**: Map conversations to Toolpath Path documents - **Watching**: Monitor conversation files for live updates (feature-gated) +For the on-disk format itself — envelope fields, entry types, session chains, +compaction, and the empirical rules for writing Claude-compatible JSONL — see +[`docs/agents/formats/claude-code/`](https://github.com/empathic/toolpath/tree/main/docs/agents/formats/claude-code). +That directory is the authoritative reference; this crate is its reference +implementation. + ## Derivation Convert Claude conversations into Toolpath documents: diff --git a/crates/toolpath-claude/src/derive.rs b/crates/toolpath-claude/src/derive.rs index 62a15de..b2c7068 100644 --- a/crates/toolpath-claude/src/derive.rs +++ b/crates/toolpath-claude/src/derive.rs @@ -1687,22 +1687,26 @@ mod tests { // steps[0] = assistant turn, steps[1] = tool step (siblings). let tool_step = &path.steps[1]; let ch = &tool_step.change["/src/login.rs"]; - let raw = ch.raw.as_deref().expect("edit tool should emit unified diff"); + let raw = ch + .raw + .as_deref() + .expect("edit tool should emit unified diff"); // Leading `/` is stripped from the header so `a/`/`b/` don't double up // (git-style prefixes already denote the repo root). See #36. assert!(raw.contains("--- a/src/login.rs"), "{}", raw); assert!(raw.contains("+++ b/src/login.rs"), "{}", raw); - assert!(!raw.contains("a//"), "header should not double-slash: {}", raw); + assert!( + !raw.contains("a//"), + "header should not double-slash: {}", + raw + ); assert!(raw.contains("-validate_token()"), "{}", raw); assert!(raw.contains("+validate_token_v2()"), "{}", raw); // Sanity-check the parent wiring that the chat view relies on: // the tool step's parent is the assistant step, and they share // the same `entry.uuid` root so the frontend splice works. - assert_eq!( - tool_step.step.parents, - vec![path.steps[0].step.id.clone()] - ); + assert_eq!(tool_step.step.parents, vec![path.steps[0].step.id.clone()]); } // ── tool result assembly ────────────────────────────────────────── diff --git a/crates/toolpath-cli/src/bin/gen_synthetic_path.rs b/crates/toolpath-cli/src/bin/gen_synthetic_path.rs index 678fa0d..b77c9b9 100644 --- a/crates/toolpath-cli/src/bin/gen_synthetic_path.rs +++ b/crates/toolpath-cli/src/bin/gen_synthetic_path.rs @@ -60,11 +60,7 @@ const LOREM: &[&str] = &[ "sed ut perspiciatis unde omnis iste natus error sit voluptatem", ]; -const TOOLS: &[(&str, f64)] = &[ - ("Edit", 0.50), - ("Write", 0.30), - ("MultiEdit", 0.20), -]; +const TOOLS: &[(&str, f64)] = &[("Edit", 0.50), ("Write", 0.30), ("MultiEdit", 0.20)]; const FILES: &[&str] = &[ "src/main.rs", @@ -101,7 +97,10 @@ fn pick_tool(rng: &mut StdRng) -> &'static str { fn synth_diff(rng: &mut StdRng, path: &str) -> String { let lines = rng.random_range(3..12); - let mut s = format!("--- a/{}\n+++ b/{}\n@@ -1,{} +1,{} @@\n", path, path, lines, lines); + let mut s = format!( + "--- a/{}\n+++ b/{}\n@@ -1,{} +1,{} @@\n", + path, path, lines, lines + ); for i in 0..lines { if rng.random_bool(0.5) { s.push_str(&format!("-old_line_{} = value;\n", i)); diff --git a/crates/toolpath-cli/src/cmd_incept.rs b/crates/toolpath-cli/src/cmd_incept.rs index 546ba54..d4f2a1b 100644 --- a/crates/toolpath-cli/src/cmd_incept.rs +++ b/crates/toolpath-cli/src/cmd_incept.rs @@ -1,5 +1,10 @@ //! `path incept` — project a toolpath document into a Claude session //! that Claude Code can load and resume. +//! +//! Format rules this command obeys are documented at +//! `docs/agents/formats/claude-code/writing-compatible-jsonl.md`. When a new +//! empirical constraint is discovered here, capture it there in the same +//! change. use anyhow::Result; use std::io::Read; diff --git a/crates/toolpath-convo/src/derive.rs b/crates/toolpath-convo/src/derive.rs index 6d9171a..cdda1d5 100644 --- a/crates/toolpath-convo/src/derive.rs +++ b/crates/toolpath-convo/src/derive.rs @@ -326,7 +326,10 @@ fn file_write_change( extra.insert("edits".to_string(), serde_json::Value::Array(edits.clone())); } - (file_write_diff(&tool.name, input, path, before_state), extra) + ( + file_write_diff(&tool.name, input, path, before_state), + extra, + ) } /// Compute a unified diff string for a file-write tool invocation, given the @@ -675,8 +678,8 @@ mod tests { "file_path": "hello.txt", "content": "hi\nthere\n", }); - let raw = file_write_diff("Write", &input, "hello.txt", None) - .expect("write should emit diff"); + let raw = + file_write_diff("Write", &input, "hello.txt", None).expect("write should emit diff"); assert!(raw.contains("+hi")); assert!(raw.contains("+there")); // No `-` lines — nothing was there before. diff --git a/crates/toolpath-desktop/src/tray.rs b/crates/toolpath-desktop/src/tray.rs index aa9b08d..01f17f2 100644 --- a/crates/toolpath-desktop/src/tray.rs +++ b/crates/toolpath-desktop/src/tray.rs @@ -362,11 +362,7 @@ pub fn tray_open_trace( basename_slug(&project), short(&session_id) ); - ( - value, - format!("Claude: {}", basename(&project)), - filename, - ) + (value, format!("Claude: {}", basename(&project)), filename) } "pi" => { let value = crate::commands::derive::derive_pi( @@ -380,11 +376,7 @@ pub fn tray_open_trace( basename_slug(&project), short(&session_id) ); - ( - value, - format!("pi.dev: {}", basename(&project)), - filename, - ) + (value, format!("pi.dev: {}", basename(&project)), filename) } // Not wired up in the desktop backend yet. The popover disables // rows for these, but we still reject politely if one slips through. @@ -574,7 +566,10 @@ mod tests { // produce a well-formed snapshot with all five provider slots. let s = collect_stats(); let providers: Vec<_> = s.counts.iter().map(|c| c.provider).collect(); - assert_eq!(providers, vec!["claude", "gemini", "codex", "opencode", "pi"]); + assert_eq!( + providers, + vec!["claude", "gemini", "codex", "opencode", "pi"] + ); } #[test] diff --git a/docs/agents/formats/README.md b/docs/agents/formats/README.md new file mode 100644 index 0000000..ff53f88 --- /dev/null +++ b/docs/agents/formats/README.md @@ -0,0 +1,64 @@ +# Agent session formats + +This directory holds our working reference for the on-disk formats emitted by +coding agents whose sessions we derive `toolpath` documents from. These are the +documents we would like external consumers (other toolpath crates, workshop, +etc.) to be able to trust without having to reverse-engineer the format +themselves from a sampled `~/.claude/projects/…` directory. + +The goal is **practitioner-grade reference**: exactly what fields appear, what +they mean, where the format has quirks or bugs, and how our own code copes with +them. Not a spec — we don't own any of these formats. But close enough that a +new contributor can add a derivation or a projector without a week of cargo- +culting. + +## Contents + +- **[`claude-code/`](claude-code/README.md)** — Claude Code + (`~/.claude/projects/…` JSONL). Split into focused docs covering + directory layout, the JSONL line envelope, entry types, the `message` + object and content parts, tool invocation lifecycle, session chains + and compaction, peripheral files, writing-compatible JSONL, known + issues, a line-by-line walkthrough, and a version-keyed format + changelog. Each revision of the reference carries a date stamp at + the top of the subdirectory's README. +- **[`codex.md`](codex.md)** — Codex CLI rollout files under + `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl`. Single-file reference + covering the date-bucketed session format and the `patch_apply_end` + events that drive file-change fidelity. +- **[`gemini.md`](gemini.md)** — Gemini CLI chats under + `~/.gemini/tmp//chats/`, including the main-file + sibling + sub-agent UUID directory layout. +- **[`opencode.md`](opencode.md)** — opencode's SQLite database + (`~/.local/share/opencode/opencode.db`), its 12 typed message-part + variants, and the sibling bare-git snapshot repo used for file diffs. + +The Claude Code reference is the most detailed because it's the +longest-standing provider and has the most moving parts (JSONL +envelope variants, session chaining, compaction, sidechains, and the +loader's own undocumented strictness on what it will accept). The +other three sit in single files because their formats are either +simpler or sufficiently covered there. + +## Conventions used in these docs + +- **"In the wild"** = observed in real JSONL files on disk, not just in types + we've defined. +- **Field tables** show the name as it appears in JSON (so `parentUuid`, not + `parent_uuid`), its shape, and whether it's optional. "Optional" means we've + seen entries without it; "required" means we've never seen an entry missing + it (not that the format promises it'll always be there). +- **Citations** point either to files under this repo (`crates//src/…`) + or to external sources (marked with URLs). Repo citations dominate — we + trust our own parsers and tests more than we trust blog posts. +- **Version numbers** when quoted (e.g. "Claude Code 2.1.90") are what we've + seen in sample data, not what Anthropic has officially tagged a format + change at. + +## Maintenance + +When `toolpath-claude` (or its siblings) learns about a new field, entry type, +or edge case, update the corresponding doc here in the same change. The point +of this directory is to be the single place where format knowledge +accumulates; if the knowledge only lives in code comments or commit messages, +it effectively doesn't exist. diff --git a/docs/agents/formats/claude-code/README.md b/docs/agents/formats/claude-code/README.md new file mode 100644 index 0000000..a877377 --- /dev/null +++ b/docs/agents/formats/claude-code/README.md @@ -0,0 +1,169 @@ +# Claude Code on-disk format + +> **Reference revision:** 2026-04-23 +> **Tracks:** Claude Code 2.1.x +> **First-hand samples:** 2.1.37, 2.1.90, 2.1.110, 2.1.112 +> +> When you change anything in this directory, bump the revision date +> here and in [format-changelog.md](format-changelog.md) so downstream +> readers can tell whether a cited rule is current. + +Claude Code (Anthropic's CLI coding agent) persists every conversation, +every settings change, and a fair amount of supporting state to disk +under `~/.claude/`. Anthropic documents some of this in passing but has +never published a specification of the JSONL line format, the session +directory layout, or the rules that govern things like session chaining +and compaction. + +These documents are our working reference for what Claude Code actually +writes to disk, at what version, and why. The target audience is anyone +building a tool that reads, writes, or transforms Claude Code session +data. + +## How the docs are organized + +Each doc is focused on one aspect of the format. Read them in this order +if you're new; otherwise, skip to what you need. If you prefer concrete +examples to field catalogues, start with the **walkthrough** (#11) and +use the reference docs for lookup. + +1. **[directory-layout.md](directory-layout.md)** — what files and + directories live under `~/.claude/` and how they're named. +2. **[jsonl-envelope.md](jsonl-envelope.md)** — the top-level fields that + wrap every line of a session JSONL. +3. **[entry-types.md](entry-types.md)** — the `type` discriminant and + every entry variant we've observed, including sidechains. +4. **[messages.md](messages.md)** — the `message` object, content-part + types (text/thinking/tool_use/tool_result), and role values. +5. **[tools.md](tools.md)** — how tool calls are recorded: + `tool_use`/`tool_result` pairing, `tool_use_id` linkage, and the + per-tool shape of the top-level `toolUseResult` summary. +6. **[usage.md](usage.md)** — `message.usage` and the prompt-cache TTL + breakdown. +7. **[session-chains.md](session-chains.md)** — when Claude Code rotates + to a new file, how continuations are signalled, and the `compact_boundary` + mechanic. +8. **[peripheral-files.md](peripheral-files.md)** — everything that + isn't a session JSONL: `history.jsonl`, `todos/`, `shell-snapshots/`, + `file-history/`, `statsig/`, etc. +9. **[writing-compatible-jsonl.md](writing-compatible-jsonl.md)** — + empirically discovered constraints if you want Claude Code to load + JSONL that your tool produced. +10. **[known-issues.md](known-issues.md)** — format-level bugs, + corruption modes, and version drift to defend against. +11. **[walkthrough.md](walkthrough.md)** — a representative session + read linearly, line by line, with cross-links back to the + reference docs at each step. +12. **[format-changelog.md](format-changelog.md)** — version-keyed + record of field and behavior changes across Claude Code releases. + +## Scope and sourcing + +The format is undocumented at the envelope level. These docs are +compiled from: + +- Inspection of real session files under `~/.claude/projects/` across + Claude Code versions 2.1.37 through 2.1.112. +- Reading code that produces or consumes the format (both ours and other + tools in the local empathic monorepo). +- Round-trip experiments: writing JSONL and observing whether Claude + Code's loader accepts it. + +Where a claim is empirical or version-dependent, the doc says so. Where +it's a guess, the doc says so. + +## Conventions + +- **Field names** are shown as they appear in JSON. Envelope keys are + camelCase (`parentUuid`), API-adjacent keys inside `message` are + snake_case (`stop_reason`). +- **"Observed"** means seen in real files on disk. +- **"Expected"** means we have structural reasons to believe it's + there (e.g. a loader rejects it when absent) even if we haven't + enumerated it in samples. +- **Versions in parentheses** (e.g. "2.1.90+") indicate when a field or + behavior first appeared or changed, to the extent we know. +- **Keep headings anchor-stable.** Cross-links use GitHub's auto-anchors + (lowercased, punctuation stripped, spaces to hyphens). Avoid em-dashes + and other decorative punctuation in a heading that is, or might be, + linked to — they render inconsistently and tend to bite renames. + +## Field index + +Quick lookup: which doc defines a given field? Fields local to a single +entry type are grouped under it. + +| Field | Defined in | +|--------------------------------|------------| +| `agentId` | [jsonl-envelope.md](jsonl-envelope.md), [entry-types.md §Sidechains](entry-types.md#sidechains) | +| `attachment` | [jsonl-envelope.md](jsonl-envelope.md), [entry-types.md](entry-types.md) | +| `cache_creation` / `cache_*_input_tokens` | [usage.md](usage.md) | +| `caller` (on `tool_use`) | [messages.md](messages.md) | +| `compactMetadata` | [entry-types.md](entry-types.md), [session-chains.md](session-chains.md) | +| `content` (envelope, on `queue-operation`) | [entry-types.md](entry-types.md) | +| `content` (inside `message` / `tool_result`) | [messages.md](messages.md) | +| `cwd` | [jsonl-envelope.md](jsonl-envelope.md), [writing-compatible-jsonl.md](writing-compatible-jsonl.md) | +| `durationMs` | [entry-types.md](entry-types.md) | +| `entrypoint` | [jsonl-envelope.md](jsonl-envelope.md) | +| `gitBranch` | [jsonl-envelope.md](jsonl-envelope.md) | +| `hookCount` / `hookInfos` / `hookErrors` | [jsonl-envelope.md](jsonl-envelope.md) | +| `id` (inside `message`) | [messages.md](messages.md) | +| `id` (on `tool_use`) | [messages.md](messages.md), [tools.md](tools.md) | +| `iterations` | [usage.md](usage.md) | +| `inference_geo` | [usage.md](usage.md) | +| `input_tokens` / `output_tokens` | [usage.md](usage.md) | +| `isCompactSummary` | [entry-types.md](entry-types.md), [session-chains.md](session-chains.md) | +| `isMeta` | [jsonl-envelope.md](jsonl-envelope.md) | +| `isSidechain` | [jsonl-envelope.md](jsonl-envelope.md), [entry-types.md §Sidechains](entry-types.md#sidechains) | +| `isSnapshotUpdate` | [entry-types.md](entry-types.md) | +| `isVisibleInTranscriptOnly` | [entry-types.md](entry-types.md), [session-chains.md](session-chains.md) | +| `lastPrompt` | [entry-types.md](entry-types.md) | +| `leafUuid` | [entry-types.md](entry-types.md) | +| `level` | [jsonl-envelope.md](jsonl-envelope.md) | +| `logicalParentUuid` | [entry-types.md](entry-types.md), [session-chains.md](session-chains.md) | +| `message` | [messages.md](messages.md) | +| `messageCount` (envelope) | [entry-types.md](entry-types.md) | +| `messageId` | [jsonl-envelope.md](jsonl-envelope.md), [entry-types.md](entry-types.md), [known-issues.md](known-issues.md) | +| `model` (inside `message`) | [messages.md](messages.md) | +| `operation` | [entry-types.md](entry-types.md) | +| `parentUuid` | [jsonl-envelope.md](jsonl-envelope.md), [known-issues.md](known-issues.md) | +| `permissionMode` | [entry-types.md](entry-types.md), [writing-compatible-jsonl.md](writing-compatible-jsonl.md) | +| `preventedContinuation` | [jsonl-envelope.md](jsonl-envelope.md) | +| `requestId` | [jsonl-envelope.md](jsonl-envelope.md) | +| `role` | [messages.md](messages.md) | +| `server_tool_use` | [usage.md](usage.md) | +| `service_tier` | [usage.md](usage.md) | +| `sessionId` | [jsonl-envelope.md](jsonl-envelope.md), [session-chains.md](session-chains.md) | +| `signature` (on `thinking`) | [messages.md](messages.md), [writing-compatible-jsonl.md](writing-compatible-jsonl.md) | +| `slug` | [jsonl-envelope.md](jsonl-envelope.md), [session-chains.md](session-chains.md) | +| `snapshot` | [entry-types.md](entry-types.md) | +| `sourceToolAssistantUUID` | [jsonl-envelope.md](jsonl-envelope.md), [tools.md](tools.md) | +| `speed` | [usage.md](usage.md) | +| `stop_reason` / `stop_sequence` | [messages.md](messages.md), [known-issues.md](known-issues.md) | +| `stopReason` (envelope) | [jsonl-envelope.md](jsonl-envelope.md) | +| `subtype` | [entry-types.md](entry-types.md) | +| `summary` / `leafUuid` | [entry-types.md](entry-types.md) | +| `thinking` / `signature` | [messages.md](messages.md) | +| `thinkingMetadata` | [jsonl-envelope.md](jsonl-envelope.md) | +| `timestamp` | [jsonl-envelope.md](jsonl-envelope.md) | +| `tool_use` / `tool_result` | [messages.md](messages.md), [tools.md](tools.md) | +| `tool_use_id` | [messages.md](messages.md), [tools.md](tools.md) | +| `toolUseResult` | [jsonl-envelope.md](jsonl-envelope.md), [tools.md](tools.md) | +| `trackedFileBackups` | [entry-types.md](entry-types.md), [peripheral-files.md](peripheral-files.md) | +| `type` (envelope) | [entry-types.md](entry-types.md) | +| `type` (content-part) | [messages.md](messages.md) | +| `usage` | [usage.md](usage.md) | +| `userType` | [jsonl-envelope.md](jsonl-envelope.md) | +| `uuid` | [jsonl-envelope.md](jsonl-envelope.md) | +| `version` | [jsonl-envelope.md](jsonl-envelope.md) | + +For the mapping from these JSON keys to Rust fields in +`ConversationEntry`, see the parser-surface table in +[jsonl-envelope.md](jsonl-envelope.md#parser-surface-vs-format-surface). + +## Maintenance + +When a new field, entry type, or behavior shows up in the wild, update +the relevant doc in the same change. The index above is the table of +contents; keep it in sync. When you add or rename a field, update the +field index in this README too. diff --git a/docs/agents/formats/claude-code/directory-layout.md b/docs/agents/formats/claude-code/directory-layout.md new file mode 100644 index 0000000..cdcbbc1 --- /dev/null +++ b/docs/agents/formats/claude-code/directory-layout.md @@ -0,0 +1,80 @@ +# `~/.claude/` directory layout + +Everything Claude Code persists to disk lives under `~/.claude/` (on +Windows, `%USERPROFILE%\.claude\`). The directory is created on first +run and populated lazily as features are used. + +## The tree + +``` +~/.claude/ +├── projects/ # per-project session storage — the main artifact +│ └── / +│ ├── .jsonl +│ ├── .jsonl +│ └── … +├── history.jsonl # global user-prompt history (separate format) +├── settings.json # user-level config (permissions, hooks, etc.) +├── todos/ # per-session TodoWrite state (legacy; see peripheral-files.md) +├── shell-snapshots/ # zsh/bash env dumps captured at session start +├── file-history/ # content-addressed file backups for undo/rollback +├── session-env/ # per-session env-var snapshots +├── plans/ # plan-mode artifacts +├── sessions/ # session cache / index +├── statsig/ # Anthropic feature-flag cache +├── plugins/ # installed plugins +├── debug/ # per-session debug logs +├── paste-cache/ # clipboard history +├── backups/ # backups of ~/.claude.json +└── ide/ # per-IDE lock files (e.g. VS Code attach points) +``` + +Only `projects/` is necessary for reconstructing a conversation. The +rest is supporting state; see [peripheral-files.md](peripheral-files.md) +for the bits that matter. + +## `projects/` directory naming + +Each project directory is named after the **resolved absolute path** of +the working directory in which Claude Code was invoked, with `/` and +`_` characters replaced by `-`. + +| Original cwd | Directory name | +|---------------------------------------------------|----------------------------------------| +| `/Users/alex/Devel/empathic/toolpath` | `-Users-alex-Devel-empathic-toolpath` | +| `/Users/alex/my_project` | `-Users-alex-my-project` | +| `/home/bob/code` | `-home-bob-code` | + +### Sanitization is lossy + +Both `/` and `_` map to `-`, so you cannot round-trip perfectly. Any +tool unsanitizing a directory name can only guess where path separators +were. For the typical case (no underscores in directory names) it works. + +### Path canonicalization before sanitization + +Claude Code resolves the cwd to a canonical path before sanitizing. On +macOS in particular, `/tmp` is a symlink to `/private/tmp`: + +| Launched in | Resolves to | Directory name | +|---------------------|------------------------|----------------------| +| `/tmp/foo` | `/private/tmp/foo` | `-private-tmp-foo` | + +If you're building test fixtures or synthesizing directory paths, you +must canonicalize first. A tool that writes to `-tmp-foo/` will not +show up when Claude Code is launched in `/tmp/foo`. + +### Session files + +Each conversation is a single JSONL file named `.jsonl` +where `session-uuid` is a UUIDv4. A single *logical* conversation can +span multiple files if Claude Code rotated mid-session — see +[session-chains.md](session-chains.md). + +## `~/.claude.json` + +Distinct from `~/.claude/` (note the missing dot-dir). This is a single +JSON file at `$HOME/.claude.json` holding OAuth tokens, per-project +state (MCP servers, allowed tools, `lastSessionId`, trust-dialog state), +and other cross-session settings. Backups of it land in +`~/.claude/backups/`. diff --git a/docs/agents/formats/claude-code/entry-types.md b/docs/agents/formats/claude-code/entry-types.md new file mode 100644 index 0000000..b139aac --- /dev/null +++ b/docs/agents/formats/claude-code/entry-types.md @@ -0,0 +1,405 @@ +# Entry types + +Every session-JSONL line has a `type` field that identifies what kind of +entry it is. A real parser must handle all of the following; a parser +that assumes only `user` / `assistant` / `system` will silently drop +a large fraction of the file. + +## Summary table + +| `type` | Has `message`? | Has `uuid`? | Purpose | +|-------------------------|----------------|-------------|---------| +| `user` | yes | yes | User prompt, slash command, or a synthesized user entry carrying tool results. | +| `assistant` | yes | yes | Claude's response. Content is almost always an array of parts. | +| `system` | varies | yes | Metadata entries. Discriminated further by `subtype`. | +| `attachment` | no | yes | Tool-availability delta (e.g. a deferred tool being loaded). | +| `file-history-snapshot` | no | no | File-state snapshot for undo/rollback. | +| `permission-mode` | no | no | Records the active permission mode. | +| `queue-operation` | no | no | Enqueue/dequeue of a typed-ahead message while an assistant turn is in flight. | +| `last-prompt` | no | no | Cached last user prompt. | +| `summary` | no | no | Conversation summary. May live in a different JSONL file than the conversation it describes. | +| `compact_boundary` | no (usually) | yes | Marks an autocompaction event. Also appears as `system.subtype` in some versions. | +| `progress` | no | yes | Long-running tool progress event. Should be skipped when reconstructing a transcript. | + +--- + +## `user` + +User prompts and synthesized user-role entries. More variety than the +name suggests. Several subclassifications exist: + +### Direct user prompt + +`message.content` is a bare string. No distinguishing field beyond that. + +```json +{"type": "user", "message": {"role": "user", "content": "what does this file do?"}, ...} +``` + +### Slash command + +`message.content` is a string containing XML-ish tags: + +``` +jevan +/jevan +please enumerate… +``` + +No separate `type` — the tags inside the content are the discriminator. + +### Tool result carrier + +The user entry that follows an assistant `tool_use`. `message.content` +is an array containing one or more `tool_result` blocks (see +[tools.md](tools.md)). Additional envelope fields: + +- `toolUseResult` — top-level structured summary. +- `sourceToolAssistantUUID` — points at the assistant entry that issued + the tool call. + +The human did not type this turn; a consumer rendering the transcript +should typically fold it into the preceding assistant turn rather than +display it. + +### Command output injection + +Output from a local command (e.g. `!ls`) is injected back as a user +entry. Distinguishable by the content format; exact shape varies. + +### Hook result + +Output from `UserPromptSubmit` and similar hooks is injected as a user +entry. + +### System caveat + +An internal system-inserted user-role note. Rare; discriminator is a +subfield inside the content, not a top-level type. + +### Classifying a user entry + +The envelope alone does not tell you which subclass a `user` entry is — +you have to look at a handful of fields together. Decision tree: + +```python +def classify_user(entry): + # 1. Tool-result carrier: synthesized, not human-typed. + if entry.get("toolUseResult") is not None: + return "tool_result_carrier" + if entry.get("sourceToolAssistantUUID") is not None: + return "tool_result_carrier" + + msg = entry.get("message") or {} + content = msg.get("content") + + # Array-form content with only tool_result parts: same class. + if isinstance(content, list) and content and all( + (p.get("type") == "tool_result") for p in content + ): + return "tool_result_carrier" + + # 2. Compaction summary: synthetic, flagged explicitly. + if entry.get("isCompactSummary") is True: + return "compact_summary" + + # 3. Slash command: string content with the XML-ish command tags. + if isinstance(content, str) and "" in content: + return "slash_command" + + # 4. Command output injection: string content with the local-command-output tags. + if isinstance(content, str) and ( + "" in content or + "" in content + ): + return "command_output" + + # 5. Hook result: string content wrapped in UserPromptSubmit-style tags. + if isinstance(content, str) and "" in content: + return "hook_result" + + # 6. System caveat: string content in -class tags. + if isinstance(content, str) and "" in content: + return "system_caveat" + + # 7. Otherwise: a direct user prompt. + return "direct_prompt" +``` + +The tags embedded in content strings are the most reliable way to +distinguish synthesized-user entries from real ones. Order matters — +check `toolUseResult` / `sourceToolAssistantUUID` first, then +`isCompactSummary`, then fall through to content inspection. + +--- + +## `assistant` + +Claude's response. `message.content` is almost always an array of parts +(text, thinking, tool_use). See [messages.md](messages.md) for part +types. + +Envelope specifics: +- Carries `requestId` (Anthropic API request ID). +- Carries `message.model`, `message.id`, `message.usage`, + `message.stop_reason` (though `stop_reason` is frequently `null` on + disk, even for completed turns — see + [known-issues.md](known-issues.md)). + +A single logical assistant turn can span **multiple `assistant` entries** +— Claude Code splits thinking, text, and tool_use into separate entries +within a turn. You cannot assume one `assistant` entry = one turn. + +--- + +## `system` + +Metadata entries. Discriminated by `subtype`. Treat `subtype` as an +open enumeration; new values appear across versions. + +### `subtype: "turn_duration"` + +Emitted after an assistant turn completes. Carries `durationMs` and +`messageCount`. Useful as an authoritative "turn ended" signal. + +```json +{ + "type": "system", + "subtype": "turn_duration", + "durationMs": 57560, + "messageCount": 24, + ... +} +``` + +### `subtype: "compact_boundary"` + +Marks an autocompaction. Carries `compactMetadata` (`trigger`, `preTokens`) +and `logicalParentUuid`. See [session-chains.md §Compaction](session-chains.md#compaction--compact_boundary). + +In some versions this appears as a top-level `type: "compact_boundary"` +rather than `type: "system"` with this subtype; treat them as the same +concept. + +### `subtype: "stop_hook_summary"` + +Emitted by `Stop` hooks. Carries `hookCount`, `hookInfos`, `hookErrors`, +`preventedContinuation`, `stopReason`, `level: "suggestion"`. + +### `subtype: "task_started"` / `"task_progress"` / `"task_notification"` + +Task-tool lifecycle events. + +### Other subtypes + +Other values (e.g. `"init"`) have been observed. Log unknowns rather +than reject. + +--- + +## `attachment` + +Records a change in the available tool set mid-session. The primary +case is `type: "deferred_tools_delta"`, emitted when a deferred tool +is loaded into the active set. + +```json +{ + "type": "attachment", + "attachment": { + "type": "deferred_tools_delta", + "addedNames": ["WebFetch", "WebSearch"], + "addedLines": ["WebFetch", "WebSearch"], + "removedNames": [] + }, + "uuid": "...", + "parentUuid": "...", + ... +} +``` + +--- + +## `file-history-snapshot` + +Captures file state at a given point, keyed by `messageId`: + +```json +{ + "type": "file-history-snapshot", + "messageId": "67602940-a209-437d-a791-72bf4c09c0ea", + "snapshot": { + "messageId": "67602940-a209-437d-a791-72bf4c09c0ea", + "trackedFileBackups": {}, + "timestamp": "2026-04-02T13:59:26.313Z" + }, + "isSnapshotUpdate": false +} +``` + +`trackedFileBackups` is usually empty in observed data; when populated, +it maps file paths to references into the content-addressed backups +under `~/.claude/file-history/`. These support Claude Code's undo +machinery. + +Note: this entry type has **no `uuid`** — the `messageId` it carries +references a different entry's `uuid`. On resume there's a known bug +where these `messageId`s collide with real message `uuid`s; see +[known-issues.md](known-issues.md). + +--- + +## `permission-mode` + +Records the active permission mode. Strict three-field shape: + +```json +{"type": "permission-mode", "permissionMode": "default", "sessionId": "..."} +``` + +Permission-mode values observed: `"default"`, `"acceptEdits"`, `"plan"`, +`"bypassPermissions"`. + +**Strictness note:** adding any other fields to this entry (including +`uuid: ""`, `isSidechain: false`, or a trailing `parentUuid`) causes +Claude Code's loader to reject it. See +[writing-compatible-jsonl.md](writing-compatible-jsonl.md). + +--- + +## `queue-operation` + +Records typed-ahead message queueing during an assistant turn. + +```json +{ + "type": "queue-operation", + "operation": "enqueue", + "content": "Probably phrase them as \"virtual artifacts\"?", + "timestamp": "2026-04-16T19:33:16.814Z", + "sessionId": "..." +} +``` + +`operation` values: `"enqueue"` (user queued a message), `"dequeue"` +(Claude Code consumed it). Note the top-level `content` field — +distinct from `message.content`. + +--- + +## `last-prompt` + +Cached last user prompt, for resume / history purposes: + +```json +{"type": "last-prompt", "lastPrompt": "the repo is…", "sessionId": "..."} +``` + +--- + +## `summary` + +Conversation summary entries. Minimal shape: + +```json +{"type": "summary", "summary": "...", "leafUuid": "..."} +``` + +Summaries are generated asynchronously relative to the conversation +they describe, and a summary may appear **in a different JSONL file** +than the session it summarizes. `leafUuid` is the UUID of the message +being summarized. + +A parser that wants to associate summaries with their conversations +must cross-match by `leafUuid` across all files in the project +directory. + +--- + +## `compact_boundary` + +Marks an autocompaction. May appear either as a top-level `type` or as +`type: "system"` with `subtype: "compact_boundary"` — treat them as +equivalent. + +```jsonc +{ + "type": "compact_boundary", + "uuid": "...", + "parentUuid": null, // always null + "logicalParentUuid": "...", // the real prior message UUID + "compactMetadata": { + "trigger": "auto", // or "manual" + "preTokens": 180000 + }, + ... +} +``` + +Immediately followed by a synthetic `user`-role message with +`isCompactSummary: true` and `isVisibleInTranscriptOnly: true` +carrying the compacted summary as its content. See +[session-chains.md](session-chains.md) for how this interacts with +file rotation. + +--- + +## `progress` + +Long-running tool progress events. Emitted by tools that stream +intermediate output. Should be **skipped** when reconstructing a +conversation transcript — they represent incomplete state that is +superseded by the eventual `tool_result`. + +--- + +## Sidechains + +Sidechains aren't an entry type — they're a property (`isSidechain: true`) +that applies to `user`, `assistant`, and other entries alike. A +sidechain is a conversation thread spawned by the `Task` tool (a +subagent) or by the `/btw` slash command (aside questions). + +### Representation + +Sidechain entries carry: +- `isSidechain: true` on every entry in the thread. +- `agentId` — short hash identifying the subagent, e.g. `"a7bf2fd"`. +- The thread root has `parentUuid: null` and its `message.content` is + the Task tool's input prompt (for Task-spawned agents). + +Two layouts exist depending on Claude Code version: + +1. **Inline** (newer, 2.1.x+) — sidechain entries are written into the + same JSONL as the parent session, distinguished only by + `isSidechain: true`. +2. **Separate file** (older) — subagents got their own + `/subagents/agent-.jsonl` files, with matching + `.meta.json` sidecars. + +### Usage accounting + +Sidechain token usage does **not** count against the parent +conversation's context. Cache-read tokens from sidechains often +"mirror" the parent because the prompt cache is shared. + +### Parent linkage + +Sidechain entries carry their own `parentUuid` chain within the +sidechain thread. The link *to* the parent conversation is implicit — +established by the `Task` tool invocation in the parent, which carries +the `agentId` in its result. + +--- + +## Type discrimination pseudocode + +```python +def classify(entry): + t = entry["type"] + if t == "system": + return ("system", entry.get("subtype", "unknown")) + if t == "user": + return ("user", classify_user(entry)) # see §Classifying a user entry + return (t, None) +``` diff --git a/docs/agents/formats/claude-code/format-changelog.md b/docs/agents/formats/claude-code/format-changelog.md new file mode 100644 index 0000000..78e27ed --- /dev/null +++ b/docs/agents/formats/claude-code/format-changelog.md @@ -0,0 +1,164 @@ +# Format changelog + +A version-keyed record of field and behavior changes we've seen in the +Claude Code on-disk format. This is compiled from samples on disk and +from reading upstream code; it is **not** a changelog Anthropic +publishes, and the exact patch version where something landed is +usually a best guess. + +## How to read this + +- **"Observed across 2.1.37 – 2.1.112"** means every sampled version in + that range contained it. Versions outside that range are not first-hand. +- **"2.1.x+ (origin unclear)"** means we saw it in 2.1.x samples but + don't know which patch introduced it. +- **"2.0.x+"** means we believe from upstream code / changelogs that + the field was present in 2.0.x, even though we don't have first-hand + samples from that era. +- **"Pre-sample era"** entries describe the pre-2.0 layout from code + reading and upstream references, not from files we've inspected. + +When precision matters, treat a version here as an upper bound ("no +later than") unless the note says otherwise. + +## Format-revision stamp + +This reference tracks Claude Code **2.1.x**. First-hand samples span +client versions 2.1.37, 2.1.90, 2.1.110, and 2.1.112. Reference +revision: **2026-04-23**. + +--- + +## Claude Code 2.1.x + +### Envelope + +| Field / behavior | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| `type`, `uuid`, `timestamp`, `sessionId`, `parentUuid` | Observed across 2.1.37 – 2.1.112 | +| `cwd`, `gitBranch`, `version`, `userType`, `entrypoint`| Observed across 2.1.37 – 2.1.112 | +| `requestId` on assistant entries | Observed across 2.1.37 – 2.1.112 | +| `slug` (human-readable conversation slug) | 2.1.x+ (origin unclear); persists across rotations | +| `agentId` on sidechain entries | 2.1.x+ (inline-sidechain layout) | +| `isSidechain: true` for inline sidechains | 2.1.x+; replaces the older separate-file layout | +| `thinkingMetadata` on some user entries | 2.1.x+ (origin unclear) | +| Hook-injected envelope fields (`hookCount`, `hookInfos`, `hookErrors`, `preventedContinuation`, `stopReason`, `level`) | 2.1.x+ (origin unclear) | + +See [jsonl-envelope.md](jsonl-envelope.md) for field definitions. + +### Entry types + +| Entry type | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| `user`, `assistant`, `system` | Observed across 2.1.37 – 2.1.112 | +| `file-history-snapshot` | Observed across 2.1.37 – 2.1.112 | +| `permission-mode` | Observed across 2.1.37 – 2.1.112 | +| `summary` | Observed across 2.1.37 – 2.1.112 | +| `attachment` (deferred-tool deltas) | 2.1.x+ (origin unclear) | +| `queue-operation` (typed-ahead message enqueue/dequeue)| 2.1.x+ (origin unclear) | +| `progress` (streaming tool output) | 2.1.x+ (origin unclear) | +| `last-prompt` | 2.1.x+ (origin unclear) | +| `compact_boundary` as top-level `type` | Newer variant; coexists with older `type: "system"` + `subtype: "compact_boundary"` | +| `system.subtype` values: `turn_duration`, `stop_hook_summary`, `task_started`/`task_progress`/`task_notification` | 2.1.x+ (origin unclear) | + +See [entry-types.md](entry-types.md). + +### `message.usage` subfields + +| Subfield | Since | +|--------------------------------------------------------|----------------------------------------------------| +| `input_tokens`, `output_tokens` | Always | +| `cache_creation_input_tokens`, `cache_read_input_tokens` (flat) | Always (when caching was used) | +| `cache_creation: { ephemeral_5m_input_tokens, ephemeral_1h_input_tokens }` | 2.0.x+ | +| `service_tier` | 2.0.x+ | +| `server_tool_use: { web_search_requests, web_fetch_requests }` | 2.1.x+ | +| `iterations` | 2.1.x+ | +| `inference_geo`, `speed` | 2.1.x+ (often empty on observed samples) | + +See [usage.md](usage.md). + +### Content parts + +| Part / behavior | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| `text`, `tool_use`, `tool_result` | Observed across 2.1.37 – 2.1.112 | +| `thinking` with `signature` | Observed on models that support extended thinking (`claude-opus-4-6`, later `-4-7`) | +| `redacted_thinking` | Format-defined; rarely on disk in coding sessions | +| `image`, `document`, `server_tool_use`, `web_search_tool_result` | Format-defined; rarely on disk in coding sessions | + +`thinking` parts require a valid Anthropic-issued `signature`; otherwise +they are silently dropped on resume. See [messages.md](messages.md). + +### Session chains and compaction + +| Behavior | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| Bridge entry (first real entry of successor file carries the **previous** session's `sessionId`) | Observed across 2.1.37 – 2.1.112 | +| `slug` persistence across rotations | 2.1.x+ | +| Duplicate `compact_boundary` at top of successor files | 2.1.x+ (observed intermittently) | +| Inline compaction (`compact_boundary` + synthetic `user` summary with `isCompactSummary: true` / `isVisibleInTranscriptOnly: true`) | 2.1.x+ is the default | +| Rotation on autocompact to a separate `acompact-.jsonl` | Older behavior; not observed in 2.1.x | +| Inline sidechains (`isSidechain: true` + `agentId` in the main file) | 2.1.x+ default | +| Separate-file sidechains under `subagents/agent-.jsonl` | Older behavior; compatibility path only | +| `compactMetadata.trigger`: `"auto"` / `"manual"` | Observed in 2.1.x | +| `compactMetadata.preTokens` | Observed in 2.1.x | + +See [session-chains.md](session-chains.md). + +### Peripheral files + +| Path | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| `~/.claude/todos/` | **Marked legacy**; current versions inline TodoWrite state into `tool_result` entries | +| `~/.claude/history.jsonl` | Observed across 2.1.37 – 2.1.112; Unix-millis timestamps (distinct from session JSONL ISO-8601) | +| `~/.claude/history.jsonl` per-entry `sessionId` | Absent on older entries; present on newer ones | +| `~/.claude/sessions/sessions-index.json` | Observed in some versions; can be stale or missing | +| `~/.claude/file-history//@v` | Observed across 2.1.37 – 2.1.112 | +| `~/.claude/shell-snapshots/snapshot---.sh` | Observed across 2.1.37 – 2.1.112 | + +See [peripheral-files.md](peripheral-files.md). + +### Tools + +The built-in tool set is the fastest-moving surface. We do not track it +as an enumerated changelog here — treat [tools.md §Common tool `input` +shapes](tools.md#common-tool-input-shapes) as illustrative, and check +Anthropic's tool documentation for the authoritative current list. + +What *is* stable enough to version-track: + +| Behavior | Status in 2.1.x | +|--------------------------------------------------------|----------------------------------------------------| +| `tool_use` / `tool_result` two-entry pairing | Observed across 2.1.37 – 2.1.112 | +| `sourceToolAssistantUUID` back-reference on tool-result carrier | Observed across 2.1.37 – 2.1.112 | +| Top-level `toolUseResult` as sibling of `message` | Observed across 2.1.37 – 2.1.112 for structured-output tools | +| Parallel tool calls (multiple `tool_use` in one assistant entry) | Observed across 2.1.37 – 2.1.112 | +| Very-large-output spill to `projects///tool-results/` | Documented behavior; not observed first-hand in our samples | + +--- + +## Pre-sample era (before 2.1.37) + +Drawn from upstream code reading and adjacent tooling, not from files +we've inspected directly. Treat as orientation, not as verified fact. + +- **Subagents stored per-file** under + `projects//subagents/agent-.jsonl`, with a matching + `.meta.json` sidecar, and autocompacted subagents under + `agent-acompact-.jsonl`. +- **Autocompaction rotated to a new file** named `acompact-.jsonl` + rather than recording an inline `compact_boundary`. +- **`message.usage`** was flatter — no `cache_creation` TTL breakdown, + no `service_tier`, no `server_tool_use`, no `iterations`. +- **`history.jsonl`** entries did not always carry `sessionId`. + +--- + +## Process + +When you add a field, entry type, or behavior note to any other doc in +this directory, add a corresponding row here in the same change. Cite +the version where you can; cite "2.1.x+ (origin unclear)" when you +can't. Bump the format-revision stamp at the top of this file *and* +the matching stamp in [README.md](README.md) whenever you update the +changelog. diff --git a/docs/agents/formats/claude-code/jsonl-envelope.md b/docs/agents/formats/claude-code/jsonl-envelope.md new file mode 100644 index 0000000..0a04996 --- /dev/null +++ b/docs/agents/formats/claude-code/jsonl-envelope.md @@ -0,0 +1,150 @@ +# Session JSONL: the line envelope + +A session JSONL is a file where each line is a standalone JSON object. +No framing, no leading sentinel, no trailer. Lines are terminated with +`\n`; blank lines occur but should be ignored. + +This document covers the **envelope** — the top-level fields that wrap +each line. The `message` sub-object and tool-invocation details have +their own docs ([messages.md](messages.md), [tools.md](tools.md)). + +## Field conventions + +- **Envelope keys are camelCase** (`parentUuid`, `sessionId`, + `isSidechain`). +- **Keys inside `message` that map to the Anthropic API are snake_case** + (`stop_reason`, `input_tokens`). Some older samples use camelCase + aliases for API-side keys; a tolerant parser accepts both. +- **Field presence depends on entry type.** There is no single union + that applies to every line; `type` is the discriminant. See + [entry-types.md](entry-types.md). + +## Complete field catalogue + +Every envelope field we have observed, in rough order of prominence: + +### Identity / position + +| Field | Shape | Notes | +|----------------|----------------------------|-------| +| `type` | string | Discriminant. Values in [entry-types.md](entry-types.md). Present on **every** line. | +| `uuid` | UUIDv4 string | Per-entry ID. Empty string (`""`) or absent on some metadata entries (`permission-mode`, `queue-operation`, `last-prompt`, `file-history-snapshot`). | +| `timestamp` | ISO-8601 string | e.g. `"2026-04-02T13:59:26.313Z"`. Millisecond precision. Absent on pure-metadata entries. | +| `sessionId` | UUIDv4 string | Usually equals the filename stem. On continuation files, the first real entry carries the **previous** session's ID — see [session-chains.md](session-chains.md). | +| `parentUuid` | UUIDv4 string \| `null` | Prior entry in the conversation DAG. `null` for the first entry of a session and for `compact_boundary` entries (which use `logicalParentUuid` instead). | + +### Context at time of entry + +| Field | Shape | Notes | +|----------------|-------------------|-------| +| `cwd` | absolute path | Working directory when the entry was written. Will match the pre-sanitization path used in `projects/`. | +| `gitBranch` | string | Current branch. **Empty string** (not `null`, not absent) when the cwd isn't a git repo. | +| `version` | string | Claude Code client version, e.g. `"2.1.90"`. Observed values include `2.1.37`, `2.1.90`, `2.1.110`, `2.1.112`. Absent on some metadata entries. | +| `userType` | string | Almost always `"external"`. Other values (e.g. `"ant"` for Anthropic-internal) have been reported but are not common. | +| `entrypoint` | string | How the session was launched: `"cli"` is typical. Other values (e.g. `"claude-desktop"`) exist. | + +### API correlation + +| Field | Shape | Appears on | Notes | +|--------------|--------------|------------|-------| +| `requestId` | string | `assistant` | Anthropic API request ID, e.g. `"req_011CZf4hD7fj…"`. Useful for deduping streamed messages. | +| `messageId` | UUID string | `file-history-snapshot` (and as a reference from other entries) | The Anthropic API message ID, *distinct* from the entry's own `uuid`. Known collision bug on resume (see [known-issues.md](known-issues.md)). | + +### Conversation threading + +| Field | Shape | Notes | +|-------------------|-------------|-------| +| `isSidechain` | bool | `true` for entries belonging to a Task-tool-spawned subagent. See [entry-types.md §Sidechains](entry-types.md#sidechains). | +| `agentId` | short hex string | Identifies a subagent thread. Present on sidechain entries. Format is a short hash (e.g. `"a7bf2fd"`). | +| `slug` | string | Human-readable conversation slug (e.g. `"crystalline-giggling-sunset"`). **Persists across rotations**, making it one of the more reliable signals for linking continuation files. | + +### Tool mechanics + +| Field | Shape | Appears on | Notes | +|----------------------------|---------|------------|-------| +| `toolUseResult` | object | `user` entries carrying `tool_result` blocks | Top-level structured summary of the tool's output. Sibling of `message`, not nested. See [tools.md](tools.md). | +| `sourceToolAssistantUUID` | UUIDv4 | `user` entries carrying `tool_result` blocks | `uuid` of the assistant entry that issued the matching `tool_use`. | + +### Entry-type-specific fields + +| Field | Shape | Appears on | Notes | +|----------------------|---------|------------|-------| +| `message` | object | `user`, `assistant` | Conversation payload. See [messages.md](messages.md). | +| `snapshot` | object | `file-history-snapshot` | `{messageId, trackedFileBackups, timestamp}`. | +| `isSnapshotUpdate` | bool | `file-history-snapshot` | Whether this snapshot updates a prior one. | +| `permissionMode` | string | `permission-mode` | `"default"`, `"acceptEdits"`, `"plan"`, `"bypassPermissions"`. | +| `attachment` | object | `attachment` | Delta to available tools, e.g. `{type: "deferred_tools_delta", addedNames: [...], addedLines: [...], removedNames: [...]}`. | +| `operation` | string | `queue-operation` | `"enqueue"` / `"dequeue"`. | +| `content` | string | `queue-operation` | Queued message text. **Conflicts in name with `message.content`** — distinguish by entry `type`. | +| `lastPrompt` | string | `last-prompt` | Cached last user prompt. | +| `subtype` | string | `system`, `compact_boundary` | Discriminant within metadata entries. See [entry-types.md](entry-types.md). | +| `durationMs` | number | `system` (turn_duration) | Milliseconds the assistant turn took. | +| `messageCount` | number | `system` (turn_duration) | Number of messages in the turn. | +| `isMeta` | bool | various | API-visibility flag. `true` marks entries the loader should hide from the transcript sent back to the API. | +| `isCompactSummary` | bool | synthetic user message after `compact_boundary` | Always paired with `isVisibleInTranscriptOnly: true`. | +| `isVisibleInTranscriptOnly` | bool | see above | Entry is visible in the UI but not replayed to the model. | +| `logicalParentUuid` | UUID | `compact_boundary` | Points at the pre-compact last message. `parentUuid` is `null` on these. | +| `compactMetadata` | object | `compact_boundary` | `{trigger: "auto"|"manual", preTokens: number}`. | +| `thinkingMetadata` | object | some user entries | `{level, disabled, triggers[]}`. Indicates extended-thinking configuration. | + +### Hook-injected fields + +When a hook contributes to an entry (e.g. `Stop` hook writing a summary +`system` entry), additional fields appear: + +| Field | Shape | Notes | +|--------------------------|-----------------|-------| +| `hookCount` | number | How many hooks ran for this event. | +| `hookInfos` | array | Per-hook info objects. | +| `hookErrors` | array | Per-hook error objects. | +| `preventedContinuation` | bool | Whether hook output blocked Claude from continuing. | +| `stopReason` | string | Hook-provided reason (distinct from `message.stop_reason`). | +| `level` | string | Severity, e.g. `"suggestion"`, `"info"`. | + +## Unknown fields + +New fields appear across versions. A parser should flatten unknown keys +into an `extra` map rather than reject the line. + +## Parser surface vs. format surface + +This doc catalogues fields at the *format* level — every key we have +observed on an envelope. Our parser doesn't type all of them equally. +`toolpath-claude`'s `ConversationEntry` +([`crates/toolpath-claude/src/types.rs`](../../../../crates/toolpath-claude/src/types.rs)) +promotes the fields a typical consumer needs into named Rust fields and +lets everything else land in a `#[serde(flatten)] extra: HashMap`. + +| Field name (JSON) | Where it lives in `ConversationEntry` | +|----------------------------|---------------------------------------| +| `type` | `entry_type: String` | +| `uuid` | `uuid: String` | +| `timestamp` | `timestamp: String` | +| `sessionId` | `session_id: Option` | +| `parentUuid` | `parent_uuid: Option` | +| `cwd` | `cwd: Option` | +| `gitBranch` | `git_branch: Option` | +| `version` | `version: Option` | +| `userType` | `user_type: Option` | +| `isSidechain` | `is_sidechain: bool` (default false) | +| `message` | `message: Option` | +| `requestId` | `request_id: Option` | +| `toolUseResult` | `tool_use_result: Option` | +| `snapshot` | `snapshot: Option` | +| `messageId` | `message_id: Option` | +| *everything else* | `extra: HashMap` | + +Concretely, fields like `slug`, `agentId`, `entrypoint`, +`sourceToolAssistantUUID`, `permissionMode`, `attachment`, `operation`, +`content` (on `queue-operation`), `lastPrompt`, `subtype`, `durationMs`, +`messageCount`, `isMeta`, `isCompactSummary`, +`isVisibleInTranscriptOnly`, `logicalParentUuid`, `compactMetadata`, +`thinkingMetadata`, and the hook-injected envelope fields +(`hookCount`, `hookInfos`, `hookErrors`, `preventedContinuation`, +`stopReason`, `level`) are all documented here but accessed via +`entry.extra[""]` in code. This is deliberate — it keeps the +typed surface small while round-tripping arbitrary format drift. + +If you're building a new consumer in another language, the same split +tends to work: a core struct for the hot fields, a map for the rest. diff --git a/docs/agents/formats/claude-code/known-issues.md b/docs/agents/formats/claude-code/known-issues.md new file mode 100644 index 0000000..bc12f86 --- /dev/null +++ b/docs/agents/formats/claude-code/known-issues.md @@ -0,0 +1,155 @@ +# Known issues and corruption modes + +A parser that wants to be robust against real-world Claude Code output +needs to defend against format-level bugs, races, and drift. This is +the list we've built up from upstream issue trackers, our own testing, +and adjacent tools. + +## Format bugs Anthropic has shipped + +### Dangling `parentUuid` references + +An entry's `parentUuid` may reference a UUID that doesn't exist +anywhere in the file. Observed intermittently; causes DAG reconstruction +to fail if you assume every `parentUuid` resolves. + +**Defense:** during traversal, treat missing parents as roots rather +than erroring out. + +### `file-history-snapshot.messageId` collisions on resume + +On resume, `file-history-snapshot` entries reuse the `uuid` of prior +message entries as their `messageId`. This means if you index entries +by UUID, the snapshot collides with the real message. + +**Impact:** up to 25% of entries unreachable via `parentUuid` +traversal in heavily-resumed sessions. + +**Defense:** don't use `messageId` as a unique key across entry types. +Key by `(type, uuid)` or by `(type, messageId)` specifically for +snapshots. + +### Missing `stop_reason` on assistant entries + +Assistant entries are frequently persisted with `stop_reason: null` +even for turns that completed normally — the entry is written before +the streaming API response finalizes. + +**Defense:** don't treat `stop_reason == null` as "in-flight". Infer +`"end_turn"` for assistant entries missing it, unless you have direct +evidence the turn was interrupted. + +### Autocompact can destroy in-progress work + +Under some conditions an autocompact pass can replace in-progress +scratch work with a summary, effectively losing it. This is a Claude +Code bug, not a format bug — but if you're archiving sessions for +replay, a post-compact file may be less useful than the pre-compact +snapshot. + +**Defense:** if continuity matters, checkpoint before compaction. + +## Race conditions + +### Multi-terminal writes to the same project + +Multiple Claude Code instances launched in the same cwd may write to +**the same project directory** at overlapping times, producing +interleaved turns across session files or (more rarely) interleaved +lines within a single file. + +**Defense:** a session reader should gate discovery by `created_at` +timestamps or by content-match against known user input, so that one +instance doesn't accidentally pick up another's session. + +### `sessions-index.json` drift + +The optional `~/.claude/sessions/sessions-index.json` can be missing, +stale, or inconsistent with the files actually on disk. + +**Defense:** treat the index as a cache; rescan the filesystem when +precision matters. + +## Format-level invisibles + +These aren't bugs, but they trip up consumers who expect the JSONL to +be a complete record of what happened. + +### Permission prompts + +When Claude Code gates a tool call behind a permission prompt, the +prompt itself and the user's accept/deny decision leave **no JSONL +entry**. The only observable signal is a gap in timestamps between +the `tool_use` and the corresponding `tool_result`. + +**Defense:** if you're trying to distinguish "Claude took 5 seconds to +run Bash" from "user took 5 seconds deciding to allow Bash," you can't. +Treat timestamp gaps around tool results with uncertainty. + +### Mid-turn text-only assistant entries + +A single logical assistant turn can span multiple `assistant` JSONL +entries — thinking, text, and `tool_use` can each be separate entries. +Text-only assistant entries can appear **between** tool-call batches, +not only at end-of-turn. + +**Defense:** don't treat "assistant emitted text" as "turn is over." +Use `stop_reason: "end_turn"` or a following `system/turn_duration` +entry as the authoritative end-of-turn signal. + +### Agent-progress nesting variance + +Some agent-progress entries wrap the message double-nested +(`data.message.message` instead of `data.message`) depending on +version. If you're parsing Task-tool progress events, try both paths. + +## Drift between versions + +The format is stable at the coarse level but drifts in details. +Patterns: + +- **New optional fields** appear additively. Unknown envelope fields + should be preserved through reads/writes. +- **Entry-type enumerations** grow. New `system.subtype` values and + new top-level `type` values appear without notice. +- **Tool input schemas** change when Anthropic ships built-in tool + updates. Preserve unknown input keys. +- **`usage` subfields** (notably `server_tool_use`, `iterations`, + `cache_creation`) were added in 2.0.x → 2.1.x. Older files lack + them. + +**Defense:** always flatten unknown fields into a catch-all rather +than rejecting. Version-gate any behavior that assumes a specific +subfield is present. + +## Corruption modes to expect + +- **Truncated last line.** If Claude Code was killed mid-write, the + final line of a session JSONL may be partial or invalid JSON. + Expect an unterminated-string or unexpected-EOF parse error on + the tail. +- **Empty lines.** Blank lines appear mid-file in some samples. Skip + them rather than error. +- **Whitespace-only lines.** Same handling as empty lines. +- **Large output overflow.** Very large tool outputs (pages of text) + may be spilled to `projects///tool-results/` + rather than inlined. A parser that assumes `tool_result.content` is + always inline may return stub content instead of the real output. +- **Cycles in the parent DAG.** Rare, but possible with corrupted + data. Traverse defensively. + +## Things that look like bugs but aren't + +- **`gitBranch: ""`** — empty string, not null, is the correct + representation when cwd isn't a git repo. +- **`stop_reason: null`** — see above. Not a crash; a timing artifact. +- **Multiple assistant entries per turn** — intentional. Thinking, + text, and tool calls get their own entries. +- **Tool-result-only user entries** — intentional. These are synthesized, + not user input. Fold them into the preceding assistant turn when + rendering. +- **First entry's `sessionId` != filename stem** — intentional. Bridge + entry marking a session continuation. See + [session-chains.md](session-chains.md). +- **`compact_boundary.parentUuid: null`** — intentional. The real + prior message is in `logicalParentUuid`. diff --git a/docs/agents/formats/claude-code/messages.md b/docs/agents/formats/claude-code/messages.md new file mode 100644 index 0000000..e8c2f8e --- /dev/null +++ b/docs/agents/formats/claude-code/messages.md @@ -0,0 +1,179 @@ +# The `message` object + +`user` and `assistant` entries carry a `message` object containing the +actual conversation payload. Metadata entries (`system`, `attachment`, +`permission-mode`, etc.) do not. + +## Shape + +```jsonc +{ + "role": "user" | "assistant" | "system", + "content": string | ContentPart[], + "model": "claude-opus-4-6", // assistant only + "id": "msg_01P33i…", // assistant only — Anthropic API message ID + "type": "message", // assistant only — always the literal "message" + "stop_reason": "end_turn" | "tool_use" | "stop_sequence" | null, + "stop_sequence": string | null, + "usage": Usage // assistant only — see usage.md +} +``` + +Both `stop_reason` and `stop_sequence` can be `null`. `stop_reason` is +frequently `null` on disk even for turns that completed normally, +because the entry is persisted before the streaming API response +finalizes — see [known-issues.md](known-issues.md). + +## The string-or-array `content` trap + +`message.content` is **either a bare string or an array of content +parts**, depending on what the entry carries. A parser must handle both +shapes. + +```jsonc +// Bare string — typical for simple user prompts +{"role": "user", "content": "what does this file do?"} + +// Array of parts — assistant responses, tool results, slash commands, paste +{"role": "assistant", "content": [{"type": "text", "text": "Done."}]} +``` + +Empirically: +- **User entries** use both shapes. Direct prompts are strings; tool + result carriers are arrays. +- **Assistant entries** use arrays virtually always. Even a + plain-text response is wrapped as `[{"type": "text", "text": "…"}]`. + The Claude Code *loader* relies on this — see + [writing-compatible-jsonl.md](writing-compatible-jsonl.md). + +## Content part types + +Each part has a `type` discriminant. + +### `text` + +Plain text, the common case. + +```json +{"type": "text", "text": "I'll help with that."} +``` + +### `thinking` + +Extended-thinking output. Only emitted by models that support it +(`claude-opus-4-6` and later `-4-7`). Two fields: + +```jsonc +{ + "type": "thinking", + "thinking": "…reasoning text…", + "signature": "EoYDClkIDBgCKkDVU…" // base64; ~450 chars; cryptographic proof +} +``` + +The `signature` is an Anthropic-issued cryptographic proof of the +thinking content. Thinking blocks without a valid signature are +**rejected** as prior-turn context when the session is resumed — the +API won't replay them back to the model. A tool that rewrites or +truncates thinking content will break resume. + +Empty-string `thinking` values with a valid `signature` are observed +and legal (they represent thinking content that was later redacted). + +### `tool_use` + +A tool call issued by the assistant. + +```jsonc +{ + "type": "tool_use", + "id": "toolu_01LJtcvehSnXfMztfk8b8ZLC", + "name": "Grep", + "input": { + "pattern": "crab.?city", + "-i": true, + "output_mode": "files_with_matches" + }, + "caller": {"type": "direct"} // optional; origin of the call +} +``` + +- **`id`** — Anthropic tool-use ID with the `toolu_` prefix. Primary + key linking this `tool_use` to its eventual `tool_result`. +- **`name`** — tool name. Built-in tools use PascalCase (`Read`, + `Bash`, `Grep`). MCP-provided tools are namespaced: + `mcp____`. +- **`input`** — tool-specific parameter object. See [tools.md](tools.md). +- **`caller`** — optional. `{type: "direct"}` for directly-issued + calls. The full enumeration of `caller.type` values isn't well-known. + +### `tool_result` + +The result of a prior tool call, carried by the following **user** +entry. + +```jsonc +{ + "type": "tool_result", + "tool_use_id": "toolu_01LJtcvehSnXfMztfk8b8ZLC", + "content": "Found 79 files\npackages/…", // string OR array of text parts + "is_error": false // default false +} +``` + +- **`tool_use_id`** — matches the `id` from the issuing `tool_use`. +- **`content`** — string, or an array of objects shaped + `{text: "..."}`. An array with multiple parts should be joined with + `\n` to recover the intended output. +- **`is_error`** — defaults to `false`. `true` indicates the tool + raised or returned an error. + +See [tools.md](tools.md) for the full tool-invocation lifecycle. + +### Other content types (API-level, rarely seen on disk) + +The Anthropic API defines several content-block types that Claude Code +can in principle persist but that don't commonly show up in typical +coding sessions: + +- `image` — image blocks in messages +- `document` — attached documents +- `redacted_thinking` — thinking content redacted by Anthropic +- `server_tool_use` — server-executed tool calls (e.g. built-in web search) +- `web_search_tool_result` — results from server-side web search + +A tolerant parser preserves unknown `type` values rather than failing. + +## Role values + +Lowercase: `"user"`, `"assistant"`, `"system"`. The `role` inside +`message` is distinct from the envelope `type` — they agree in every +sample we've inspected (`type: "user"` ↔ `role: "user"`, +`type: "assistant"` ↔ `role: "assistant"`), but they are separate +fields in the format. Key type discrimination off the envelope `type`; +`role` carries redundant information. + +## Convenience: detecting "empty" user turns + +A user entry that carries only `tool_result` parts (no text, no image, +no other user input) is a synthesized tool-result carrier, not a real +user turn. Detect it with: + +```python +def is_tool_result_only(entry): + msg = entry.get("message") + if not msg or msg.get("role") != "user": + return False + content = msg.get("content") + if isinstance(content, str): + return False # bare string is always a real user turn + # content is an array + for part in content: + t = part.get("type") + if t != "tool_result": + return False + return bool(content) +``` + +UIs rendering a transcript typically fold these into the preceding +assistant turn. See [tools.md](tools.md). diff --git a/docs/agents/formats/claude-code/peripheral-files.md b/docs/agents/formats/claude-code/peripheral-files.md new file mode 100644 index 0000000..e0a1681 --- /dev/null +++ b/docs/agents/formats/claude-code/peripheral-files.md @@ -0,0 +1,185 @@ +# Peripheral files + +Everything under `~/.claude/` that isn't a session JSONL. Most of these +are supporting state that a conversation reader can ignore, but they +matter if you're building anything broader than a transcript viewer. + +## `~/.claude/history.jsonl` + +Global user-prompt history, across all projects. JSONL, same line +framing as sessions. + +### Line shape + +```json +{ + "display": "The following fails with an error: bazel build --compilation_mode opt //packages/galaxy/...", + "pastedContents": {}, + "timestamp": 1759167114955, + "project": "/Users/alex/Devel/empathic/empathic", + "sessionId": "..." +} +``` + +Fields: +- **`display`** — user's prompt text as shown in the UI. +- **`pastedContents`** — object mapping paste-slot IDs to pasted + content (e.g. files the user dragged in). Usually empty. +- **`timestamp`** — Unix milliseconds. Note: JSONL sessions use + ISO-8601; `history.jsonl` uses epoch ms. +- **`project`** — absolute workspace path (the pre-sanitization form). +- **`sessionId`** — session the prompt was submitted in. Absent in + older entries; treat as optional. + +Unlike session JSONLs, `history.jsonl` is not subject to the 30-day +cleanup — prompts are kept indefinitely. Users can opt out with +`CLAUDE_CODE_SKIP_PROMPT_HISTORY=1`. + +## `~/.claude/settings.json` + +User-level config. JSON, not JSONL. Holds permissions, hooks, +attribution settings, `statusLine` config, env vars. Project-local +overrides live at `/.claude/settings.json` and +`.claude/settings.local.json`. + +Not covered in detail here — see Claude Code's own settings docs when +writing a tool that consumes this file. + +## `~/.claude.json` + +(Note: file in `$HOME`, not under `~/.claude/`.) A single JSON file +holding: + +- OAuth tokens and account state. +- Per-project state: `mcpServers`, `allowedTools`, + `hasTrustDialogAccepted`, `lastSessionId`, `projectOnboardingSeen`. +- Cross-session preferences. + +Backed up automatically to `~/.claude/backups/.claude.json.backup.` +on changes. + +## `~/.claude/todos/` + +Per-session TodoWrite state. **Marked legacy** by Anthropic — current +Claude Code versions no longer write here; TodoWrite state is inlined +into the session's tool-result entries instead. Safe to delete. + +Filename convention: `-agent-.json`. + +Content: a JSON array of todo objects. + +```json +[ + {"content": "Design a combined stream…", "status": "completed", "priority": "high", "id": "1"}, + {"content": "Wire it to handle_connection", "status": "pending", "priority": "medium", "id": "2"} +] +``` + +## `~/.claude/shell-snapshots/` + +Captured shell environment at session start. One file per snapshot. + +Filename: `snapshot---.sh`, e.g. +`snapshot-zsh-1759194893070-ab12cd.sh`. + +Content: a shell script that recreates the user's environment — +function definitions, aliases, exported variables. Claude Code sources +these when running `Bash` tool commands so they operate in the same +environment the user would see. + +## `~/.claude/file-history/` + +Content-addressed file backups supporting undo/rollback. When Claude +Code edits a file, it writes the pre-edit content here before applying +the change. + +Structure: `/@v`. + +The `trackedFileBackups` object on `file-history-snapshot` entries +references entries here by (contentHash, versionNumber). + +## `~/.claude/session-env/` + +Per-session environment-variable snapshots. Typically one empty +directory per session (the dir itself is a presence marker). + +## `~/.claude/plans/` + +Plan-mode artifacts. When a user enters plan mode, the plan drafts +Claude produces land here. + +## `~/.claude/sessions/` + +A session cache / index. In some versions, contains a +`sessions-index.json` with a global view of all sessions across +projects. Fields per entry: + +``` +{version: 1, sessions: [ + {sessionId, fullPath, fileMtime, firstPrompt, messageCount, + created, modified, gitBranch, projectPath, isSidechain} +]} +``` + +Can be stale or missing. Treat it as a hint, not ground truth — rescan +`projects/` if precision matters. + +## `~/.claude/statsig/` + +Anthropic's feature-flag cache. Fixed filenames: + +- `statsig.cached.evaluations.` +- `statsig.session_id.` +- `statsig.stable_id.` + +Not useful to external tools. + +## `~/.claude/plugins/` + +Installed plugins directory. `plugins/blocklist.json` lists plugins +Anthropic has flagged: + +```json +{ + "fetchedAt": "...", + "plugins": [{"plugin": "...", "added_at": "...", "reason": "...", "text": "..."}] +} +``` + +## `~/.claude/debug/` + +Per-session debug logs. Text files, one per session. Verbose logging +of what Claude Code did internally. Useful for investigating bugs. + +## `~/.claude/paste-cache/` + +Clipboard / paste history. Attachments pasted into Claude Code land +here. Referenced by paste-slot IDs inside `history.jsonl`'s +`pastedContents`. + +## `~/.claude/image-cache/` + +Cached images from pasted screenshots, URLs, etc. + +## `~/.claude/ide/` + +Per-IDE lock files. When Claude Code runs in an IDE extension (VS Code, +JetBrains), each active IDE process creates a `.lock` file here to +coordinate. + +## `~/.claude/backups/`, `~/.claude/cache/`, `~/.claude/chrome/` + +Internal state directories, generally not interesting to external +tools. + +## `stats-cache.json` + +File at `~/.claude/stats-cache.json`. Cached usage statistics for the +CLI's stats commands. Full schema varies; not commonly consumed +externally. + +## Cleanup / retention + +Session files and most peripheral state are cleaned up after +`cleanupPeriodDays` days (default 30). `history.jsonl` is not on that +schedule. Plugins, settings, and OAuth state are never cleaned up. diff --git a/docs/agents/formats/claude-code/session-chains.md b/docs/agents/formats/claude-code/session-chains.md new file mode 100644 index 0000000..b14406f --- /dev/null +++ b/docs/agents/formats/claude-code/session-chains.md @@ -0,0 +1,219 @@ +# Session chains, rotation, and compaction + +A single logical Claude Code conversation can span multiple JSONL files. +Claude Code **rotates** to a new file under several conditions, and +**compacts** the in-memory conversation mid-file when the context +window fills. Reconstructing a full logical conversation means +understanding both mechanics. + +## Rotation triggers + +Claude Code begins a new JSONL file (new session UUID, new filename) in +at least these situations: + +- **Context overflow / autocompaction in some versions** — older + versions rotated to an `acompact-.jsonl` on autocompact. + Current versions usually keep compaction inline; see + [§Compaction](#compaction--compact_boundary). +- **Exiting plan mode** — leaving plan mode reliably produces a new + file. +- **Explicit resume** — `claude --resume ` continues + into a new file. +- **`--fork-session`** — explicitly fork a session into a branch. +- **Manual user action** (e.g. `/clear` followed by continuation in + some setups). + +Don't rely on any filename prefix as a rotation signal — modern +versions give successor files the same UUID-naming scheme as fresh +sessions. + +## Detecting a continuation + +When a file exists that continues a prior session, the continuation is +signalled by one or more of these markers: + +### 1. The bridge entry (primary signal) + +The **first real entry** of a continuation file carries the **previous +session's** `sessionId`, not its own filename UUID. + +``` +file: session-B.jsonl +line 1 (first real entry): { "sessionId": "session-A", ... } ← bridge +line 2+: { "sessionId": "session-B", ... } +``` + +This is the most reliable signal. Algorithm for classifying any file: + +```python +def is_continuation_file(path): + first_entry = read_first_entry(path) + file_uuid = filename_stem(path) + return first_entry.get("sessionId") != file_uuid +``` + +A file whose first entry's `sessionId` matches the filename is a +**chain head** (fresh session or the start of a multi-file chain). +A file whose first entry's `sessionId` is different is a **successor** +of that earlier sessionId. + +### 2. The `slug` field + +A conversation slug (e.g. `"crystalline-giggling-sunset"`) persists +across rotations. Files sharing the same `slug` belong to the same +logical conversation. + +### 3. Duplicate `compact_boundary` + +A continuation file may begin with a byte-for-byte identical +`compact_boundary` record copied from the parent, followed by the +synthetic compaction summary. + +### 4. `parentUuid` bridge + +The new file's first real entry's `parentUuid` points to the parent +file's last entry's `uuid`. + +## Chain resolution + +Given a set of files in a project directory, you can build the full +chain graph by: + +1. For each file, read the first real entry and compute + `(filename_uuid, first_entry_sessionId)`. +2. If `first_entry_sessionId != filename_uuid`, record an edge: + `first_entry_sessionId → filename_uuid` (predecessor → successor). +3. Walk backward from any given file to the chain **head** (no + predecessor) and forward to the chain **tail** (no successor). + +A chain head's `sessionId` equals its filename UUID; every other +file in the chain has an incoming edge. + +**Cycle defense:** don't assume the graph is always acyclic. Corrupt +data has produced self-loops and cycles in practice. Track visited +nodes during traversal. + +## Reading a logical conversation + +To present a multi-file conversation as a single stream: + +1. Read the chain in order (head → tail). +2. Concatenate entries. +3. **Skip bridge entries.** The first real entry of any non-head file + is a replay of the parent's last entry (it carries the parent's + `sessionId`) and would cause duplicates if kept. + +```python +def read_chain(chain_files): + for i, path in enumerate(chain_files): + for j, entry in enumerate(read_jsonl(path)): + is_first_real = (j == first_real_line(path)) + is_bridge = ( + is_first_real and i > 0 and + entry.get("sessionId") != filename_stem(path) + ) + if is_bridge: + continue + yield entry +``` + +Or simpler: **keep only entries whose `sessionId` matches the filename's +session ID**. This drops bridge entries and any other replayed +prefix records. + +## Compaction — `compact_boundary` + +When the context window fills, Claude Code runs an autocompaction pass +that replaces the older portion of the conversation with a summary. +The JSONL records this inline with a `compact_boundary` entry. + +### The boundary marker + +```jsonc +{ + "type": "compact_boundary", // or "system" with "subtype": "compact_boundary" + "uuid": "...", + "parentUuid": null, // always null on the boundary + "logicalParentUuid": "...", // points at the real prior message + "compactMetadata": { + "trigger": "auto", // "auto" or "manual" (user ran /compact) + "preTokens": 180000 // conversation size before compaction + }, + "sessionId": "...", + "timestamp": "..." +} +``` + +Key property: `parentUuid` is `null`, resetting the DAG. The actual +prior message is referenced via `logicalParentUuid` so UIs can still +render the pre-compact history. + +### The synthetic summary + +Immediately after the boundary, a synthetic `user`-role message carries +the compacted-history summary as its content: + +```jsonc +{ + "type": "user", + "message": { + "role": "user", + "content": "...compacted summary text..." + }, + "isCompactSummary": true, + "isVisibleInTranscriptOnly": true, + ... +} +``` + +These flags tell the loader: +- **`isCompactSummary: true`** — this is the synthesized summary, not + real user input. +- **`isVisibleInTranscriptOnly: true`** — show it in the UI transcript + but don't replay it to the model (the model sees the actual + compacted context, which was already injected on the API side). + +When rendering a transcript, skip these synthetic summary entries or +mark them specially; treating them as real user messages will confuse +consumers. + +### Compaction strategies + +Several compaction strategies exist internally: + +- **Full compaction** — the default autocompact. Everything older than + a threshold is replaced by a summary. +- **Session-memory compaction** — summary preserved in session memory + across rotations. +- **Microcompaction** — after the prompt cache's 1h TTL expires for + older tool outputs, those outputs are cleared while keeping the last + ~5 tool results. Less aggressive than full compaction. + +From a format perspective, all three produce the same +`compact_boundary` + synthetic-summary sequence; the `trigger` value +and the `preTokens` count are the discriminators. + +## Subagent / sidechain files + +Older Claude Code versions stored subagents in separate files: + +``` +projects//subagents/agent-.jsonl +projects//subagents/agent-.meta.json +projects//subagents/agent-acompact-.jsonl # compacted +``` + +Newer versions inline sidechain entries into the main session JSONL +with `isSidechain: true` and an `agentId` field. See +[entry-types.md §Sidechains](entry-types.md#sidechains). + +## Pitfalls + +- A file's `sessionId` on **non-first** entries is a less reliable + signal than the first entry's `sessionId`. Hook-injected or + cross-referenced entries sometimes carry a sessionId that matches a + different file. +- The `sessions-index.json` index sometimes gets out of sync with the + actual files on disk. Re-scan rather than trust the index blindly. +- A continuation can cross project directories if `cwd` changed + mid-session (rare but possible). diff --git a/docs/agents/formats/claude-code/tools.md b/docs/agents/formats/claude-code/tools.md new file mode 100644 index 0000000..4df026c --- /dev/null +++ b/docs/agents/formats/claude-code/tools.md @@ -0,0 +1,229 @@ +# Tool calls + +Tool calls are the most complex part of the format. A single tool +invocation produces **two JSONL entries** plus optional top-level +summary data, and the pairing has to be reconstructed by the reader. + +## The two-entry lifecycle + +A tool call is always a pair: + +1. **Assistant entry** with a `tool_use` content part in + `message.content`. This is the call. +2. **User entry** whose `message.content` is an array of `tool_result` + parts. This is the response. + +Example (abbreviated): + +```jsonc +// 1) Assistant issues the call +{ + "type": "assistant", + "uuid": "3995c068-...", + "message": { + "role": "assistant", + "content": [ + { + "type": "tool_use", + "id": "toolu_01LJtcvehSnXfMztfk8b8ZLC", + "name": "Grep", + "input": {"pattern": "TODO"} + } + ], + "stop_reason": "tool_use" + } +} + +// 2) User entry (synthesized, not human-typed) carries the result +{ + "type": "user", + "uuid": "1e5e8efb-...", + "parentUuid": "3995c068-...", + "sourceToolAssistantUUID": "3995c068-...", + "message": { + "role": "user", + "content": [ + { + "type": "tool_result", + "tool_use_id": "toolu_01LJtcvehSnXfMztfk8b8ZLC", + "content": "src/main.rs\nsrc/lib.rs\n" + } + ] + }, + "toolUseResult": { + "mode": "files_with_matches", + "filenames": ["src/main.rs", "src/lib.rs"], + "numFiles": 2 + } +} +``` + +The **primary key** linking the two is `tool_use.id` ↔ +`tool_result.tool_use_id`. The envelope also provides a convenience +back-reference: the tool-result-carrying user entry has +`sourceToolAssistantUUID` pointing at the `uuid` of the issuing +assistant entry. + +## One assistant entry, multiple tool calls + +An assistant entry can include multiple `tool_use` parts in its +`content` array. Each gets its own matching `tool_result` part in +the following user entry's `content` array. + +```jsonc +// Assistant: two tool calls in one entry +{"content": [ + {"type": "tool_use", "id": "toolu_A", "name": "Read", "input": {...}}, + {"type": "tool_use", "id": "toolu_B", "name": "Grep", "input": {...}} +]} + +// User: two results matched by ID +{"content": [ + {"type": "tool_result", "tool_use_id": "toolu_A", "content": "..."}, + {"type": "tool_result", "tool_use_id": "toolu_B", "content": "..."} +]} +``` + +Parallel tool calls — multiple `tool_use` parts in a single assistant +entry — are a normal pattern that agents use to fan out reads. + +## Tool result `content` shape + +The `content` field on a `tool_result` is either a string or an array +of text-carrying objects: + +```jsonc +// String form +{"type": "tool_result", "tool_use_id": "...", "content": "file contents"} + +// Array form +{"type": "tool_result", "tool_use_id": "...", "content": [ + {"text": "line 1"}, + {"text": "line 2"} +]} +``` + +To recover the output, join array-form parts with `\n`. + +## The top-level `toolUseResult` + +The tool-result-carrying user entry may also have a top-level +`toolUseResult` field (sibling of `message`, not nested inside it). +This is a **structured summary** of the tool's output, populated for +tools that have structured outputs. + +Whether `toolUseResult` is present depends on the tool. Tools with +structured outputs emit it; tools whose output is a single blob of text +or a status string leave it absent. + +| Tool | Top-level `toolUseResult`? | Inline `tool_result.content`? | +|-------------|----------------------------|-------------------------------| +| `Read` | no | yes — file contents as string | +| `Write` | no | yes — success string | +| `Edit` | no | yes — diff or success string | +| `Bash` | no | yes — stdout/stderr as string | +| `Grep` | **yes** | yes — human-readable summary | +| `Glob` | **yes** | yes | +| `TodoWrite` | **yes** | yes | +| `Task` | **yes** | yes — agent summary | +| `WebSearch` | **yes** | yes | +| `WebFetch` | **yes** (when the fetch succeeds) | yes | + +MCP-provided tools follow their server's conventions rather than +Claude Code's. We have not seen `toolUseResult` populated for MCP tools +in the samples we've inspected. + +### Observed `toolUseResult` shapes + +**`Grep`:** +```json +{ + "mode": "files_with_matches", + "filenames": ["packages/.../MainHeader.svelte", "..."], + "numFiles": 79 +} +``` + +Alternative `mode` values: `"content"` (shows matching lines), +`"count"` (per-file match counts), mirroring the tool's `output_mode` +parameter. + +**`Glob`** (expected shape): +```json +{"filenames": ["src/a.rs", "src/b.rs"], "durationMs": 12} +``` + +**`TodoWrite`** (expected shape): +```json +{"todos": [{"content": "...", "status": "completed", "activeForm": "..."}, ...]} +``` + +**`Task`** (expected shape): +```json +{ + "agentId": "a7bf2fd", + "totalTokens": 12345, + "totalToolUseCount": 7, + "usage": {...}, + "result": "...summary text..." +} +``` + +### Spilled-to-disk outputs + +Very large tool outputs may be spilled to files under +`projects///tool-results/` rather than inlined into +`tool_result.content`. In that case the content field contains a +reference. We haven't captured a concrete example yet — parsers that +assume content is always inline may break on very long outputs. + +## Common tool `input` shapes + +The built-in tool set is **not stable** — Anthropic ships new tools, +renames old ones, and adds/removes input fields between Claude Code +releases. The list below is illustrative, not canonical; it captures +what we've observed in Claude Code 2.1.x samples. Consult Anthropic's +tool documentation for the authoritative current shapes, and treat any +unfamiliar field on a known tool as additive drift rather than a +parsing error. + +- **`Read`** — `{file_path, offset?, limit?, pages?}` +- **`Write`** — `{file_path, content}` +- **`Edit`** — `{file_path, old_string, new_string, replace_all?}` +- **`Bash`** — `{command, description?, timeout?, run_in_background?}` +- **`Grep`** — `{pattern, path?, glob?, type?, output_mode?, -i?, -A?, -B?, -C?, -n?, head_limit?, offset?, multiline?}` +- **`Glob`** — `{pattern, path?}` +- **`WebFetch`** — `{url, prompt}` +- **`WebSearch`** — `{query, allowed_domains?, blocked_domains?}` +- **`Task`** / **`Agent`** — `{description, prompt, subagent_type?}` +- **`TodoWrite`** — `{todos: [{content, status, activeForm}]}` +- **`NotebookEdit`** — `{notebook_path, cell_id?, new_source, cell_type?, edit_mode?}` + +Tools provided by MCP servers appear as `mcp____`. +The `input` schema is controlled by the MCP server, not Claude Code — +the list above has nothing to say about them. + +## Consumer concerns + +### Cross-boundary pairing + +When watching a live session, a `tool_use` may appear in one read +window and its `tool_result` in a later one. A consumer that emits +turns eagerly needs a "late join" step that merges the result back +into the earlier assistant turn once it arrives. + +### Permission prompts leave no trace + +Claude Code can gate tool execution behind a permission prompt. The +prompt itself and the user's accept/deny decision are **not** recorded +in the JSONL. The only signal is a gap in timestamps between the +`tool_use` and the `tool_result`. A denied tool call still eventually +produces a `tool_result` (with `is_error: true` or a denial message). + +### Synthesized user entries + +The user entry carrying `tool_result`s was not typed by a human. A +transcript-rendering UI should usually fold it into the preceding +assistant turn rather than show it as "the user said …". Detection: +the entry is `type: "user"`, the only content-part kinds are +`tool_result`, and there's a `sourceToolAssistantUUID` envelope field. diff --git a/docs/agents/formats/claude-code/usage.md b/docs/agents/formats/claude-code/usage.md new file mode 100644 index 0000000..a471376 --- /dev/null +++ b/docs/agents/formats/claude-code/usage.md @@ -0,0 +1,128 @@ +# Token usage accounting + +Assistant entries carry a `usage` object inside `message.usage` that +records the token counts Anthropic billed for that turn, including +prompt-cache statistics. The shape has grown over time and now mixes +flat fields with nested breakdowns that duplicate the flat totals. + +## Full observed shape + +```jsonc +{ + "input_tokens": 2, + "output_tokens": 21, + + "cache_creation_input_tokens": 12291, + "cache_read_input_tokens": 12361, + + "cache_creation": { + "ephemeral_5m_input_tokens": 0, + "ephemeral_1h_input_tokens": 12291 + }, + + "server_tool_use": { + "web_search_requests": 0, + "web_fetch_requests": 0 + }, + + "service_tier": "standard", + "inference_geo": "", + "speed": "standard", + + "iterations": [ + { + "input_tokens": 2, + "output_tokens": 21, + "cache_read_input_tokens": 12361, + "cache_creation_input_tokens": 12291, + "cache_creation": {"ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 12291}, + "type": "message" + } + ] +} +``` + +## Field reference + +### Core token counts + +- **`input_tokens`** — tokens sent in the request (input side, after + prompt-cache reads). +- **`output_tokens`** — tokens generated by the model in this response. + +### Prompt caching + +Prompt-cache counts are **separate from** `input_tokens`. The model +sees all three, but they're billed at different rates (cache writes +are more expensive than regular inputs; cache reads are cheaper). + +- **`cache_creation_input_tokens`** — tokens written to the prompt + cache during this call. +- **`cache_read_input_tokens`** — tokens retrieved from the prompt + cache during this call. +- **`cache_creation`** — a nested breakdown of + `cache_creation_input_tokens` by TTL bucket: + - `ephemeral_5m_input_tokens` — cached for ~5 minutes. + - `ephemeral_1h_input_tokens` — cached for ~1 hour. + + The two TTL buckets sum to the flat `cache_creation_input_tokens`. + Older versions of Claude Code only emitted the flat field; newer + versions emit both. + +### Server-side tool use + +- **`server_tool_use`** — counts of tool calls executed + server-side by Anthropic (as opposed to client-side by Claude Code). + Observed subfields: `web_search_requests`, `web_fetch_requests`. Other + server-tool subfields may appear as Anthropic ships new built-ins. + +### Billing / routing + +- **`service_tier`** — `"standard"` or `"batch"`. Determines billing + rate. Sessions run through batch API land in `"batch"`; everything + else is `"standard"`. +- **`inference_geo`** — geographic region the request was routed to. + Often empty string; otherwise a region code. +- **`speed`** — inference speed tier; `"standard"` is the only value + we've seen. + +### Agentic loop iterations + +- **`iterations`** — array of per-iteration usage objects for turns + that ran through an internal agentic loop (multi-step tool-use + cycles inside a single message API call). Each iteration has the + same shape as the enclosing `usage`. + + The enclosing totals (`input_tokens`, `output_tokens`, etc.) are + typically the sum of the iterations' corresponding fields. Not all + turns have `iterations`. + +## Older vs. newer shapes + +Versioning is gradual and largely additive. Rules of thumb: + +| Shape feature | Since | +|-----------------------------------------------------------|--------------| +| Flat `input_tokens` / `output_tokens` | always | +| Flat `cache_creation_input_tokens` / `cache_read_input_tokens` | always (for cached prompts) | +| Nested `cache_creation: {ephemeral_5m, ephemeral_1h}` | 2.0.x+ | +| `service_tier` | 2.0.x+ | +| `server_tool_use` | 2.1.x+ | +| `iterations` | 2.1.x+ | +| `inference_geo`, `speed` | 2.1.x+ (often empty) | + +The practical consequence: any field may be absent on an older entry. +A tool doing token accounting should treat everything but +`input_tokens` / `output_tokens` as `Optional` and default to +zero when summing. + +## Not recorded in `usage` + +- **Thinking tokens** are not reported separately. They are included + in the `output_tokens` total. +- **Sidechain usage** is recorded on the sidechain's own assistant + entries, not rolled up into the parent conversation's totals. + Cache-read tokens on sidechain entries often "mirror" the parent + because the prompt cache is shared. +- **User-entry token counts** are not recorded. Only assistant entries + carry `usage`. diff --git a/docs/agents/formats/claude-code/walkthrough.md b/docs/agents/formats/claude-code/walkthrough.md new file mode 100644 index 0000000..1bd5247 --- /dev/null +++ b/docs/agents/formats/claude-code/walkthrough.md @@ -0,0 +1,282 @@ +# Walkthrough: a session, line by line + +Every other doc in this directory is a field-by-field reference. This +one is the complement: a single representative session, read linearly, +with commentary at each line about what you're seeing and which doc +defines it. + +Use this to: + +- Orient yourself before diving into the reference docs. +- Check that your parser handles every common shape in order. +- Have a concrete example to point at when explaining the format to a + new contributor. + +The JSONL below is **synthesized for illustration** — UUIDs, +timestamps, and `tool_use` IDs are not taken from a real file. The +shapes, field names, and ordering follow what we've observed in real +sessions. For the exact field definitions, follow the cross-links. + +## Setting + +Claude Code was launched in `/Users/alex/demo`. The file we're reading +is at `~/.claude/projects/-Users-alex-demo/c0ff2e4a-1111-4222-9333-444455556666.jsonl`. +See [directory-layout.md](directory-layout.md) for how that path was +chosen. + +The user asked Claude to read a file and then summarize it. Then the +session got long enough to trigger a compaction, and the user asked a +follow-up. + +--- + +## Line 1 — `permission-mode` + +```json +{"type":"permission-mode","permissionMode":"default","sessionId":"c0ff2e4a-1111-4222-9333-444455556666"} +``` + +**Purpose:** records which permission mode was active when the session +started. + +**Things to notice:** + +- Exactly three fields. Adding anything — even `uuid: ""` or + `timestamp` — causes the loader to reject the line. See + [writing-compatible-jsonl.md §`permission-mode` entries must be + exactly three fields](writing-compatible-jsonl.md#2-permission-mode-entries-must-be-exactly-three-fields). +- `sessionId` matches the filename stem, so this is a **chain head** + (not a continuation file). See [session-chains.md](session-chains.md). + +--- + +## Line 2 — `user` (direct prompt) + +```json +{"type":"user","uuid":"u-0001","parentUuid":null,"sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T09:00:01.123Z","isSidechain":false,"userType":"external","entrypoint":"cli","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","message":{"role":"user","content":"what's in src/main.rs?"}} +``` + +**Purpose:** the first real user prompt. + +**Things to notice:** + +- `parentUuid: null` because this is the first conversation entry in + the session. See [jsonl-envelope.md §Identity / + position](jsonl-envelope.md#identity--position). +- `gitBranch: ""` (empty string, not `null`) because the cwd isn't a + git repo. See [known-issues.md §Things that look like bugs but + aren't](known-issues.md#things-that-look-like-bugs-but-arent). +- `message.content` is a **bare string**. Legal on user entries; + **illegal on assistant entries** — see line 3. See + [messages.md §The string-or-array `content` + trap](messages.md#the-string-or-array-content-trap). +- Envelope keys are camelCase (`parentUuid`, `sessionId`, + `isSidechain`, `userType`, `gitBranch`). API-adjacent keys inside + `message` will be snake_case when they appear (`stop_reason`, + `input_tokens`). See + [jsonl-envelope.md §Field conventions](jsonl-envelope.md#field-conventions). + +--- + +## Line 3 — `assistant` (tool call) + +```json +{"type":"assistant","uuid":"a-0001","parentUuid":"u-0001","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T09:00:02.456Z","isSidechain":false,"userType":"external","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","requestId":"req_01ABCD","message":{"role":"assistant","type":"message","id":"msg_01EFGH","model":"claude-opus-4-6","content":[{"type":"text","text":"I'll read it."},{"type":"tool_use","id":"toolu_001","name":"Read","input":{"file_path":"/Users/alex/demo/src/main.rs"}}],"stop_reason":"tool_use","stop_sequence":null,"usage":{"input_tokens":12,"output_tokens":48,"cache_creation_input_tokens":8421,"cache_read_input_tokens":0,"cache_creation":{"ephemeral_5m_input_tokens":0,"ephemeral_1h_input_tokens":8421},"service_tier":"standard"}}} +``` + +**Purpose:** Claude acknowledges and calls `Read`. + +**Things to notice:** + +- `message.content` is an **array**. Even a plain-text response is + wrapped as `[{"type":"text","text":"..."}]`. The loader will crash + on a bare-string content here. See + [writing-compatible-jsonl.md §Assistant `message.content` must always + be an array](writing-compatible-jsonl.md#1-assistant-messagecontent-must-always-be-an-array). +- Two parts in one entry — a `text` part *and* a `tool_use` part. + That's common: Claude narrates the action in the same entry that + issues the call. See [messages.md §Content part + types](messages.md#content-part-types). +- `stop_reason: "tool_use"` signals that the assistant expects a tool + result next. +- `requestId` is the Anthropic API request ID (prefix `req_`). + `message.id` is the Anthropic API message ID (prefix `msg_`). + Both are distinct from the envelope `uuid`. See + [jsonl-envelope.md §API correlation](jsonl-envelope.md#api-correlation). +- `usage.cache_creation` breaks down `cache_creation_input_tokens` + into 5-minute and 1-hour TTL buckets. See + [usage.md §Prompt caching](usage.md#prompt-caching). + +--- + +## Line 4 — `user` (tool result carrier) + +```json +{"type":"user","uuid":"u-0002","parentUuid":"a-0001","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T09:00:02.891Z","isSidechain":false,"userType":"external","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","sourceToolAssistantUUID":"a-0001","message":{"role":"user","content":[{"type":"tool_result","tool_use_id":"toolu_001","content":"fn main() { println!(\"hello\"); }\n","is_error":false}]}} +``` + +**Purpose:** carries the tool result back. **Not human-typed** — a +transcript UI should fold this into the previous assistant turn rather +than render it as a user message. + +**Things to notice:** + +- `type: "user"` but every content part is `tool_result`. Use the + `classify_user()` decision tree in + [entry-types.md §Classifying a user entry](entry-types.md#classifying-a-user-entry) + to recognize this. +- `sourceToolAssistantUUID: "a-0001"` is the convenience back-reference + to the issuing assistant entry. The primary link is + `tool_result.tool_use_id` ↔ `tool_use.id`. See + [tools.md §The two-entry lifecycle](tools.md#the-two-entry-lifecycle). +- No top-level `toolUseResult` — `Read` doesn't have a structured- + output summary. Compare to `Grep` in + [tools.md §The top-level `toolUseResult`](tools.md#the-top-level-tooluseresult). + +--- + +## Line 5 — `assistant` (text only, same logical turn) + +```json +{"type":"assistant","uuid":"a-0002","parentUuid":"u-0002","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T09:00:03.210Z","isSidechain":false,"userType":"external","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","requestId":"req_01IJKL","message":{"role":"assistant","type":"message","id":"msg_01MNOP","model":"claude-opus-4-6","content":[{"type":"text","text":"It's a minimal Rust program that prints \"hello\"."}],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":3,"output_tokens":14,"cache_read_input_tokens":8421,"service_tier":"standard"}}} +``` + +**Things to notice:** + +- `stop_reason: null` even though the turn ended cleanly. That's a + timing artifact — the entry was written before the streaming API + response finalized. Don't infer "turn still in flight" from this. + See [known-issues.md §Missing `stop_reason` on assistant + entries](known-issues.md#missing-stop_reason-on-assistant-entries). +- One logical assistant turn spans **two** `assistant` entries + (`a-0001` with the tool call, `a-0002` with the text reply). Never + assume one `assistant` JSONL entry = one turn. See + [entry-types.md §`assistant`](entry-types.md#assistant). +- `cache_read_input_tokens` is high while `cache_creation_input_tokens` + is zero — the prompt cache warmed up on the previous turn is now + being read. + +--- + +## Line 6 — `system` (turn_duration) + +```json +{"type":"system","subtype":"turn_duration","uuid":"s-0001","parentUuid":"a-0002","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T09:00:03.215Z","durationMs":2092,"messageCount":4} +``` + +**Purpose:** authoritative "turn ended" signal. More reliable than +`stop_reason` for detecting end-of-turn. + +**Things to notice:** + +- `subtype` is an open enumeration. Treat unknown values as "log and + keep." See + [entry-types.md §`system`](entry-types.md#system). +- `messageCount: 4` refers to the four entries in this turn + (`u-0001`, `a-0001`, `u-0002`, `a-0002`). + +--- + +## …much later in the session… + +After many more turns, the conversation approaches the context window +limit and triggers an autocompaction. + +## Line 117 — `compact_boundary` + +```json +{"type":"compact_boundary","uuid":"cb-0001","parentUuid":null,"logicalParentUuid":"a-0116","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T10:14:52.000Z","compactMetadata":{"trigger":"auto","preTokens":180000}} +``` + +**Things to notice:** + +- `parentUuid: null` — the DAG resets here. The real prior message + lives in `logicalParentUuid`. See + [session-chains.md §The boundary + marker](session-chains.md#the-boundary-marker). +- In some versions this shows up as `type: "system"` with + `subtype: "compact_boundary"` instead of the top-level + `compact_boundary` type. Treat as equivalent. See + [format-changelog.md](format-changelog.md). +- `compactMetadata.trigger: "auto"` — the client compacted on its own, + not because the user ran `/compact`. + +## Line 118 — synthetic summary (user entry) + +```json +{"type":"user","uuid":"u-0118","parentUuid":"cb-0001","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T10:14:52.010Z","isSidechain":false,"userType":"external","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","isCompactSummary":true,"isVisibleInTranscriptOnly":true,"message":{"role":"user","content":"Summary: user asked to read src/main.rs; I reported it's a minimal hello-world Rust program. We then…"}} +``` + +**Things to notice:** + +- `isCompactSummary: true` flags this as the synthesized summary, not + real user input. +- `isVisibleInTranscriptOnly: true` tells the loader: show it in the + UI but don't replay it to the model (the model already sees the + real compacted context on the API side). +- UIs should mark this specially — treating it as "the user said X" + will confuse consumers. + +--- + +## Line 119 — `user` (real follow-up after compaction) + +```json +{"type":"user","uuid":"u-0119","parentUuid":"u-0118","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T10:15:10.000Z","isSidechain":false,"userType":"external","entrypoint":"cli","cwd":"/Users/alex/demo","gitBranch":"","version":"2.1.112","message":{"role":"user","content":"can you add error handling?"}} +``` + +Back to a normal direct prompt. Life resumes. + +--- + +## What a rotation would look like + +Say this session gets long enough that Claude Code rotates to a new +file. The next file (`d1eeeeee-2222-4333-a444-555566667777.jsonl`) +would open with a **bridge entry**: + +```json +{"type":"user","uuid":"u-0200","parentUuid":"a-0199","sessionId":"c0ff2e4a-1111-4222-9333-444455556666","timestamp":"2026-04-23T10:40:00.000Z","...":"..."} +``` + +Note the `sessionId` is the **previous** session's ID, not the new +file's. That mismatch is the primary signal that this file is a +continuation. See [session-chains.md §Detecting a +continuation](session-chains.md#detecting-a-continuation). A reader +reconstructing the logical conversation should **skip the bridge +entry** — it's a replay of the parent's last line. + +Subsequent entries in the new file carry the new `sessionId` +(`d1eeeeee-…`). + +## What a sidechain would look like + +A `Task` tool call spawns a subagent. In 2.1.x the sidechain entries +are **inlined into the same JSONL**, marked by: + +- `isSidechain: true` on every entry in the subagent thread +- `agentId` — a short hash identifying the subagent (e.g. `"a7bf2fd"`) +- The thread root has `parentUuid: null`; its `message.content` is + the Task tool's input prompt + +Sidechain `usage` is accounted on the sidechain's own assistant +entries and does not roll up into the parent conversation's totals. +See [entry-types.md §Sidechains](entry-types.md#sidechains). + +--- + +## If you made it here + +You now have a mental model for the line-level shape of a Claude Code +session. The reference docs fill in the corners: + +- The full envelope field catalogue: [jsonl-envelope.md](jsonl-envelope.md). +- Every entry type and its variants: [entry-types.md](entry-types.md). +- The `message` object and content parts: [messages.md](messages.md). +- Tool-call lifecycle in detail: [tools.md](tools.md). +- Token accounting: [usage.md](usage.md). +- Multi-file chains and compaction: [session-chains.md](session-chains.md). +- Everything outside the session JSONL: [peripheral-files.md](peripheral-files.md). +- Rules for writing your own: [writing-compatible-jsonl.md](writing-compatible-jsonl.md). +- Bugs, drift, corruption modes: [known-issues.md](known-issues.md). +- Version-keyed field history: [format-changelog.md](format-changelog.md). diff --git a/docs/agents/formats/claude-code/writing-compatible-jsonl.md b/docs/agents/formats/claude-code/writing-compatible-jsonl.md new file mode 100644 index 0000000..30d3782 --- /dev/null +++ b/docs/agents/formats/claude-code/writing-compatible-jsonl.md @@ -0,0 +1,182 @@ +# Writing JSONL that Claude Code will load + +If your tool produces session JSONL that you want Claude Code to load +(for resume, for replay, for synthesizing a starting state), there are +rules beyond "valid JSON per line." The loader is strict about several +things. This document captures the constraints we've discovered +empirically; the list is not exhaustive. + +## Where Claude Code looks + +Claude Code reads sessions from +`~/.claude/projects//*.jsonl`. To make a file visible to +`claude --resume` or the session picker: + +1. Determine the **canonical** absolute path of the intended working + directory (on macOS, resolve `/tmp` → `/private/tmp`). +2. Sanitize by replacing `/` and `_` with `-`. +3. Create the directory at `~/.claude/projects//` if it + doesn't exist. +4. Write `.jsonl` inside it. The UUID must be a UUIDv4; + the filename stem is the session ID. + +See [directory-layout.md](directory-layout.md) for details. + +## Hard rules + +These are empirically confirmed. Violating them causes the loader to +reject or mis-render the session. + +### 1. Assistant `message.content` must always be an array + +Claude Code's loader unconditionally calls array methods +(`.map(…)` and similar) on `message.content` for `assistant` entries. +A bare-string content crashes the loader — even for a plain text +response. + +```jsonc +// ❌ Broken +{"type": "assistant", "message": {"role": "assistant", "content": "Done."}} + +// ✅ Correct +{"type": "assistant", "message": {"role": "assistant", + "content": [{"type": "text", "text": "Done."}]}} +``` + +User entries tolerate both shapes. + +### 2. `permission-mode` entries must be exactly three fields + +Observed legal shape: + +```json +{"type": "permission-mode", "permissionMode": "default", "sessionId": "..."} +``` + +Adding any additional field — even seemingly benign ones like `uuid: ""`, +`isSidechain: false`, or a `timestamp` — causes the loader to reject +the entry. If you're using a typed serializer that emits default +values, emit this entry as raw JSON instead. + +### 3. `cwd` must be the canonical path + +The `cwd` on an entry must match the pre-sanitization form of the +directory name under `projects/`. If you place a file in +`projects/-private-tmp-foo/` but put `"cwd": "/tmp/foo"` on the +entries, Claude Code's behavior is inconsistent — the session may +appear but tool invocations may fail. + +## Strong conventions + +Real Claude Code entries always carry these envelope fields. The +loader does not reject their absence outright, but UI rendering and +downstream tools may behave oddly without them: + +| Field | Why it matters | +|---------------|----------------| +| `userType` | `"external"` on all typical user-written entries. | +| `entrypoint` | `"cli"` (or appropriate value for the launching surface). | +| `cwd` | Used for git/branch context and to validate project alignment. | +| `version` | Claude Code client version string. | +| `gitBranch` | Current branch; **empty string**, not null, when cwd isn't a repo. | +| `sessionId` | Must match the file's session UUID (or be a bridge entry — see below). | +| `uuid` | UUIDv4 per entry. | +| `timestamp` | ISO-8601 with millisecond precision. | +| `parentUuid` | The prior entry's `uuid` (or `null` for the first entry). | +| `isSidechain` | Explicit `false` on non-sidechain entries. | + +## Structural rules + +### Preserve `tool_use` / `tool_result` pairing + +Every `tool_use` part in an assistant entry must be followed by a +corresponding user entry carrying a `tool_result` part with the same +`tool_use_id`. Assistant entries whose `stop_reason` is `"tool_use"` +without a matching result in the following user entry confuse the +loader. + +If you're constructing synthetic conversation state with tool calls, +you must synthesize plausible results. + +### Preserve `thinking` signatures + +If you include `thinking` parts in assistant content, each must carry +a valid `signature`. A `thinking` part without a signature (or with a +signature that doesn't verify) will be **dropped** when the session is +resumed — the API won't replay it to the model. If you're constructing +synthetic state that includes thinking, consider whether you actually +need it; constructing a valid signature is not possible without an +Anthropic-signed source. + +### Assistant entries typically have non-null `model`, `id`, `type` + +- `model` — e.g. `"claude-opus-4-6"`. +- `id` — Anthropic-format API message ID (prefix `msg_`). Synthesized + IDs should follow the shape `msg_`. +- `type: "message"` — literal. + +Omitting these may render OK in the transcript but break resume. + +### `stop_reason` tolerance + +`stop_reason` is often `null` on real entries (written before the +stream finalizes), so the loader tolerates `null`. Safe values for +synthetic entries: `"end_turn"`, `"tool_use"`, `null`. + +## Round-trip rules + +If you read a session and want to write it back out losslessly: + +- **Preserve unknown fields.** A field you don't recognize on read must + be round-tripped on write. New Claude Code fields arrive + additively. +- **Preserve field order at your peril.** The loader doesn't care, but + diff tools and humans do. Emit envelope fields before `message` + for readability. +- **Preserve `content` array ordering.** Text-before-tool_use and + thinking-before-text within a single assistant entry matter for + replay semantics. + +## Starting a fresh session file + +Minimum ingredients to make `claude --resume ` pick up a new +session: + +1. Directory `~/.claude/projects//` exists. +2. File `.jsonl` exists inside it. +3. First line is a valid `permission-mode` entry for that session + (three fields only). +4. Subsequent lines form a valid conversation — at minimum, a single + `user` entry with the envelope fields from the strong-conventions + table and a string or array `message.content`. + +That's enough for Claude Code to see the session. A functioning +"continuable" session needs at least one assistant turn with a clean +`stop_reason` for resume to know where to pick up. + +## What we haven't verified + +- Exact tolerance for truncated or missing `usage` objects on + synthetic assistant entries. +- Whether `toolUseResult` top-level summaries are ever consulted by + the loader, or are purely for UI/debugging. +- How strict the loader is about `messageId` uniqueness across + `file-history-snapshot` entries (given known collision bugs exist + in Anthropic's own output). + +When in doubt: copy a real entry's envelope shape field-for-field. + +## Reference implementation + +`path incept` is the in-tree tool that produces Claude-compatible JSONL +from a Toolpath document. Its implementation is the working example of +every rule in this doc: + +- CLI entry point: [`crates/toolpath-cli/src/cmd_incept.rs`](../../../../crates/toolpath-cli/src/cmd_incept.rs) +- Projection logic: `ClaudeProjector` in + [`crates/toolpath-claude/src/project.rs`](../../../../crates/toolpath-claude/src/project.rs) + (implements `toolpath_convo::ConversationProjector`) + +When you add a new rule here, update `cmd_incept.rs` or `project.rs` +in the same change; when you discover a rule empirically by fixing a +bug there, capture it here.