feat: add chainweaver diff <a> <b> CLI for step-by-step trace comparison (#148)#160

Open
dgenio wants to merge 1 commit into feat/147-cli-profile-subcommand from feat/148-cli-diff-subcommand

@dgenio dgenio commented May 16, 2026

Summary

Adds chainweaver diff <a.json> <b.json> so operators can compare two ExecutionResult JSON files step-by-step. Sister tool to profile (#159).

Stacked on top of #159 (profile CLI); base cascades through #159 → #158 → #157 → main as those merge.

Closes #148.

Changes

  • pyproject.toml — adds deepdiff>=8.0 to [project.dependencies]. Justification below.
  • chainweaver/cli.py — new diff_command, _compare_traces (structural comparison), _step_outputs_diff (DeepDiff-backed), _format_diff_table (human renderer).
  • tests/test_cli_diff.py — new file, 13 test cases.

Behavior

Aligns step records by position. Walks outputs, error_type, error_message, success for each pair. Optionally flags per-step duration regressions beyond --perf-tolerance N%. Non-deterministic fields (trace_id, started_at, ended_at, total_duration_ms, per-step duration_ms when no tolerance is set) are ignored by default.
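The per-pair walk described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_compare_traces`: the helper names `compare_step_pair` and `compare_steps` are hypothetical, and it assumes each step record is a plain dict keyed by the field names listed in the paragraph. Ignored fields such as `duration_ms` are skipped simply because they are not in the compared set.

```python
from typing import Any

# Field names come from the PR description; the dict layout is an assumption.
COMPARED_STEP_FIELDS = ("outputs", "error_type", "error_message", "success")


def compare_step_pair(step_a: dict, step_b: dict) -> dict:
    """Return {field: {"a": ..., "b": ...}} for each compared field that differs."""
    deltas: dict[str, dict[str, Any]] = {}
    for field in COMPARED_STEP_FIELDS:
        value_a = step_a.get(field)
        value_b = step_b.get(field)
        if value_a != value_b:
            deltas[field] = {"a": value_a, "b": value_b}
    return deltas


def compare_steps(steps_a: list, steps_b: list) -> list:
    """Pair steps by index and collect deltas; unmatched trailing steps are not handled here."""
    report = []
    for index, (step_a, step_b) in enumerate(zip(steps_a, steps_b)):
        deltas = compare_step_pair(step_a, step_b)
        if deltas:
            report.append({"index": index, "deltas": deltas})
    return report
```

Two steps that differ only in `duration_ms` produce an empty report under this sketch, which mirrors the "ignored by default" behavior.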

Flag                        Meaning
--perf-tolerance N          Flag steps whose duration_ms changed by more than N%. Off by default.
--format table|json (-f)    Default table shows structural deltas; json emits the structured diff payload.
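One plausible reading of the `--perf-tolerance` check is a symmetric percentage change relative to the first trace. The sketch below assumes that reading; the function name and the zero-baseline rule are hypothetical and are not taken from the PR.

```python
from typing import Optional


def duration_changed_beyond(
    baseline_ms: float, candidate_ms: float, tolerance_pct: Optional[float]
) -> bool:
    """Return True when duration changed by more than tolerance_pct percent.

    tolerance_pct=None models the default (perf checks off).
    """
    if tolerance_pct is None:
        return False  # durations are ignored unless the user opts in
    if baseline_ms == 0:
        # Assumption: any nonzero candidate counts as a change from a zero baseline.
        return candidate_ms != 0
    change_pct = abs(candidate_ms - baseline_ms) / baseline_ms * 100.0
    return change_pct > tolerance_pct
```

Using a strict `>` means a change of exactly N% is not flagged, matching the "more than N%" wording in the table.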

Exit codes

  • 0 — identical (modulo ignored fields).
  • 1 — differs, or malformed trace input.
  • 2 — file not found.
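The exit-code contract above could be wired up along these lines. This is a sketch under assumptions: `diff_exit_code`, `load_trace`, the constant names, and the `compare` callable are hypothetical, and "malformed" is narrowed here to invalid JSON, whereas the real command presumably also validates the trace schema.

```python
import json
from pathlib import Path
from typing import Callable

EXIT_IDENTICAL = 0
EXIT_DIFFERS_OR_MALFORMED = 1
EXIT_FILE_NOT_FOUND = 2


def load_trace(path: str) -> dict:
    """Read and parse one trace file (raises FileNotFoundError or JSONDecodeError)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_exit_code(path_a: str, path_b: str, compare: Callable[[dict, dict], list]) -> int:
    """Map load failures and comparison results onto the documented exit codes."""
    try:
        trace_a = load_trace(path_a)
        trace_b = load_trace(path_b)
    except FileNotFoundError:
        return EXIT_FILE_NOT_FOUND
    except json.JSONDecodeError:
        return EXIT_DIFFERS_OR_MALFORMED
    differences = compare(trace_a, trace_b)  # empty list means identical
    return EXIT_DIFFERS_OR_MALFORMED if differences else EXIT_IDENTICAL
```

Because codes 1 and 2 are distinct, a CI script can tell "the traces diverged" apart from "a path was wrong" without parsing output.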

Testing

  • Linting passes (ruff check chainweaver/ tests/ examples/)
  • Formatting check passes (ruff format --check chainweaver/ tests/ examples/)
  • Type checking passes (python -m mypy chainweaver/ tests/)
  • All existing tests pass — 504/504 passed in 1.99s (491 pre-existing + 13 new)
  • New tests added for new functionality
$ ruff check chainweaver/ tests/ examples/
All checks passed!
$ ruff format --check chainweaver/ tests/ examples/
51 files already formatted
$ python -m mypy chainweaver/ tests/
Success: no issues found in 44 source files
$ python -m pytest tests/ -q --no-cov
504 passed in 1.99s

Diff stat: 3 files changed, 580 insertions(+).

Related Issues

Closes #148. Sister to chainweaver profile (#159 / #147).

Checklist

  • Code follows project conventions (see AGENTS.md and docs/agent-context/)
  • Public API changes are documented — CLI docstring updated; one new runtime dep documented below
  • No secrets or credentials included

Tradeoffs / risks

  • New runtime dependency: deepdiff>=8.0. Justification: hand-rolling recursive nested-dict diff would add ~150 LoC of fragile code, and the issue body specifically calls for "JSON-aware diff". DeepDiff is small (~150 KB), well-maintained, has stable cross-platform wheels for Python 3.10–3.13, and supports the tree view with to_dict() for JSON-safe output. This brings the runtime-deps total from 4 → 5 (still well within the "lean dep set" spirit; all five are well-known and CLI-essential). Cleared per the relaxed-constraints answer ("I don't mind adding new dependencies").
  • DeepDiff API surface is large; we only use DeepDiff(a, b, ignore_order=True, view="tree").to_dict(). Wrapping it in _step_outputs_diff keeps the surface area minimal so a future swap is local.
  • Performance-tolerance off by default: matches the issue's "non-deterministic fields ignored by default" framing. Users opt in to perf checks explicitly.
  • Step alignment is positional: tool-name renames at the same index get flagged via tool_name_change rather than being treated as separate insert/delete events. This is the simplest reasonable contract for now; reordered-steps detection is out of scope.
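To make the positional-alignment tradeoff concrete, here is a small sketch of index-based pairing. Only the `tool_name_change` label comes from the description above; the `align_steps` helper, the `tool_name` key, and the `only_in_a` / `only_in_b` labels for mismatched step counts are assumptions made for illustration.

```python
from itertools import zip_longest


def align_steps(steps_a: list, steps_b: list) -> list:
    """Pair steps strictly by index and report alignment-level events."""
    events = []
    for index, (step_a, step_b) in enumerate(zip_longest(steps_a, steps_b)):
        if step_a is None:
            # Trace B has more steps than trace A.
            events.append({"index": index, "kind": "only_in_b"})
        elif step_b is None:
            # Trace A has more steps than trace B.
            events.append({"index": index, "kind": "only_in_a"})
        elif step_a.get("tool_name") != step_b.get("tool_name"):
            events.append(
                {
                    "index": index,
                    "kind": "tool_name_change",
                    "a": step_a.get("tool_name"),
                    "b": step_b.get("tool_name"),
                }
            )
    return events
```

Under this contract, swapping two adjacent steps shows up as two `tool_name_change` events rather than one "moved" event, which is the limitation the bullet calls out of scope.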

Scope notes

Closes #148 only. Adjacent items:

  • DeepDiff for profile — the profile verb could use it too for richer output rendering. Out of scope here; profile lands first and its current statistics-based approach is fine.
  • Replay-from-diff helper — once chainweaver diff lands, a follow-up could let chainweaver replay --diff re-execute only the diverging steps. Tracked separately if needed.

https://claude.ai/code/session_01QcSJ3NWhe5B4k1EP25Hx3n


Generated by Claude Code

Closes #148.

Compares two ExecutionResult JSON files step-by-step. Aligns step
records by position; walks outputs / error_type / error_message /
success; optionally flags per-step duration regressions beyond a
configurable threshold. Non-deterministic fields (trace_id, timestamps,
total/per-step durations) are ignored by default.

Usage:

    chainweaver diff yesterday.json today.json
    chainweaver diff base.json candidate.json --perf-tolerance 25
    chainweaver diff a.json b.json --format json

Exit codes:
- 0 — identical (modulo ignored fields).
- 1 — differs, or malformed input.
- 2 — file not found.

Implementation:
- chainweaver/cli.py — new `diff_command`, `_compare_traces` (structural
  comparison), `_step_outputs_diff` (DeepDiff-backed), and
  `_format_diff_table` (human-readable renderer).
- DeepDiff is a new required runtime dependency (`deepdiff>=8.0`).
  Hand-rolling recursive dict diff would add ~150 LoC of fragile code;
  DeepDiff is small, well-maintained, and matches the issue's
  "JSON-aware diff" requirement out of the box.

Tests: 13 new cases in tests/test_cli_diff.py covering:
- Identity: identical traces with different trace_ids return exit 0,
  JSON output shape stable.
- Divergence: different flow_names, diverging step outputs (table +
  JSON), error vs success transitions, mismatched step counts.
- Performance tolerance: within / exceeds / off-by-default semantics.
- File errors: missing first file (exit 2), missing second file
  (exit 2), malformed trace (exit 1).

Verification:
  $ ruff check chainweaver/ tests/ examples/      # All checks passed
  $ ruff format --check chainweaver/ tests/ ...   # 51 files already formatted
  $ python -m mypy chainweaver/ tests/            # Success: no issues
  $ python -m pytest tests/ -q --no-cov           # 504 passed in 1.99s

Stacked on top of #159 (profile CLI); base cascades through
#159 → #158 → #157 → main as those merge.

https://claude.ai/code/session_01QcSJ3NWhe5B4k1EP25Hx3n