chore: add pre-commit config mirroring AGENTS.md §7 validation commands (#137) #165

Open
dgenio wants to merge 4 commits into main from claude/triage-issues-7JgBE

dgenio commented May 16, 2026

Adds .pre-commit-config.yaml at the repo root with hooks that match
the four validation commands verbatim — same paths, same flags — so a
clean pre-commit run --all-files is the strongest local signal that
CI will pass.

Hooks:

  • ruff check chainweaver/ tests/ examples/
  • ruff format --check chainweaver/ tests/ examples/
  • python -m mypy chainweaver/ tests/
  • gitleaks (secret scanning; richer than detect-private-key alone)
  • actionlint (syntax check for .github/workflows/*.yml)
  • stdlib pre-commit-hooks (trailing-whitespace, end-of-file-fixer,
    check-yaml, check-toml, check-merge-conflict, check-added-large-files,
    detect-private-key)

All hook versions are pinned. CI workflows are unchanged.
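
As a rough illustration of what the hooks mirror, the three commands in the list above can be run in sequence from a small script. This is a hypothetical helper and not part of the PR: the name `run_validation` and the injectable `runner` parameter are assumptions, while the command strings are copied from the hook list.

```python
import shlex
import subprocess

# Commands copied verbatim from the hook list in the PR description.
VALIDATION_COMMANDS = [
    "ruff check chainweaver/ tests/ examples/",
    "ruff format --check chainweaver/ tests/ examples/",
    "python -m mypy chainweaver/ tests/",
]


def run_validation(commands=VALIDATION_COMMANDS, runner=subprocess.run):
    """Run each command in order and return the ones that exited non-zero.

    `runner` is injectable (hypothetical design choice) so the loop can be
    exercised without the real tools installed.
    """
    failed = []
    for command in commands:
        result = runner(shlex.split(command), check=False)
        if result.returncode != 0:
            failed.append(command)
    return failed
```

An empty return value corresponds to all three commands passing; the pre-commit configuration adds the secret-scanning, workflow-lint, and hygiene hooks on top of that.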

CONTRIBUTING.md gains a "Pre-commit hooks" subsection documenting
install steps and the no-bypass policy. .github/copilot-instructions.md
cross-links to it.

Closes #137

claude added 4 commits May 16, 2026 16:46
…ds (#137)

Adds a mechanical guard against accidental public-API breakage.
`tests/test_public_api_snapshot.py` compares the live
`chainweaver.__all__` surface against a checked-in golden file
(`tests/fixtures/public_api.json`); CI fails if any of these drift
without an accompanying regen:

- a symbol added to or removed from `__all__`,
- a class's public attribute or method shape,
- a public function's signature or return annotation,
- a Pydantic model's field set or field types.

The introspection helper (`tests/api_surface.py`) uses griffe to
extract surface info from source AST rather than runtime objects, so
the snapshot is stable across Python versions (no `repr(int | str)` vs
`repr(Union[int, str])` skew). `griffe>=2.0` lands in the `[dev]`
extra.
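
The version-skew point can be illustrated with the standard library alone. The sketch below uses `ast` instead of griffe (a deliberate substitution, so this is not the PR's helper) to read a function's annotations as source text without evaluating them. The function name `signature_from_source` and the sample source are hypothetical.

```python
import ast


def signature_from_source(source: str, func_name: str) -> dict:
    """Return parameter and return annotations exactly as written in source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            params = {
                arg.arg: ast.unparse(arg.annotation) if arg.annotation else None
                for arg in node.args.args
            }
            returns = ast.unparse(node.returns) if node.returns else None
            return {"params": params, "returns": returns}
    raise LookupError(f"function {func_name!r} not found")


# Hypothetical sample; the annotations are never evaluated, only unparsed.
SAMPLE = '''
def lookup(key: int | str, default: str = "") -> str | None:
    return None
'''

print(signature_from_source(SAMPLE, "lookup"))
```

Because the annotation is kept as text, the recorded value is `int | str` regardless of how a given interpreter would `repr` the evaluated type.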

Run `python tests/scripts/regen_public_api.py` after an intentional API
change to regenerate the fixture; the diff travels in the same PR as
the surface delta and reviewers can read it directly.
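
A minimal sketch of the golden-file comparison, assuming the snapshot is a flat mapping from symbol name to a JSON-serializable description. The real fixture layout is not shown in this PR text, so the shape, the `diff_surface` name, and the `timeout` parameter are all hypothetical.

```python
import json


def diff_surface(golden: dict, live: dict) -> list[str]:
    """List human-readable differences between two surface snapshots."""
    problems = []
    for name in sorted(golden.keys() - live.keys()):
        problems.append(f"removed: {name}")
    for name in sorted(live.keys() - golden.keys()):
        problems.append(f"added: {name}")
    for name in sorted(golden.keys() & live.keys()):
        if golden[name] != live[name]:
            problems.append(f"changed: {name}")
    return problems


# Hypothetical golden snapshot, as it might be loaded from a checked-in file.
golden = json.loads(
    '{"Flow": {"kind": "class"}, "execute_flow": {"params": ["flow", "x"]}}'
)
# Hypothetical live surface after an API change that was not regenerated.
live = {
    "Flow": {"kind": "class"},
    "execute_flow": {"params": ["flow", "x", "timeout"]},
    "DAGFlow": {"kind": "class"},
}

print(diff_surface(golden, live))
```

A test built this way would fail whenever the returned list is non-empty, and regenerating the fixture is what clears it.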

Three acceptance-criteria smoke tests pass:
- Removing a symbol from `__all__` fails CI.
- Changing a public function's signature fails CI.
- Adding/removing a Pydantic model field fails CI.

`docs/versioning-policy.md` gains a "Public-API snapshot guard"
section; `docs/agent-context/review-checklist.md` gains a regen
reminder.

Closes #140

Adds three property test families covering the headline "compiled, not
interpreted; same input + same tools = same output" claim:

1. Idempotence — `execute_flow(F, x)` produces identical `final_output`
   and identical per-step `outputs` across N successive runs (volatile
   fields: `trace_id`, timestamps, durations excluded by name).
2. Serialization round-trip — `flow_from_yaml(flow_to_yaml(F)).execute(x)`
   matches `F.execute(x)`. Same for the JSON path.
3. DAG-equivalence — a linear `Flow` and the trivially-sequential
   `DAGFlow` (one node per step, `depends_on` chain) produce identical
   `final_output`.
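
Properties 2 and 3 can be sketched on a toy model. Nothing below uses the chainweaver API: the `TOOLS` table, `run_linear`, `run_dag`, and `linear_to_dag` are hypothetical stand-ins, a flow is reduced to a list of tool names, and plain loops replace Hypothesis strategies.

```python
import json
from graphlib import TopologicalSorter

# Toy stand-ins for two of the helper tools named in the commit message.
TOOLS = {
    "double": lambda v: v * 2,
    "add_ten": lambda v: v + 10,
}


def run_linear(steps, x):
    """Apply each named tool in list order."""
    for name in steps:
        x = TOOLS[name](x)
    return x


def linear_to_dag(steps):
    """Build one node per step, each depending on the previous node."""
    return {
        f"n{i}": {"tool": name, "depends_on": [f"n{i - 1}"] if i else []}
        for i, name in enumerate(steps)
    }


def run_dag(nodes, x):
    """Run nodes in dependency order, threading the value along the chain."""
    graph = {node_id: set(spec["depends_on"]) for node_id, spec in nodes.items()}
    for node_id in TopologicalSorter(graph).static_order():
        x = TOOLS[nodes[node_id]["tool"]](x)
    return x


def check_properties(steps, inputs):
    """Return True if round-trip and DAG-equivalence hold for every input."""
    restored = json.loads(json.dumps(steps))  # serialization round-trip
    dag = linear_to_dag(steps)
    return all(
        run_linear(restored, x) == run_linear(steps, x) == run_dag(dag, x)
        for x in inputs
    )


print(check_properties(["double", "add_ten", "double"], range(-3, 4)))
```

In the PR the inputs come from Hypothesis rather than a fixed range, which is what allows shrinking to a minimal counter-example when a property fails.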

Strategies are derived from `helpers.py` schemas via
`hypothesis_jsonschema.from_schema` (no hand-coded shape — strategies
stay in sync with the Pydantic models). Flow shape is generated from
the existing helper toolbelt (`double`, `add_ten`, `format_result`);
arbitrary Pydantic schemas at runtime are intentionally out of scope.

Property settings: `max_examples=50`, `deadline=200ms` per the issue.
Total wall-clock cost in the default test run: ~5 s.

Smoke-tested: injecting `random.random()` into `_double_fn` causes
property test (1) to fail and Hypothesis shrinks to a minimal
counter-example.
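
The idempotence check and the smoke test can be approximated without Hypothesis. In this sketch the executor, the result layout, and the field names `started_at` and `duration_ms` are hypothetical; only `trace_id` and the idea of excluding volatile fields by name come from the commit message.

```python
import random
import time
import uuid

# Fields that legitimately differ between runs and are excluded by name.
VOLATILE_FIELDS = {"trace_id", "started_at", "duration_ms"}


def execute(tool, x):
    """Toy executor returning an output plus per-run metadata."""
    started = time.time()
    output = tool(x)
    return {
        "final_output": output,
        "trace_id": uuid.uuid4().hex,
        "started_at": started,
        "duration_ms": (time.time() - started) * 1000,
    }


def stable_view(result):
    """Drop volatile fields so only deterministic content is compared."""
    return {k: v for k, v in result.items() if k not in VOLATILE_FIELDS}


def is_idempotent(tool, x, runs=5):
    """True when every run yields the same stable view as the first."""
    views = [stable_view(execute(tool, x)) for _ in range(runs)]
    return all(view == views[0] for view in views)


def double(v):
    return v * 2


def noisy_double(v):
    # Mirrors the smoke test: randomness injected into an otherwise pure tool.
    return v * 2 + random.random()
```

With these definitions, `is_idempotent(double, 21)` holds while `is_idempotent(noisy_double, 21)` does not, which is the failure the smoke test is meant to provoke.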

`pyproject.toml` adds `hypothesis>=6.150` and `hypothesis-jsonschema>=0.23`
to `[dev]` (consolidated into the existing extra rather than a separate
`[test]` extra — single install surface). Marker `property` registered.
`pythonpath = ["tests", "tests/property"]` exposes `helpers` and
`strategies` as bare-name imports.

CI adds a `pytest -m property --hypothesis-show-statistics` step on the
Ubuntu / 3.10 lane so the Hypothesis seed is preserved in CI logs for
reproduction. Property tests also run as part of the default matrix
test step on all 12 jobs.

`docs/agent-context/workflows.md` documents the property-test
conventions.

Closes #143

Adds a new CI workflow that fails PRs whose median wall-clock for the
naive-vs-compiled benchmark regresses beyond 125% of the gh-pages
baseline. Wall-clock is the right signal for the "compiled, not
interpreted" claim; CodSpeed-style instrumentation is overkill here.

Workflow (`.github/workflows/bench.yml`):
- Pinned to `ubuntu-22.04` (glibc-driven variance — see
  https://codspeed.io/blog/unrelated-benchmark-regression).
- No matrix (macOS / Windows wall-clock too noisy for a hard gate).
- Invokes `benchmark-action/github-action-benchmark@v1` with
  `tool: customSmallerIsBetter`, `auto-push: true`,
  `alert-threshold: 125%`, `fail-on-alert: true`.
- Permissions: contents:write (auto-push to gh-pages),
  pull-requests:write (regression comment on PRs).
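
The gate can be read as a ratio test. The sketch below assumes the alert fires when the current value exceeds the baseline multiplied by the threshold; the function name `regressed` and the sample timings are hypothetical, and the real comparison is performed by the action, not by code in this repository.

```python
from statistics import median


def regressed(current_samples, baseline_value, threshold_percent=125.0):
    """True when the median of the samples exceeds the allowed ratio.

    For a smaller-is-better metric, 125% means the run may be up to
    1.25x the baseline before it counts as a regression.
    """
    current = median(current_samples)
    return current > baseline_value * (threshold_percent / 100.0)


# Hypothetical timings in seconds against a 2.7 s baseline.
print(regressed([2.6, 2.7, 2.8], baseline_value=2.7))
print(regressed([3.4, 3.5, 3.6], baseline_value=2.7))
```

With a 2.7 s baseline the cutoff is 3.375 s, so the first set of samples passes and the second would fail the PR.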

Bench script (`benchmarks/bench_naive_vs_compiled.py`):
- New `--repeats N` flag with median-of-N reporting per case.
  Default N=1 preserves the pre-#144 behavior.
- New `--benchmark-action-output PATH` flag that writes a flat JSON
  array in the customSmallerIsBetter schema the action consumes
  directly (no jq transform in the workflow).
- Variance check on 5 consecutive runs (steps=50, llm=10ms): naive
  delta < 1%, compiled delta < 10%, both well inside the 25% headroom
  that the 125% threshold allows.
- CI invocation uses `--steps 50 --llm-ms 10 --tool-ms 0 --repeats 5`
  (~2.7 s wall-clock; signal well above runner noise floor).
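
A minimal sketch of the two new flags, assuming the action's custom-tool input is a flat JSON array of objects with `name`, `unit`, and `value` keys. The workloads, the case names, and the unit string `"s"` are placeholders, and this is not the PR's bench script.

```python
import argparse
import json
import time
from statistics import median


def time_case(fn, repeats):
    """Time `fn` `repeats` times and return the median in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def to_benchmark_entries(results):
    """Map {case name: seconds} to a flat list of name/unit/value objects."""
    return [
        {"name": name, "unit": "s", "value": seconds}
        for name, seconds in sorted(results.items())
    ]


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--repeats", type=int, default=1)
    parser.add_argument("--benchmark-action-output", metavar="PATH")
    args = parser.parse_args(argv)

    # Placeholder workloads standing in for the naive and compiled cases.
    results = {
        "naive": time_case(lambda: sum(i for i in range(10_000)), args.repeats),
        "compiled": time_case(lambda: sum(range(10_000)), args.repeats),
    }
    entries = to_benchmark_entries(results)
    if args.benchmark_action_output:
        with open(args.benchmark_action_output, "w", encoding="utf-8") as fh:
            json.dump(entries, fh, indent=2)
    return entries
```

Writing the array in the consumed shape directly is what lets the workflow skip a separate transform step.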

`benchmarks/baseline.json` is checked in as a local sanity reference,
not as the CI comparison source (gh-pages owns that). Regenerated only
when an intentional perf change lands.

`benchmarks/README.md` documents:
- The CI failure semantics (when/why the guard fails).
- OS pinning rationale (no macOS / Windows matrix).
- The one-off `gh-pages` initialization step (maintainer-only).
- The local refresh-baseline workflow.

`AGENTS.md` mentions the new bench gate in the CI section.

NOTE: `gh-pages` branch initialization is a one-time maintainer
bootstrap step that this PR cannot perform on its own; see
`benchmarks/README.md` § "One-off gh-pages initialization" for the
exact commands. The first `bench.yml` run on `main` after that will
seed the initial benchmark dataset.

Closes #144

Copilot AI review requested due to automatic review settings May 16, 2026 17:15

Copilot AI left a comment

Pull request overview

Adds local validation infrastructure, but the diff also expands CI/test coverage with public-API snapshots, Hypothesis property tests, and a benchmark regression workflow.

Changes:

  • Adds .pre-commit-config.yaml and contributor docs for local hooks.
  • Adds public API snapshot generation/testing and property-based executor/serialization tests.
  • Adds benchmark median reporting, baseline docs/data, and a new benchmark CI workflow.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 23 comments.

| File | Description |
| --- | --- |
| .pre-commit-config.yaml | Defines local lint/type/security/workflow hooks. |
| .github/copilot-instructions.md | Links to pre-commit contributor docs. |
| .github/workflows/ci.yml | Adds a dedicated property-test CI step. |
| .github/workflows/bench.yml | Adds benchmark regression workflow. |
| AGENTS.md | Documents benchmark workflow behavior. |
| CONTRIBUTING.md | Documents pre-commit installation and policy. |
| pyproject.toml | Adds test/dev dependencies and pytest config. |
| docs/versioning-policy.md | Documents public API snapshot guard. |
| docs/agent-context/workflows.md | Documents property-test conventions. |
| docs/agent-context/review-checklist.md | Adds public API snapshot review item. |
| tests/api_surface.py | Adds public API introspection helper. |
| tests/test_public_api_snapshot.py | Adds snapshot tests for exported API surface. |
| tests/fixtures/public_api.json | Adds generated public API fixture. |
| tests/scripts/__init__.py | Adds scripts package marker. |
| tests/scripts/regen_public_api.py | Adds fixture regeneration script. |
| tests/property/strategies.py | Adds Hypothesis strategies and flow builders. |
| tests/property/test_dag_equivalence.py | Adds linear/DAG equivalence property test. |
| tests/property/test_idempotence.py | Adds executor idempotence property tests. |
| tests/property/test_roundtrip.py | Adds YAML/JSON round-trip property tests. |
| benchmarks/bench_naive_vs_compiled.py | Adds repeats/median reporting and benchmark-action output. |
| benchmarks/baseline.json | Adds sample benchmark baseline data. |
| benchmarks/README.md | Documents benchmark CI guard and baseline workflow. |

Comment thread .pre-commit-config.yaml
Comment on lines +3 to +4
# These hooks mirror AGENTS.md §7 "Validation commands" exactly so the
# local gate matches CI. Run `pre-commit install` once after cloning,
Comment thread .pre-commit-config.yaml
- pydantic>=2.0
- tenacity>=8.0
- typer>=0.9
- types-pyyaml>=6.0
@@ -0,0 +1,69 @@
name: Bench
Comment thread .github/workflows/ci.yml
Comment on lines +53 to +55
    - name: Property tests (Hypothesis seed reporting)
      if: ${{ matrix.os == 'ubuntu-latest' && matrix.python-version == '3.10' }}
      run: python -m pytest tests/ -m property --no-cov --hypothesis-show-statistics -v
Comment thread tests/api_surface.py
Comment on lines +63 to +68
    defaults) and public method signatures.
    """
    attributes: dict[str, dict[str, Any]] = {}
    methods: dict[str, dict[str, Any]] = {}
    for member_name in sorted(cls.members):
        if member_name.startswith("_"):
Comment thread CONTRIBUTING.md

---

## Pre-commit hooks
Comment thread .pre-commit-config.yaml
- id: ruff
name: ruff check (chainweaver/ tests/ examples/)
args:
- check
tool: customSmallerIsBetter
output-file-path: bench-result.json
github-token: ${{ secrets.GITHUB_TOKEN }}
auto-push: true
Comment thread .pre-commit-config.yaml
Comment on lines +29 to +47
    # ---------------------------------------------------------------
    - repo: https://github.com/astral-sh/ruff-pre-commit
      rev: v0.8.6
      hooks:
        - id: ruff
          name: ruff check (chainweaver/ tests/ examples/)
          args:
            - check
            - chainweaver/
            - tests/
            - examples/
          pass_filenames: false
        - id: ruff-format
          name: ruff format --check (chainweaver/ tests/ examples/)
          args:
            - --check
            - chainweaver/
            - tests/
            - examples/
Comment thread .pre-commit-config.yaml
Comment on lines +53 to +66
    - repo: https://github.com/pre-commit/mirrors-mypy
      rev: v1.13.0
      hooks:
        - id: mypy
          name: python -m mypy chainweaver/ tests/
          args:
            - chainweaver/
            - tests/
          pass_filenames: false
          additional_dependencies:
            - pydantic>=2.0
            - tenacity>=8.0
            - typer>=0.9
            - types-pyyaml>=6.0

Development

Successfully merging this pull request may close these issues.

Add .pre-commit-config.yaml mirroring the four validation commands

3 participants