A spec-driven checklist generation engine with deterministic output, quality evaluation, and a self-learning improvement loop.
```shell
# Install dependencies (requires Python ≥ 3.9)
poetry install

# Run all tests
pytest -q

# Or use the test wrapper (writes summary to artifacts/)
./scripts/run_tests.sh
```

CI workflows live under .github/workflows/:
| Workflow | Purpose |
|---|---|
| spec-pipeline.yml | Full test suite + spec determinism checks |
| selfbuild.yml | Self-host dry-run against the canonical spec |
| reproducibility.yml | Determinism regression guard |
| release-smoke.yml | Release smoke tests |
```shell
# Run all tests
pytest -q

# Run a specific test file
pytest tests/spec/test_schema_compliance.py -v

# Run just the learning-trial generated tests
pytest tests/generated/ -v

# Run fuzz property tests (500 random specs)
pytest tests/property/ -v

# Run with test wrapper (writes summary to artifacts/)
./scripts/run_tests.sh
```

| Layer | Module | Responsibility |
|---|---|---|
| Spec DSL | dsl/ | Canonical JSON spec loader and schema validator |
| AST | services/ast/ | Builds abstract syntax tree from loaded specs |
| Checklist | services/checklist/ | Generates task checklists with lineage tracking |
| Quality | checklist/quality.py | Evaluates VALID/INVALID per item; enforces imperative-verb and normative-keyword rules |
| Guidance | services/guidance/ | Enriches items with confidence, evidence, and action fields |
| Completion | requirements/completion.py | Measures per-requirement coverage via token overlap |
| Sufficiency | sufficiency/evaluator.py | Rolls up completeness, quality, and readiness grade |
| Interpreter | interpreter/ | Prose-text fallback: extracts checklist items from free-form text |
| Code Generators | generators/ | Template-based emitters for Python/TS/Go/Rust/Flask/FastAPI/Next.js/Svelte |
| Learning loop | learning/ | Gap analysis, improvement-prompt formatter, dynamic test generator |
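To illustrate the Quality and Completion rules from the table above, here is a minimal sketch. The function names, verb/keyword lists, and scoring details are hypothetical and do not reflect the actual `checklist/quality.py` or `requirements/completion.py` API:

```python
import re

# Hypothetical sketch of the two scoring rules; the real quality.py /
# completion.py implementations may differ substantially.
NORMATIVE = {"must", "shall", "should", "may"}
IMPERATIVE_VERBS = {"add", "validate", "ensure", "implement", "return", "emit", "check"}

def item_is_valid(item: str) -> bool:
    """VALID iff the item starts with an imperative verb and contains a normative keyword."""
    words = re.findall(r"[a-z]+", item.lower())
    if not words:
        return False
    return words[0] in IMPERATIVE_VERBS and any(w in NORMATIVE for w in words)

def coverage(requirement: str, checklist_items: list[str]) -> float:
    """Token-overlap coverage: fraction of requirement tokens found in any checklist item."""
    req_tokens = set(re.findall(r"[a-z]+", requirement.lower()))
    item_tokens: set[str] = set()
    for item in checklist_items:
        item_tokens |= set(re.findall(r"[a-z]+", item.lower()))
    return len(req_tokens & item_tokens) / len(req_tokens) if req_tokens else 1.0
```

Both rules are pure functions of their inputs, which is what makes the engine's output deterministic and easy to regression-test.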
The engine ships a self-learning evaluation pipeline that:
- Generates a parametric corpus of specs at four maturity tiers (prose → structured → partial DSL → full DSL)
- Runs the engine over each spec and evaluates checklist quality + requirement completeness
- Classifies gaps by source module and writes executable failing tests as the improvement backlog
- Optionally posts gap prompts to an LLM endpoint (`--ai-mode`)
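The iteration described by the steps above could be sketched roughly as follows; `run_learning_iteration` and the callables it takes are hypothetical stand-ins for the real `scripts/run_learning_trial.py` machinery:

```python
from typing import Callable, Iterable

# Hypothetical sketch of one learning iteration; the engine is passed in as
# plain callables, unlike the real script's interface.
def run_learning_iteration(
    specs: Iterable[dict],
    generate: Callable[[dict], list[str]],
    score_quality: Callable[[list[str]], float],
    score_completeness: Callable[[dict, list[str]], float],
    threshold: float = 0.9,
) -> list[dict]:
    """Evaluate each spec; collect gaps classified by the lower-scoring module."""
    gaps = []
    for spec in specs:
        items = generate(spec)
        q = score_quality(items)
        c = score_completeness(spec, items)
        if min(q, c) < threshold:
            gaps.append({
                "spec_id": spec.get("id"),
                "module": "checklist/quality" if q <= c else "requirements/completion",
                "scores": {"quality": q, "completeness": c},
            })
    return gaps
```

Each returned gap record is what the real pipeline turns into an executable failing test, so the backlog is always verifiable by rerunning pytest.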
```shell
# Generate the spec corpus (32 fixed variants)
python scripts/spec_synthesizer.py --out spec_trials/generated

# Run one learning iteration
SHIELDCRAFT_SELFBUILD_ALLOW_DIRTY=1 python scripts/run_learning_trial.py spec_trials/generated

# Fuzz evaluation: 500 random specs, generate property tests
SHIELDCRAFT_SELFBUILD_ALLOW_DIRTY=1 python scripts/run_fuzz_suite.py --n 500 --gen-tests
```

Current status: converged — corpus_score=1.000, 32/32 specs met, 2000/2000 random fuzz specs pass.
See docs/LEARNING_STRATEGY.md for full documentation.
See docs/ for detailed documentation. Key files:
- docs/LEARNING_STRATEGY.md — self-learning loop design and usage
- docs/engine/ENGINE_CONTRACT.md — engine output guarantees
- docs/CHANGELOG.md — version history
- docs/governance/ — governance contracts and decision log