Byzantine-fault-tolerant finality control for LLM-agent collaboration.
Reference implementation, benchmarks, and reproduction scripts for the paper:
Hierarchical Certified Semantic Commitment: Typed Finality Control for Byzantine-Robust LLM-Agent Collaboration. Haoran Xu. arXiv:XXXX.XXXXX (2026). (update with the arXiv ID once live)
Given n LLM agents of which up to f may be Byzantine, and a round of structured
natural-language proposals, H-CSC decides what kind of finality the round supports
and emits exactly one typed, certificate-backed outcome:
- semantic_commit: a 2f+1 within-verdict semantic core backs the verdict; the protocol emits a parameter-bound digest over the quantised embedding aggregate.
- verdict_commit: the verdict has a 2f+1 quorum and a margin, but the semantic rationale is dispersed; a verdict-level certificate is emitted with no semantic aggregate.
- abort: neither finality signal is admissible; an explicit typed reason is returned.
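The three outcomes can be read as a small decision rule. The following is a minimal sketch of that rule, not the code in fba/: the function name `decide_finality`, the `min_margin` parameter, the abort reason strings, and the assumption that the semantic core size is computed elsewhere are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Outcome:
    kind: str                # "semantic_commit", "verdict_commit", or "abort"
    verdict: Optional[str]   # winning verdict, None on abort
    reason: Optional[str]    # typed abort reason, None on commit


def decide_finality(verdicts: Sequence[str], f: int, core_size: int,
                    min_margin: int = 1) -> Outcome:
    """Pick exactly one typed outcome for a round (illustrative only).

    verdicts   : one verdict label per responding agent
    f          : assumed bound on the number of Byzantine agents
    core_size  : size of the within-verdict semantic core (computed elsewhere)
    min_margin : hypothetical lead required over the runner-up verdict
    """
    quorum = 2 * f + 1
    ranked = Counter(verdicts).most_common()
    if not ranked:
        return Outcome("abort", None, "no_proposals")
    top, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top_count < quorum:
        return Outcome("abort", None, "no_verdict_quorum")
    if core_size >= quorum:
        # A 2f+1 semantic core backs the verdict.
        return Outcome("semantic_commit", top, None)
    if top_count - runner_up >= min_margin:
        # Quorum and margin hold, but the rationale is dispersed.
        return Outcome("verdict_commit", top, None)
    return Outcome("abort", None, "insufficient_margin")
```

With n = 4 and f = 1, three matching verdicts and a core of size 3 yield semantic_commit in this sketch; the same verdicts with a core of size 2 fall back to verdict_commit.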
Every outcome carries the same 2f+1 distinct-signer certificate envelope; only the
underlying digest differs. The contribution is typed finality, not raw commit accuracy.
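To illustrate the shared envelope, here is a hedged sketch of a 2f+1 distinct-signer check. It uses HMAC with per-agent keys purely as a stand-in for real signatures; the function names, the key registry, and the byte layout of the signed message are assumptions and do not describe the repository's certificate format.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

Signature = Tuple[str, bytes]  # (signer id, MAC over outcome type and digest)


def sign(key: bytes, outcome_type: str, digest: bytes) -> bytes:
    # Bind the tag to the outcome type as well as the digest, so a tag for
    # one outcome type cannot be replayed under another.
    return hmac.new(key, outcome_type.encode() + b"|" + digest, hashlib.sha256).digest()


def certificate_is_valid(outcome_type: str, digest: bytes,
                         signatures: List[Signature],
                         keys: Dict[str, bytes], f: int) -> bool:
    """True if at least 2f+1 distinct known signers produced a valid tag."""
    valid_signers = set()
    for signer, mac in signatures:
        key = keys.get(signer)
        if key is None:
            continue  # unknown signer: ignore
        if hmac.compare_digest(mac, sign(key, outcome_type, digest)):
            valid_signers.add(signer)  # repeated signers collapse in the set
    return len(valid_signers) >= 2 * f + 1
```

Because valid signers are collected in a set, a single agent submitting several signatures counts once, which is the property the 2f+1 threshold relies on.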
hcsc-release/
├── fba/ # Core H-CSC protocol (typed commitment, certificates,
│ # semantic core, geometric-median aggregation, digest)
├── bench/ # MVR-50 real-agent benchmark engine (Climate-FEVER)
│ ├── scripts/ # honest/Byzantine generation, commitment, analysis
│ ├── scripts/coverage_recovery/ # baselines B0–B3 + bootstrap CIs + figures
│ ├── configs/ prompts/ examples/ # run configs, attack/agent prompts, schema samples
│ └── data/ # frozen tasks/views + frozen LLM proposals (for cached repro)
├── experiments_commitment/ # BCS_v1 controlled-diagnostic runners
├── experiments_corrected/ # corrected-pipeline audits + topology design-space ablation
├── tests_commitment/ tests_corrected/ tests_bench/ # test suites (all CPU, offline)
├── scripts/ # one-command reproduction scripts
└── data/ # download_artifacts.py → fetch CRSE checkpoint + datasets
The protocol core (fba/) and benchmark (bench/) form one importable workspace:
run everything with the repo root on PYTHONPATH (the scripts set PYTHONPATH=.).
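The layout above lists geometric-median aggregation and a digest among the components of fba/. As a rough illustration of how those two pieces might fit together, the sketch below computes a geometric median with the Weiszfeld iteration and hashes its quantised coordinates. The function names, the quantisation step, and the `context` bytes used to bind parameters into the hash are assumptions; the actual encoding in fba/ may differ.

```python
import hashlib
import math
import struct
from typing import List, Sequence


def geometric_median(points: Sequence[Sequence[float]], iters: int = 200,
                     eps: float = 1e-9) -> List[float]:
    """Weiszfeld iteration, started from the coordinate-wise mean."""
    dim = len(points[0])
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; clamp to avoid division by zero.
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[d] for w, p in zip(weights, points)) / total
               for d in range(dim)]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


def quantised_digest(vec: Sequence[float], step: float, context: bytes) -> str:
    """SHA-256 over context bytes, the step, and the integer grid cells."""
    cells = [round(x / step) for x in vec]
    payload = (context + struct.pack("<d", step)
               + struct.pack(f"<{len(cells)}q", *cells))
    return hashlib.sha256(payload).hexdigest()
```

Quantising before hashing means that aggregates differing only by less than the grid resolution can map to the same digest, while a change of step or context bytes changes it. The geometric median is used here because a minority of outlying points moves it far less than it moves the mean.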
git clone https://github.com/HrxuAlbert/H-CSC.git
cd H-CSC
pip install -r requirements.txt
CPU-only is sufficient for the headline reproduction below.
The headline MVR-50 results are reproducible from cached embeddings shipped in
bench/data/results/coverage_recovery/embeddings_cache.npz — no model download, no API
spend. This regenerates the per-task outcomes, bootstrap CIs, and the trade-off figure:
export PYTHONPATH=.
python3 -m bench.scripts.coverage_recovery.run_coverage_recovery_variants
python3 -m bench.scripts.coverage_recovery.bootstrap_coverage_recovery_ci
python3 -m bench.scripts.coverage_recovery.plot_hcsc_tradeoff # writes the forest plot PDF to a local (gitignored) output dir
Run the test suites (all offline, CPU):
PYTHONPATH=. python3 tests_commitment/run_tests_commitment.py # 57 protocol tests
PYTHONPATH=. python3 tests_bench/run_tests_bench.py # benchmark tests
To reproduce the exact paper numbers (CRSE embeddings, not the base-encoder fallback) and to re-run the commitment pipeline over the frozen proposals, fetch the externally-hosted artifacts first:
python3 data/download_artifacts.py # CRSE checkpoint + BCS_v1 / dataset files
bash scripts/repro_mvr50_from_cache.sh # CPU-only; re-runs commitment from frozen proposals
bash scripts/repro_bcs_only.sh # BCS_v1 controlled diagnostic
See data/download_artifacts.py for the externally-hosted artifacts and their URLs.
export OPENAI_API_KEY=... # and/or ANTHROPIC_API_KEY / OPENROUTER_API_KEY
# edit a config under bench/configs/ (set api_calls_enabled: true), then:
python3 -m bench.scripts.run_mvr_variant --config bench/configs/real_agent_mvr50.yaml
Generation is cache-first: existing responses are never re-requested. See
bench/api_cost_plan.md for the cost model.
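The cache-first behaviour can be pictured with a short sketch. This is a generic illustration, not the benchmark engine's code: the function `cached_generate`, the one-JSON-file-per-request layout, and the hashing of model and prompt into a cache key are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_generate(cache_dir: Path, model: str, prompt: str,
                    call_api: Callable[[str, str], str]) -> str:
    """Return a cached response if one exists; otherwise call the API once and store it."""
    # Hypothetical cache key: hash of the (model, prompt) pair.
    key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Cache hit: no request is issued.
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_api(model, prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"model": model, "prompt": prompt, "response": response}
    path.write_text(json.dumps(record), encoding="utf-8")
    return response
```

Under this scheme a re-run over the same prompts issues no new requests, so the cost of a run depends only on the prompts that are not yet cached.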
The GitHub repository hosts only the code and the small frozen inputs needed to run and reproduce the benchmark.
The CRSE checkpoint (~419 MB) and the BCS_v1 / Climate-FEVER source datasets (multi-GB)
are hosted externally; data/download_artifacts.py fetches them into the paths the code
expects (Colab_trained_Model/500pt_best_model.pt, data/). Set the URLs at the top of
that script once the Zenodo/Hugging Face records are minted.
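Before starting a full reproduction it can help to confirm that the expected files are in place. The sketch below is a hypothetical preflight helper, not part of the repository; it checks only the two paths this README names explicitly, and `missing_artifacts` is an illustrative name.

```python
from pathlib import Path
from typing import List, Sequence

# Paths named in this README; adjust if data/download_artifacts.py differs.
EXPECTED: Sequence[str] = (
    "Colab_trained_Model/500pt_best_model.pt",  # CRSE checkpoint (downloaded)
    "bench/data/results/coverage_recovery/embeddings_cache.npz",  # shipped cache
)


def missing_artifacts(repo_root: Path,
                      expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected relative paths that do not exist under repo_root."""
    return [rel for rel in expected if not (repo_root / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts(Path("."))
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All expected artifacts found.")
```

Run from the repo root, this reports which of the listed files still need to be fetched or regenerated.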
@article{xu2026hcsc,
title = {Hierarchical Certified Semantic Commitment: Typed Finality Control
for Byzantine-Robust LLM-Agent Collaboration},
author = {Xu, Haoran},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}
License: MIT (see LICENSE).