A benchmark and taxonomy for analyzing reasoning errors in Large Language Models.
- Incorrect responses are 53% longer than correct ones (Cohen's d=0.91, p<0.001)
- Error types are domain-specific: Math→Computational, Commonsense→Knowledge, Code→Strategy
- Scaling helps but doesn't solve everything: 70B models eliminate some error types but increase others
| Metric | Value |
|---|---|
| Total Problems | 200 |
| Model Responses | 400 |
| Annotated Errors | 52 |
| Error Categories | 6 |
| Error Types | 15 |
```bash
pip install -r requirements.txt

# 1. Generate problems
python src/generate_problems.py --output data/problems.json

# 2. Collect responses (requires API key)
export GROQ_API_KEY=your_key_here
python src/collect_data.py --problems data/problems.json --output data/traces_raw.json

# 3. Annotate errors
python src/annotate.py --input data/traces_raw.json --output data/annotations_human.json --only-incorrect

# 4. Analyze results
python src/analyze.py --traces data/traces_raw.json --annotations data/annotations_human.json --output results/

# 5. Generate figures
python src/make_figures.py --results results/analysis_results.json --output figures/
```

| Category | Code | Description |
|---|---|---|
| Computational | COMP | Math/calculation errors |
| Knowledge | KNOW | Wrong facts or formulas |
| Logical | LOGIC | Invalid reasoning |
| Comprehension | COMPR | Misunderstanding problems |
| Strategy | STRAT | Wrong approach |
| Output | OUT | Formatting issues |
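For downstream tooling, the six category codes above map naturally onto a small enum. This is an illustrative sketch, not the repo's actual schema (the class and member names are assumptions):

```python
from enum import Enum

class ErrorCategory(Enum):
    """Top-level error categories from the taxonomy, keyed by annotation code."""
    COMP = "Computational"   # math/calculation errors
    KNOW = "Knowledge"       # wrong facts or formulas
    LOGIC = "Logical"        # invalid reasoning
    COMPR = "Comprehension"  # misunderstanding problems
    STRAT = "Strategy"       # wrong approach
    OUT = "Output"           # formatting issues

# Look up a category by the short code used in annotation files:
print(ErrorCategory["KNOW"].value)  # → Knowledge
```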
| Model | Math | Logic | Commonsense | Code | Overall |
|---|---|---|---|---|---|
| LLaMA-8B | 50% | 30% | 22% | 2% | 26% |
| LLaMA-70B | 68% | 42% | 22% | 6% | 34.5% |
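The Overall column is consistent with an unweighted mean of the four domain accuracies (i.e., equal problem counts per domain, 50 each of the 200). A quick check:

```python
def overall_accuracy(domain_acc):
    """Unweighted mean across domains, assuming equal problem counts per domain."""
    return sum(domain_acc) / len(domain_acc)

llama_8b = [50, 30, 22, 2]    # Math, Logic, Commonsense, Code (%)
llama_70b = [68, 42, 22, 6]

print(overall_accuracy(llama_8b))   # → 26.0
print(overall_accuracy(llama_70b))  # → 34.5
```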
Incorrect responses average 353 words versus 231 for correct ones (p<0.001).
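For reference, the Cohen's d reported above is the mean difference divided by the pooled standard deviation. The sketch below reproduces d ≈ 0.91 from the reported means; the standard deviations are illustrative assumptions (the README reports only means), and the group sizes follow from the accuracy table (121 correct, 279 incorrect of 400 responses):

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Means from the README; SDs (~134 words) are assumed for illustration.
d = cohens_d(353, 231, 134, 134, 279, 121)
print(round(d, 2))  # → 0.91
```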
```bibtex
@article{reasonerrorbench2026,
  title={ReasonErrorBench: A Taxonomy-Driven Analysis of Reasoning Errors in Large Language Models},
  author={Tate Lyman},
  journal={arXiv preprint arXiv:2601.XXXXX},
  year={2026}
}
```

MIT License - see LICENSE file.
