# ReasonErrorBench

A benchmark and taxonomy for analyzing reasoning errors in Large Language Models.


## Key Findings

1. Incorrect responses are 53% longer than correct ones (Cohen's d = 0.91, p < 0.001).
2. Error types are domain-specific: Math → Computational, Commonsense → Knowledge, Code → Strategy.
3. Scaling helps but doesn't solve everything: 70B models eliminate some error types but increase others.

## Quick Stats

| Metric | Value |
|---|---|
| Total Problems | 200 |
| Model Responses | 400 |
| Annotated Errors | 52 |
| Error Categories | 6 |
| Error Types | 15 |

## Installation

```shell
pip install -r requirements.txt
```

## Usage

### Run Full Pipeline

```shell
# 1. Generate problems
python src/generate_problems.py --output data/problems.json

# 2. Collect responses (requires API key)
export GROQ_API_KEY=your_key_here
python src/collect_data.py --problems data/problems.json --output data/traces_raw.json

# 3. Annotate errors
python src/annotate.py --input data/traces_raw.json --output data/annotations_human.json --only-incorrect

# 4. Analyze results
python src/analyze.py --traces data/traces_raw.json --annotations data/annotations_human.json --output results/

# 5. Generate figures
python src/make_figures.py --results results/analysis_results.json --output figures/
```
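As a minimal sketch of what step 3's `--only-incorrect` flag implies, the filtering might look like the snippet below. The `correct` and `response` field names are assumptions for illustration, not the repo's documented trace schema; check `data/traces_raw.json` for the real keys.

```python
def filter_incorrect(traces):
    """Keep only traces marked incorrect, mirroring annotate.py's --only-incorrect flag."""
    return [t for t in traces if not t.get("correct", False)]

# Toy traces; field names are assumptions, not the repo's actual schema.
traces = [
    {"id": 1, "correct": True, "response": "..."},
    {"id": 2, "correct": False, "response": "..."},
]
print(len(filter_incorrect(traces)))  # prints 1
```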

## Error Taxonomy

| Category | Code | Description |
|---|---|---|
| Computational | COMP | Math/calculation errors |
| Knowledge | KNOW | Wrong facts or formulas |
| Logical | LOGIC | Invalid reasoning |
| Comprehension | COMPR | Misunderstanding problems |
| Strategy | STRAT | Wrong approach |
| Output | OUT | Formatting issues |
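For programmatic use, the six category codes above could be represented as an enum. This is an illustrative sketch, not a class the repo ships:

```python
from enum import Enum

class ErrorCategory(str, Enum):
    """Illustrative enum of the six taxonomy categories (not part of the repo's API)."""
    COMPUTATIONAL = "COMP"   # math/calculation errors
    KNOWLEDGE = "KNOW"       # wrong facts or formulas
    LOGICAL = "LOGIC"        # invalid reasoning
    COMPREHENSION = "COMPR"  # misunderstanding problems
    STRATEGY = "STRAT"       # wrong approach
    OUTPUT = "OUT"           # formatting issues

# Look up a category by its short code.
print(ErrorCategory("COMP").name)  # prints COMPUTATIONAL
```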

## Results

### Model Accuracy by Domain

| Model | Math | Logic | Common | Code | Overall |
|---|---|---|---|---|---|
| LLaMA-8B | 50% | 30% | 22% | 2% | 26% |
| LLaMA-70B | 68% | 42% | 22% | 6% | 34.5% |
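Assuming the four domains are equally sized (50 problems each out of 200, which is an assumption here, not a stated fact), the Overall column is the unweighted mean of the per-domain accuracies:

```python
# Per-domain accuracies transcribed from the table above, as fractions.
accuracy = {
    "LLaMA-8B":  {"Math": 0.50, "Logic": 0.30, "Common": 0.22, "Code": 0.02},
    "LLaMA-70B": {"Math": 0.68, "Logic": 0.42, "Common": 0.22, "Code": 0.06},
}

for model, domains in accuracy.items():
    # Unweighted mean reproduces the Overall column under the equal-size assumption.
    overall = sum(domains.values()) / len(domains)
    print(f"{model}: {overall:.1%}")  # prints 26.0% and 34.5%
```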

### Response Length Effect

*(Figure: response length for correct vs. incorrect responses.)*

Incorrect responses average 353 words vs. 231 for correct ones (p < 0.001).
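The reported effect size (d = 0.91 in Key Findings) is Cohen's d. A minimal sketch of the pooled-standard-deviation formula, run on toy word counts rather than the benchmark data:

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: difference of means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Toy samples for illustration only -- not the benchmark's word counts.
incorrect = [340, 360, 355, 350, 370]
correct = [225, 240, 230, 228, 232]
print(round(cohens_d(incorrect, correct), 2))
```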

## Citation

```bibtex
@article{reasonerrorbench2026,
  title={ReasonErrorBench: A Taxonomy-Driven Analysis of Reasoning Errors in Large Language Models},
  author={Tate Lyman},
  journal={arXiv preprint arXiv:2601.XXXXX},
  year={2026}
}
```

## License

MIT License; see the LICENSE file.
