Skip to content

Starscream-11813/awesome-AI4Math

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

🧮 AI for Mathematical Reasoning

An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Awesome arXiv Citation License Last updated Papers Visitors

A companion resource hub for the survey paper. This page indexes every system, dataset, benchmark, and methodology the survey discusses, organised by the same four-axis taxonomy: informal text-only · multimodal · formal · discovery.


TL;DR

What began in the mid-1960s as brittle pattern-matching programs for templated arithmetic has — within the past four years — expanded into a landscape of AI systems that solve Olympiad-level problems, formally verify mathematical arguments in Lean 4, and contribute to the resolution of open problems posed by Paul Erdős. This survey provides an integrated account of that arc, connecting the classical MWP lineage, multimodal geometry, formal theorem proving, and verified mathematical discovery through the reasoning-model era of 2024–2026.

Three things distinguish this survey:

  1. Four-axis taxonomy — informal text, multimodal, formal, discovery — covered as co-equal axes rather than as a single LLM-centric thread.
  2. Supervision-ladder framework — a unified reading of the field's progression from hand-coded schemata through formal proof-assistant kernels as a sequence of increasingly informative external verifiers.
  3. Comprehension–Generation–Verification (CGV) triad — a refinement of recent two-stage LLM-centric framings, promoting verification from an optional post-hoc filter to a co-equal training-time component.

Contents


Visual Overview

Paradigm chronology — seven swimlanes from STUDENT (1965) to verified discovery (2026):

Chronology of AI systems for mathematical reasoning

Figure: Chronology of AI systems for mathematical reasoning, organised as paradigm swimlanes. Early symbolic and statistical MWP solvers (1965–2016, compressed axis) yield to neural expression generators (2017–2020), prompted and tool-augmented LLMs (2021–2023), and the 2024–2026 convergence of RLVR-trained reasoning models, Lean-based proof assistants, multimodal geometers, and verified-discovery systems. Stars mark IMO-medal milestones.

Methodology funnel — 1,847 candidate records → 238 included:

Methodology Sankey diagram

Benchmark Saturation, 2021–2026

Benchmark saturation trends

Three patterns: (i) GSM8K and MATH have effectively saturated — both clear 95% by 2025; (ii) AIME 2024, FrontierMath, and PutnamBench show a discontinuity at the onset of the reasoning-model era; (iii) formal proving has moved fastest in relative terms — MiniF2F-test rose from ~25% (2022) to 93% (2026); PutnamBench from near-zero to over 60% in just over a year.

Benchmark Pre-DL Seq2Tree LLM era Reasoning era
MAWPS ~60% 92.0% >95% >99%
GSM8K 55% 94.6% (GPT-4o) 97.3% (Kimi K2)
MATH-500 6.9% (GPT-3) 52% (GPT-4) 99.2% (LongCat)
AIME 2024 12% (GPT-4o) 91.6% (o3)
AIME 2025 94.6% (SU-01 / GPT-5.5 H)
GPQA-D 53.6% (GPT-4o) 94.6% (Claude Mythos)
FrontierMath <2% (GPT-4o) 48% *(co-math. T4)*ᶧ
MiniF2F-test 50% (DSP-V1) 99.6% (AlphaProof + TTRL)
PutnamBench 86/658 (Goedel-V2-32B)
IMO-ProofBench 80.7% (GPT-5.5 H)

ᶧ Tier-4 (research grade); other rows are Tier 1–3 average.


The Four Axes

Axis Defining task Representative artefact Verifier
Informal text-only MWPs, competition QA Numeric answer, CoT trace Exact-match / answer parser
Multimodal & geometry Diagram + text problems Construction, proof sketch Symbolic geometry solver, MLLM grader
Formal proving Theorems in Lean / Coq / Isabelle Tactic script, proof term Proof-assistant kernel
Mathematical discovery Open problems, improved bounds Program, construction, verified theorem Domain evaluator + Lean kernel + expert audit

Section I — Math Word Problem Solving

Rule-based & Statistical

Tree-based & Semantic Parsing

Neural MWP (Seq2Seq, GTS, Graph2Tree)

Robustness & Probing Benchmarks


Section II — The LLM and Reasoning-Model Era

This section corresponds to the LLMs & Agents and Reasoning & Verification bands of the chronology figure.

Prompting-Era Innovations

Tool-Integrated Reasoning

Self-Improvement & Bootstrapping

Multi-Agent & Agentic Reasoning

Math-Specialized Foundation Models

Process / Outcome Reward Models

Reasoning Models (o-series, R1, Kimi k1.5, SU-01, …)

RL Algorithms for Reasoning


Section III — Multimodal & Geometry

Classical Geometry Solvers

Neuro-symbolic & Neural GPS

Olympiad Geometry (AlphaGeometry family, TongGeometry)

Vision–Language Models for Math


Section IV — Formal Theorem Proving

Tactic Prediction & Search

Retrieval-Augmented Proving

Whole-Proof Generation

Expert Iteration + RL Provers

Autoformalisation & Library Building

Compiler-Guided Repair


Section V — Mathematical Discovery & Open Problems

Program Search

Erdős Workflow & AI-Assisted Solves

Research Workbench Mode


Section VI — Datasets & Benchmarks

Classical MWP

Dataset Size Lang. Year Paper
AI2 (AddSub) 395 EN 2014 PDF
SingleEQ 508 EN 2015 PDF
Alg514 514 EN 2014 PDF
Dolphin18K 18,460 EN 2016 PDF
MAWPS 3,320 EN 2016 PDF
ASDiv 2,305 EN 2020 PDF
AQuA 97,975 EN 2017 PDF
Math23K 23,162 ZH 2017 PDF
Ape210K 210,488 ZH 2020 PDF
MathQA 37,297 EN 2019 PDF
SVAMP 1,000 EN 2021 PDF
ParaMAWPS 16,278 EN 2023 PDF
GSM-Symbolic 5,000 EN 2024 PDF

Competition & Olympiad

Dataset Size Year Paper
GSM8K 8,792 2021 PDF
MATH 12,500 2021 PDF
OlympiadBench 8,476 2024 PDF
Omni-MATH 4,428 2024 PDF
FrontierMath ~350 2024 PDF
MathArena 149+/yr (live) 2025 PDF
IMO-ProofBench 60 2025 PDF

Multilingual Math

Dataset Coverage Year Paper
MGSM 10 langs · 2.5K 2022 PDF
HAWP Hindi · 2.3K 2022 PDF
ArMATH Arabic · 6K 2022 PDF
CMATH Chinese · 1.7K 2023 PDF
MathOctopus 10 langs · 73.6K 2023 PDF
HRM8K Korean–English · 8K 2025 PDF
PatiGonit Bengali · 10K 2025 PDF
BMWP Bengali · 8.7K 2025 PDF
PolyMath 18 langs · ~9K 2025 paper
MathMist 13 langs · ~30K 2026 PDF
M3Kang 108 langs · 108K variants 2026 PDF

Tabular Math

  • TabMWP — Lu et al. ICLR, 2023. — 38,431 grade-school problems with tables.
  • FinQA — Chen et al. EMNLP, 2021. — 8,281 financial QA.
  • TAT-QA — Zhu et al. ACL, 2021. — 16,552 hybrid table+text questions.
  • MultiHiertt — Zhao et al. ACL, 2022. — 10,440 multi-hierarchy tables.
  • MultiTabQA — Pal et al. ACL, 2023.
  • Chameleon — Lu et al. NeurIPS, 2024.

Geometry & Multimodal

Dataset Size Year Paper
GEOS 186 2015 PDF
GEOS++ 1,406 2017 PDF
GeoShader 102 2017 PDF
GEOS-OS 2,235 2017 PDF
Geometry3K 3,002 2021 PDF
GeoQA 4,998 2021 PDF
GeoQA+ 12,054 2022 paper
UniGeo 14,541 2022 PDF
PGPS9K 9,022 2023 PDF
MathVista 6,141 2024 PDF
MathVerse 2,612 (×6 variants) 2024 PDF
MATH-Vision 3,040 2024 PDF
MV-MATH 2,009 2025 PDF
We-Math 6,500 2024 PDF

Formal Proving

  • MiniF2F — Zheng, Han, Polu. 2022. — 488 problems, 4 proof systems.
  • ProofNet — Azerbayev et al. 2023. — 371 undergrad theorems with informal/formal pairs.
  • PutnamBench — Tsoukalas et al. NeurIPS D&B, 2024. — 658 Putnam competition problems.
  • Lean Workbook — Ying et al. NeurIPS D&B, 2024. — 57,231 autoformalised problems.
  • DeepTheorem — Zhang et al. 2025. — 121K NL+formal proofs.

Frontier & Live

  • FrontierMath — Glazer et al. arXiv:2411.04872, 2024. — research-grade, private solutions.
  • HLE — Humanity's Last Exam — CAIS, Scale AI, HLE consortium. Nature 649, 2026. — 2,500 expert-authored items.
  • MathArena — Balunović et al. arXiv:2505.23281, 2025. — live contamination-free leaderboard.
  • MMLU — Hendrycks et al. ICLR, 2021.
  • MMLU-Pro — Wang et al. NeurIPS D&B, 2024.
  • GPQA — Rein et al. arXiv:2311.12022, 2023.
  • SuperGPQA — M-A-P Team. arXiv:2502.14739, 2025.
  • LiveBench — White et al. ICLR, 2025.

Process / Verifier Data

  • PRM800K — Lightman et al. ICLR, 2024. — 800K step-level labels.
  • Math-Shepherd — Wang et al. ACL, 2024. — annotation-free MC step estimation.
  • OmegaPRM — Luo et al. arXiv:2406.06592, 2024. — 1.5M+ labels.
  • MetaMathQA — Yu et al. 2024. — 395K backward-rewritten questions.
  • NuminaMath2024. — 860K+ multi-source competition SFT data.

Tabular Math, Scientific & Adjacent Benchmarks

  • SciBench — Wang et al. ICML, 2024. — 869 college-level physics/chemistry/calculus problems.
  • SciEval — Sun et al. AAAI, 2024. — 18K research-paper comprehension + experimental reasoning items.
  • ScienceQA — Lu et al. NeurIPS, 2022. — 21K multimodal grade-school science items.

Section VII — Failure Modes & Open Questions

Topics treated in the survey's failure-mode section:

  • Robustness and spurious correlations — Patel 2021 (SVAMP), Mirzadeh 2024 (GSM-Symbolic).
  • Metric mismatch and path optimality — Pass@1 vs Pass@k vs proof compilation; recommend reporting a process-sensitive measure alongside answer accuracy.
  • Benchmark contamination and the race to saturationMathArena evidence of 10–20pp inflation on AIME 2024 vs AIME 2025.
  • Reward hacking in RLVR training.
  • Multimodal-specific failure modes — MLLMs improving when the diagram is removed (MathVerse).
  • Hallucination in mathematical derivations and proofs.
  • Language transfer and localisationHRM8K, PolyMath gaps.
  • Multi-agent coordination and correlated errors — reviewer-pleasing bias / false consensus, non-termination death spirals, LaTeX-typesetting-creates-false-rigor (from the AI co-mathematician deployment).
  • Energy, carbon, and access.

Section VIII — Mathematicians' Perspectives


Related Surveys

Authors Title Scope Year Venue Paper
Zhang et al. The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers MWP solvers (statistical → neural) 2018 arXiv PDF
Mukherjee & Garain A Review of Methods for Automatic Understanding of Natural-Language Mathematical Problems Pre-DL MWP understanding 2008 AI Review Springer
Sundaram et al. Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers Neural MWP solvers 2022 arXiv PDF
Lu et al. A Survey of Deep Learning for Mathematical Reasoning Deep-learning era MWP/proving/geometry 2023 ACL PDF
Ahn et al. Large Language Models for Mathematical Reasoning: Progresses and Challenges LLM-era reasoning 2024 EACL SRW PDF
Liu et al. Mathematical Language Models: A Survey Math-specialised LLMs and pre-training 2025 ACM Comp. Surveys DOI
Wang et al. A Survey on Large Language Models for Mathematical Reasoning LLM reasoning techniques (CoT, RL, tools) 2026 ACM Comp. Surveys DOI
Yang et al. Formal Mathematical Reasoning: A New Frontier in AI Formal proving with LMs 2024 arXiv PDF
Li et al. A Survey on Deep Learning for Theorem Proving Neural / RL-based theorem provers 2024 COLM PDF
Yan et al. A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model Multimodal math (MLLM) 2025 Findings of ACL DOI
Guo et al. Large Language Model based Multi-Agents: A Survey of Progress and Challenges Multi-agent LLM systems (incl. math) 2024 arXiv PDF

Citation

If this survey or this companion list helps your research, please cite:

@article{raiyan2026ai4mathsurvey,
  title   = {Artificial Intelligence for Mathematical Reasoning:
             An Integrated Survey of Language Models, Neuro-symbolic Systems,
             and Verified Discovery},
  author  = {Raiyan, Syed Rifat and Kabir, Mohsinul and Mahmud, Hasan and Hasan, Md Kamrul},
  journal = {arXiv preprint arXiv:2606.08728},
  year    = {2026}
}

Contributing

Contributions are welcome — please open a PR that:

  1. Adds the paper under the most specific category it fits.
  2. Uses the format **[Title](pdf-url)** — Authors. *Venue Year.*
  3. Keeps the table-of-contents anchors in sync.
  4. For new datasets, adds a row to the appropriate dataset table (with a [PDF] link).
  5. For new model results that update an SOTA cell in the Benchmark Saturation table, also note the protocol (Pass@1 / Pass@k / tree search / TTRL).

If you'd like to suggest a structural change (new section, re-categorisation, alternative ordering), open an issue first so we can discuss.


License

This repository's content (the curated list, summaries, and figures) is released under the MIT License. Linked papers retain their original copyrights.


Made with care for the AI-for-mathematics community. Last updated May 2026.

About

Companion page for "Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery." A curated four-axis index covering MWPs, LLMs and reasoning models, multimodal geometry, Lean theorem proving, and verified discovery (FunSearch, AlphaEvolve, Erdős problems).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors