🧮 AI for Mathematical Reasoning

An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

A companion resource hub for the survey paper. This page indexes every system, dataset, benchmark, and methodology the survey discusses, organised by the same four-axis taxonomy: informal text-only · multimodal · formal · discovery.

TL;DR

What began in the mid-1960s as brittle pattern-matching programs for templated arithmetic has — within the past four years — expanded into a landscape of AI systems that solve Olympiad-level problems, formally verify mathematical arguments in Lean 4, and contribute to the resolution of open problems posed by Paul Erdős. This survey provides an integrated account of that arc, connecting the classical MWP lineage, multimodal geometry, formal theorem proving, and verified mathematical discovery through the reasoning-model era of 2024–2026.

Three things distinguish this survey:

Four-axis taxonomy — informal text, multimodal, formal, discovery — covered as co-equal axes rather than as a single LLM-centric thread.
Supervision-ladder framework — a unified reading of the field's progression from hand-coded schemata through formal proof-assistant kernels as a sequence of increasingly informative external verifiers.
Comprehension–Generation–Verification (CGV) triad — a refinement of recent two-stage LLM-centric framings, promoting verification from an optional post-hoc filter to a co-equal training-time component.

TL;DR
Visual Overview
Benchmark Saturation, 2021–2026
The Four Axes
Section I — Math Word Problem Solving
Section II — The LLM and Reasoning-Model Era
Section III — Multimodal & Geometry
Section IV — Formal Theorem Proving
Section V — Mathematical Discovery & Open Problems
Section VI — Datasets & Benchmarks
Section VII — Failure Modes & Open Questions
Section VIII — Mathematicians' Perspectives
Related Surveys
Citation
Contributing
License

Visual Overview

Paradigm chronology — seven swimlanes from STUDENT (1965) to verified discovery (2026):

Figure: Chronology of AI systems for mathematical reasoning, organised as paradigm swimlanes. Early symbolic and statistical MWP solvers (1965–2016, compressed axis) yield to neural expression generators (2017–2020), prompted and tool-augmented LLMs (2021–2023), and the 2024–2026 convergence of RLVR-trained reasoning models, Lean-based proof assistants, multimodal geometers, and verified-discovery systems. Stars mark IMO-medal milestones.

Methodology funnel — 1,847 candidate records → 238 included:

Benchmark Saturation, 2021–2026

Three patterns: (i) GSM8K and MATH have effectively saturated — both clear 95% by 2025; (ii) AIME 2024, FrontierMath, and PutnamBench show a discontinuity at the onset of the reasoning-model era; (iii) formal proving has moved fastest in relative terms — MiniF2F-test rose from ~25% (2022) to 93% (2026); PutnamBench from near-zero to over 60% in just over a year.

Benchmark	Pre-DL	Seq2Tree	LLM era	Reasoning era
MAWPS	~60%	92.0%	>95%	>99%
GSM8K	—	55%	94.6% (GPT-4o)	97.3% (Kimi K2)
MATH-500	—	6.9% (GPT-3)	52% (GPT-4)	99.2% (LongCat)
AIME 2024	—	—	12% (GPT-4o)	91.6% (o3)
AIME 2025	—	—	—	94.6% (SU-01 / GPT-5.5 H)
GPQA-D	—	—	53.6% (GPT-4o)	94.6% (Claude Mythos)
FrontierMath	—	—	<2% (GPT-4o)	48% (co-math. T4)ᶧ
MiniF2F-test	—	—	50% (DSP-V1)	99.6% (AlphaProof + TTRL)
PutnamBench	—	—	—	86/658 (Goedel-V2-32B)
IMO-ProofBench	—	—	—	80.7% (GPT-5.5 H)

_{ᶧ Tier-4 (research grade); other rows are Tier 1–3 average.}

The Four Axes

Axis	Defining task	Representative artefact	Verifier
Informal text-only	MWPs, competition QA	Numeric answer, CoT trace	Exact-match / answer parser
Multimodal & geometry	Diagram + text problems	Construction, proof sketch	Symbolic geometry solver, MLLM grader
Formal proving	Theorems in Lean / Coq / Isabelle	Tactic script, proof term	Proof-assistant kernel
Mathematical discovery	Open problems, improved bounds	Program, construction, verified theorem	Domain evaluator + Lean kernel + expert audit

Section I — Math Word Problem Solving

Robustness & Probing Benchmarks

SVAMP — Patel, Bhattamishra, Goyal. NAACL, 2021. — paraphrase / distractor / reordering perturbations.
ParaMAWPS — Math Word Problem Solving by Generating Linguistic Variants — Raiyan et al. ACL SRW, 2023.
GSM-Symbolic — Mirzadeh et al. arXiv:2410.05229, 2024. — name/number substitution; ≤65% drop with one irrelevant clause.
Functional benchmarks for robust evaluation — Srivastava et al. arXiv:2402.19450, 2024.
Shortcut learning in deep neural networks — Geirhos et al. Nature Machine Intelligence, 2020.

Section II — The LLM and Reasoning-Model Era

This section corresponds to the LLMs & Agents and Reasoning & Verification bands of the chronology figure.

Prompting-Era Innovations

Chain-of-Thought prompting elicits reasoning in large language models — Wei et al. NeurIPS, 2022.
Self-Consistency Improves Chain of Thought Reasoning — Wang et al. ICLR, 2023.
Large language models are zero-shot reasoners — Kojima et al. NeurIPS, 2022.
Least-to-Most Prompting Enables Complex Reasoning — Zhou et al. ICLR, 2023.
Tree of Thoughts — Yao et al. NeurIPS, 2023.
Graph of Thoughts — Besta et al. AAAI, 2024.
DUP — Deeply Understanding the Problems — Zhong et al. arXiv:2404.14963, 2024. — 97.1% on GSM8K via semantic decomposition.

Tool-Integrated Reasoning

PAL — Program-Aided Language Models — Gao et al. ICML, 2023.
PoT — Program of Thoughts — Chen et al. TMLR, 2023.
ToRA — Tool-Integrated Reasoning Agent — Gou et al. ICLR, 2024.
Code to think, think to code — Yang et al. arXiv:2502.19411, 2025. — survey of code-enhanced reasoning.

Self-Improvement & Bootstrapping

STaR — Bootstrapping reasoning with reasoning — Zelikman et al. NeurIPS, 2022.
Quiet-STaR — Zelikman et al. arXiv:2403.09629, 2024.
V-STaR — Training verifiers for self-taught reasoners — Hosseini et al. COLM, 2024.
ReFT — Reasoning with reinforced fine-tuning — Luong et al. ACL, 2024.
LIMO — Less is more for reasoning — Ye et al. arXiv:2502.03387, 2025.
SCoRe — Training LMs to self-correct via RL — Kumar et al. arXiv:2409.12917, 2024.
rStar-Math — Small LLMs master math via self-evolved deep thinking — Guan et al. arXiv:2404.07846, 2025.

Multi-Agent & Agentic Reasoning

Improving factuality and reasoning through multi-agent debate — Du et al. arXiv:2305.14325, 2023.
MAD — Encouraging divergent thinking through multi-agent debate — Liang et al. EMNLP, 2024.
ReConcile — Round-table conference among diverse LLMs — Chen et al. ACL, 2024.
DyLAN — Dynamic LLM-powered agent network — Liu et al. COLM, 2024.
Mixture-of-Agents (MoA) — Wang et al. arXiv:2406.04692, 2024.
Graph-of-Agents (GoA) — Yun et al. ICLR, 2026.
MAgICoRe — Iterative coarse-to-fine refinement — Chen et al. EMNLP, 2025.
Mars-PO — Multi-agent reasoning system preference optimisation — Lou et al. arXiv:2411.19039, 2024.
MALT — Multi-agent LLM training — Motwani et al. COLM, 2025.
MATTRL — Collaborative test-time RL for reasoning — Hu et al. arXiv:2601.09667, 2026.
LLM-based Multi-Agents: A Survey — Guo et al. arXiv:2402.01680, 2024.

Math-Specialized Foundation Models

Minerva — Solving quantitative reasoning problems with language models — Lewkowycz et al. NeurIPS, 2022.
Llemma — Open language model for mathematics — Azerbayev et al. ICLR, 2024.
DeepSeekMath — Shao et al. arXiv:2402.03300, 2024. — introduces GRPO.
Qwen2.5-Math — Yang et al. arXiv:2409.12122, 2024.
InternLM2-Math (Plus) — Ying et al. arXiv:2402.06332, 2024.
MetaMath — Yu et al. ICLR, 2024.
MAmmoTH — Yue et al. ICLR, 2024.
WizardMath — Luo et al. arXiv:2308.09583, 2023.

Process / Outcome Reward Models

Solving MWPs with process- and outcome-based feedback — Uesato et al. arXiv:2211.14275, 2022.
Let's verify step by step (PRM800K) — Lightman et al. ICLR, 2024.
Math-Shepherd — Verify and reinforce without human annotations — Wang et al. ACL, 2024.
OmegaPRM — Automated process supervision — Luo et al. arXiv:2406.06592, 2024.

Reasoning Models (o-series, R1, Kimi k1.5, SU-01, …)

OpenAI o1 — Learning to Reason with LLMs — OpenAI tech report, 2024.
DeepSeek-R1 — Incentivizing reasoning via RL — Guo et al. arXiv:2501.12948, 2025.
Kimi k1.5 — Scaling RL with LLMs — Kimi Team. arXiv:2501.12599, 2025.
s1 — Simple test-time scaling — Muennighoff et al. arXiv:2501.19393, 2025.
Scaling LLM test-time compute optimally — Snell et al. arXiv:2408.03314, 2024.
SU-01 — Gold-medal-level olympiad reasoning via simple and unified scaling — Li, Zhan, Zhang et al. arXiv:2605.13301, 2026. — 30B-A3B model, gold on IMO 2025 (35/42) and USAMO 2026 (35/42).

RL Algorithms for Reasoning

PPO — Proximal Policy Optimisation — Schulman et al. arXiv:1707.06347, 2017.
ReMax — REINFORCE-style baseline for LLM alignment — Li et al. ICML, 2024.
RLOO — Back to basics: REINFORCE-style optimisation for RLHF — Ahmadian et al. ACL, 2024.
REINFORCE++ — Hu et al. arXiv:2501.03262, 2025.
GRPO (introduced with DeepSeekMath) — Shao et al. arXiv:2402.03300, 2024. — critic-free group-normalised advantages.
DAPO — Open-source LLM RL system at scale — Yu et al. NeurIPS, 2025.
DPO — Direct Preference Optimisation — Rafailov et al. NeurIPS, 2023.
Step-DPO — Lai et al. arXiv:2406.18629, 2024.

Section III — Multimodal & Geometry

Classical Geometry Solvers

GEOS — Solving geometry problems: combining text and diagram interpretation — Seo et al. EMNLP, 2015.
GEOS++ — From textbooks to knowledge — Sachan, Dubey, Xing. EMNLP, 2017.
GeoShader — Synthesis of solutions for shaded-area problems — Alvin et al. FLAIRS, 2017.
GEOS-OS — Learning to solve geometry problems from textbook demonstrations — Sachan & Xing. *SEM, 2017.
Wu's method — On the decision problem and mechanisation of theorem-proving in elementary geometry — Wu. Scientia Sinica, 1978.
Wu's method revisited (Sinha et al.) — NeurIPS, 2024. — careful JGEX reimplementation solves 15/30 IMO-AG-30; Wu + AG = 27/30.

Neuro-symbolic & Neural GPS

InterGPS — Interpretable geometry problem solving — Lu et al. arXiv:2105.04165, 2021.
GeoQA — Multimodal numerical reasoning benchmark — Chen et al. arXiv:2105.14517, 2021.
PGPSNet — Multi-modal neural geometric solver — Zhang et al. IJCAI, 2023.
GeoDRL — Self-learning framework using RL in deductive reasoning — Peng et al. Findings of ACL, 2023.
LANS — Layout-aware neural solver for plane geometry — Li et al. Findings of ACL, 2024.
Pi-GPS — Unleashing the power of diagrammatic information — Zhao et al. ICCV, 2025.
FGeo-HyperGNet — FormalGeo + hypergraph neural network — Zhang et al. IJCAI, 2025.
MARS-GPS — Multi-CoT voting for geometric reasoning — Siddique et al. arXiv:2604.00890, 2026. — 88.8/77.48 on Geo3K/PGPS9K.

Olympiad Geometry (AlphaGeometry family, TongGeometry)

AlphaGeometry — Solving olympiad geometry without human demonstrations — Trinh et al. Nature 625, 2024. — 25/30 on IMO-AG-30.
AlphaGeometry 2 — Gold-medalist performance — Chervonyi et al. arXiv:2502.03544, 2025. — 30/30 (IMO-AG-30 full setup), 42/50 (IMO-AG-50).
TongGeometry — Guided tree search — Zhang et al. Nature Machine Intelligence, 2026. — 30/30 on IMO-AG-30 with consumer hardware.

Vision–Language Models for Math

G-LLaVA — Solving geometric problems with a multimodal LLM — Gao et al. arXiv:2312.11370, 2023.
MAVIS — Mathematical visual instruction tuning — Zhang et al. arXiv:2407.08739, 2024.
MINT-CoT — Interleaved visual tokens in mathematical CoT — Chen et al. NeurIPS, 2025.
R1-V — Reinforcing super-generalisation in VLMs — Chen et al. arXiv:2501.12015, 2025.
MM-Eureka — Visual aha moment via rule-based RL — Meng et al. arXiv:2503.07365, 2025.
GeoGRPO — Stepwise-GRPO in RLHF for geometry — Jiang et al. LNCS Springer, 2025. — 80.6 GeoQA / 73.4 MathVista.
Qwen2-VL — Wang et al. arXiv:2409.12191, 2024.
Qwen2.5-VL — Bai et al. arXiv:2502.13923, 2025.

Section IV — Formal Theorem Proving

Tactic Prediction & Search

GPT-f — Generative language modeling for automated theorem proving — Polu & Sutskever. arXiv:2009.03393, 2020. — 700M parameters.
HyperTree Proof Search — Lample et al. NeurIPS, 2022. — AlphaZero-style MCTS over proof states.
Draft, Sketch and Prove (DSP) — Jiang et al. ICLR, 2023. — informal-proof-guided formalisation.
Baldur — Whole-proof generation and repair — First et al. FSE, 2023.

Retrieval-Augmented Proving

LeanDojo — Theorem proving with retrieval-augmented LMs — Yang et al. NeurIPS D&B, 2023. — 26.5% Pass@1 on miniF2F-test.
Lean Copilot — LLMs as copilots for Lean — Song, Yang, Anandkumar. arXiv:2404.12534, 2024.

Whole-Proof Generation

Lean-STaR — Learning to interleave thinking and proving — Lin et al. ICLR, 2025. — 46.3% Pass@64 on miniF2F.
InternLM2-Math-Plus — Ying et al. 2024. — 43.4% Pass@32 on miniF2F.
DeepTheorem — Theorem proving via NL + RL — Zhang et al. arXiv:2505.23754, 2025.

Expert Iteration + RL Provers

DeepSeek-Prover V1 — Xin et al. arXiv:2405.14333, 2024. — 50.0% miniF2F (Pass@65k).
DeepSeek-Prover V1.5 — Xin et al. ICLR, 2025. — 63.5% miniF2F via RMaxTS.
DeepSeek-Prover V2 — Ren et al. arXiv:2504.21801, 2025. — 88.9% miniF2F, 49/658 PutnamBench (671B).
Kimina-Prover Preview — Wang et al. arXiv:2504.11354, 2025. — 80.7% miniF2F (72B).
Goedel-Prover V1 — Lin et al. arXiv:2502.07640, 2025. — 57.6% miniF2F, 7/658 PutnamBench.
Goedel-Prover V2 — Lin et al. arXiv:2508.03613, 2025. — 90.4% miniF2F, 86/658 PutnamBench (32B).
AlphaProof — Olympiad-level formal reasoning with RL — Hubert et al. Nature 651, 2026. — silver-tier IMO 2024 (28/42, including P6).
Aristotle — IMO-level automated theorem proving — Achim et al. arXiv:2510.01346, 2025. — used in Erdős solves.

Autoformalisation & Library Building

Autoformalisation with large language models — Wu et al. NeurIPS, 2022.
Improving autoformalisation using type checking — Poiroux et al. arXiv:2406.07222, 2024.
PDA — Process-Driven Autoformalisation in Lean 4 — Jiang et al. arXiv:2406.01940, 2024.
MathlibLemma — Folklore lemma generation — Liu et al. arXiv:2602.02561, 2026. — multi-agent Discovery/Judge/Formaliser.
The mathlib library — Lean mathlib community.

Compiler-Guided Repair

APOLLO — Automated LLM + Lean collaboration for proof repair — Sun et al. arXiv:2505.05758, 2025. — 84.9% miniF2F at ≤8B.

Section V — Mathematical Discovery & Open Problems

Program Search

FunSearch — Mathematical discoveries from program search — Romera-Paredes et al. Nature 625, 2024. — cap-set + bin-packing.
AlphaEvolve — A coding agent for scientific and algorithmic discovery — Novikov et al. arXiv:2506.13131, 2025. — improved Strassen on 4×4 complex matrices (48 mult.).
Mathematical exploration and discovery at scale — Georgiev, Gómez-Serrano, Tao, Wagner. arXiv:2511.02864, 2025. — 67 problems across analysis, combinatorics, geometry, number theory.

Erdős Workflow & AI-Assisted Solves

The Erdős Problems Website — Bloom. 2024. — 1,133 catalogued conjectures.
AI Contributions to Erdős Problems — Alexeev et al. community wiki, 2025–2026. — solves of #1026, #728, #729, #397.
Aristotle (used by B. Alexeev for #1026) — Achim et al. 2025.

Research Workbench Mode

AI co-mathematician — Accelerating mathematicians with agentic AI — Zheng et al. arXiv:2605.06651, 2026. — Google's hierarchical agent workbench; 48% on FrontierMath Tier 4; solves Kourovka 21.10, Stirling log-concavity, Hamiltonian perturbation lemma.
Aletheia — Towards autonomous mathematics research — Feng et al. arXiv:2602.10177, 2026.

Section VI — Datasets & Benchmarks

Classical MWP

Dataset	Size	Lang.	Year	Paper
AI2 (AddSub)	395	EN	2014	PDF
SingleEQ	508	EN	2015	PDF
Alg514	514	EN	2014	PDF
Dolphin18K	18,460	EN	2016	PDF
MAWPS	3,320	EN	2016	PDF
ASDiv	2,305	EN	2020	PDF
AQuA	97,975	EN	2017	PDF
Math23K	23,162	ZH	2017	PDF
Ape210K	210,488	ZH	2020	PDF
MathQA	37,297	EN	2019	PDF
SVAMP	1,000	EN	2021	PDF
ParaMAWPS	16,278	EN	2023	PDF
GSM-Symbolic	5,000	EN	2024	PDF

Competition & Olympiad

Dataset	Size	Year	Paper
GSM8K	8,792	2021	PDF
MATH	12,500	2021	PDF
OlympiadBench	8,476	2024	PDF
Omni-MATH	4,428	2024	PDF
FrontierMath	~350	2024	PDF
MathArena	149+/yr (live)	2025	PDF
IMO-ProofBench	60	2025	PDF

Multilingual Math

Dataset	Coverage	Year	Paper
MGSM	10 langs · 2.5K	2022	PDF
HAWP	Hindi · 2.3K	2022	PDF
ArMATH	Arabic · 6K	2022	PDF
CMATH	Chinese · 1.7K	2023	PDF
MathOctopus	10 langs · 73.6K	2023	PDF
HRM8K	Korean–English · 8K	2025	PDF
PatiGonit	Bengali · 10K	2025	PDF
BMWP	Bengali · 8.7K	2025	PDF
PolyMath	18 langs · ~9K	2025	paper
MathMist	13 langs · ~30K	2026	PDF
M3Kang	108 langs · 108K variants	2026	PDF

Tabular Math

TabMWP — Lu et al. ICLR, 2023. — 38,431 grade-school problems with tables.
FinQA — Chen et al. EMNLP, 2021. — 8,281 financial QA.
TAT-QA — Zhu et al. ACL, 2021. — 16,552 hybrid table+text questions.
MultiHiertt — Zhao et al. ACL, 2022. — 10,440 multi-hierarchy tables.
MultiTabQA — Pal et al. ACL, 2023.
Chameleon — Lu et al. NeurIPS, 2024.

Geometry & Multimodal

Dataset	Size	Year	Paper
GEOS	186	2015	PDF
GEOS++	1,406	2017	PDF
GeoShader	102	2017	PDF
GEOS-OS	2,235	2017	PDF
Geometry3K	3,002	2021	PDF
GeoQA	4,998	2021	PDF
GeoQA+	12,054	2022	paper
UniGeo	14,541	2022	PDF
PGPS9K	9,022	2023	PDF
MathVista	6,141	2024	PDF
MathVerse	2,612 (×6 variants)	2024	PDF
MATH-Vision	3,040	2024	PDF
MV-MATH	2,009	2025	PDF
We-Math	6,500	2024	PDF

Formal Proving

MiniF2F — Zheng, Han, Polu. 2022. — 488 problems, 4 proof systems.
ProofNet — Azerbayev et al. 2023. — 371 undergrad theorems with informal/formal pairs.
PutnamBench — Tsoukalas et al. NeurIPS D&B, 2024. — 658 Putnam competition problems.
Lean Workbook — Ying et al. NeurIPS D&B, 2024. — 57,231 autoformalised problems.
DeepTheorem — Zhang et al. 2025. — 121K NL+formal proofs.

Frontier & Live

FrontierMath — Glazer et al. arXiv:2411.04872, 2024. — research-grade, private solutions.
HLE — Humanity's Last Exam — CAIS, Scale AI, HLE consortium. Nature 649, 2026. — 2,500 expert-authored items.
MathArena — Balunović et al. arXiv:2505.23281, 2025. — live contamination-free leaderboard.
MMLU — Hendrycks et al. ICLR, 2021.
MMLU-Pro — Wang et al. NeurIPS D&B, 2024.
GPQA — Rein et al. arXiv:2311.12022, 2023.
SuperGPQA — M-A-P Team. arXiv:2502.14739, 2025.
LiveBench — White et al. ICLR, 2025.

Process / Verifier Data

PRM800K — Lightman et al. ICLR, 2024. — 800K step-level labels.
Math-Shepherd — Wang et al. ACL, 2024. — annotation-free MC step estimation.
OmegaPRM — Luo et al. arXiv:2406.06592, 2024. — 1.5M+ labels.
MetaMathQA — Yu et al. 2024. — 395K backward-rewritten questions.
NuminaMath — 2024. — 860K+ multi-source competition SFT data.

Tabular Math, Scientific & Adjacent Benchmarks

SciBench — Wang et al. ICML, 2024. — 869 college-level physics/chemistry/calculus problems.
SciEval — Sun et al. AAAI, 2024. — 18K research-paper comprehension + experimental reasoning items.
ScienceQA — Lu et al. NeurIPS, 2022. — 21K multimodal grade-school science items.

Section VII — Failure Modes & Open Questions

Topics treated in the survey's failure-mode section:

Robustness and spurious correlations — Patel 2021 (SVAMP), Mirzadeh 2024 (GSM-Symbolic).
Metric mismatch and path optimality — Pass@1 vs Pass@k vs proof compilation; recommend reporting a process-sensitive measure alongside answer accuracy.
Benchmark contamination and the race to saturation — MathArena evidence of 10–20pp inflation on AIME 2024 vs AIME 2025.
Reward hacking in RLVR training.
Multimodal-specific failure modes — MLLMs improving when the diagram is removed (MathVerse).
Hallucination in mathematical derivations and proofs.
Language transfer and localisation — HRM8K, PolyMath gaps.
Multi-agent coordination and correlated errors — reviewer-pleasing bias / false consensus, non-termination death spirals, LaTeX-typesetting-creates-false-rigor (from the AI co-mathematician deployment).
Energy, carbon, and access.

Section VIII — Mathematicians' Perspectives

Machine-Assisted Proof — Tao. 2024.
AI Will Become Mathematicians' Co-Pilot — Tao. Scientific American interview, 2024.
AI Is Ready for Primetime in Math and Theoretical Physics — Tao. IPAM talk, OpenAI Academy, 2026.
Mathematical methods and human thought in the age of AI — Klowden & Tao. Blackwell Companion to the Philosophy of Mathematics (to appear), 2026. — blog post

Related Surveys

Authors	Title	Scope	Year	Venue	Paper
Zhang et al.	The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers	MWP solvers (statistical → neural)	2018	arXiv	PDF
Mukherjee & Garain	A Review of Methods for Automatic Understanding of Natural-Language Mathematical Problems	Pre-DL MWP understanding	2008	AI Review	Springer
Sundaram et al.	Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers	Neural MWP solvers	2022	arXiv	PDF
Lu et al.	A Survey of Deep Learning for Mathematical Reasoning	Deep-learning era MWP/proving/geometry	2023	ACL	PDF
Ahn et al.	Large Language Models for Mathematical Reasoning: Progresses and Challenges	LLM-era reasoning	2024	EACL SRW	PDF
Liu et al.	Mathematical Language Models: A Survey	Math-specialised LLMs and pre-training	2025	ACM Comp. Surveys	DOI
Wang et al.	A Survey on Large Language Models for Mathematical Reasoning	LLM reasoning techniques (CoT, RL, tools)	2026	ACM Comp. Surveys	DOI
Yang et al.	Formal Mathematical Reasoning: A New Frontier in AI	Formal proving with LMs	2024	arXiv	PDF
Li et al.	A Survey on Deep Learning for Theorem Proving	Neural / RL-based theorem provers	2024	COLM	PDF
Yan et al.	A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model	Multimodal math (MLLM)	2025	Findings of ACL	DOI
Guo et al.	Large Language Model based Multi-Agents: A Survey of Progress and Challenges	Multi-agent LLM systems (incl. math)	2024	arXiv	PDF

Citation

If this survey or this companion list helps your research, please cite:

@article{raiyan2026ai4mathsurvey,
  title   = {Artificial Intelligence for Mathematical Reasoning:
             An Integrated Survey of Language Models, Neuro-symbolic Systems,
             and Verified Discovery},
  author  = {Raiyan, Syed Rifat and Kabir, Mohsinul and Mahmud, Hasan and Hasan, Md Kamrul},
  journal = {arXiv preprint arXiv:2606.08728},
  year    = {2026}
}

Contributing

Contributions are welcome — please open a PR that:

Adds the paper under the most specific category it fits.
Uses the format **[Title](pdf-url)** — Authors. *Venue Year.*
Keeps the table-of-contents anchors in sync.
For new datasets, adds a row to the appropriate dataset table (with a [PDF] link).
For new model results that update an SOTA cell in the Benchmark Saturation table, also note the protocol (Pass@1 / Pass@k / tree search / TTRL).

If you'd like to suggest a structural change (new section, re-categorisation, alternative ordering), open an issue first so we can discuss.

License

This repository's content (the curated list, summaries, and figures) is released under the MIT License. Linked papers retain their original copyrights.

_{Made with care for the AI-for-mathematics community. Last updated May 2026.}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
images		images
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🧮 AI for Mathematical Reasoning

An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

TL;DR

Contents

Visual Overview

Benchmark Saturation, 2021–2026

The Four Axes

Section I — Math Word Problem Solving

Rule-based & Statistical

Tree-based & Semantic Parsing

Neural MWP (Seq2Seq, GTS, Graph2Tree)

Robustness & Probing Benchmarks

Section II — The LLM and Reasoning-Model Era

Prompting-Era Innovations

Tool-Integrated Reasoning

Self-Improvement & Bootstrapping

Multi-Agent & Agentic Reasoning

Math-Specialized Foundation Models

Process / Outcome Reward Models

Reasoning Models (o-series, R1, Kimi k1.5, SU-01, …)

RL Algorithms for Reasoning

Section III — Multimodal & Geometry

Classical Geometry Solvers

Neuro-symbolic & Neural GPS

Olympiad Geometry (AlphaGeometry family, TongGeometry)

Vision–Language Models for Math

Section IV — Formal Theorem Proving

Tactic Prediction & Search

Retrieval-Augmented Proving

Whole-Proof Generation

Expert Iteration + RL Provers

Autoformalisation & Library Building

Compiler-Guided Repair

Section V — Mathematical Discovery & Open Problems

Program Search

Erdős Workflow & AI-Assisted Solves

Research Workbench Mode

Section VI — Datasets & Benchmarks

Classical MWP

Competition & Olympiad

Multilingual Math

Tabular Math

Geometry & Multimodal

Formal Proving

Frontier & Live

Process / Verifier Data

Tabular Math, Scientific & Adjacent Benchmarks

Section VII — Failure Modes & Open Questions

Section VIII — Mathematicians' Perspectives

Related Surveys

Citation

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages