A companion resource hub for the survey paper. This page indexes every system, dataset, benchmark, and methodology the survey discusses, organised by the same four-axis taxonomy: informal text-only · multimodal · formal · discovery.
What began in the mid-1960s as brittle pattern-matching programs for templated arithmetic has — within the past four years — expanded into a landscape of AI systems that solve Olympiad-level problems, formally verify mathematical arguments in Lean 4, and contribute to the resolution of open problems posed by Paul Erdős. This survey provides an integrated account of that arc, connecting the classical MWP lineage, multimodal geometry, formal theorem proving, and verified mathematical discovery through the reasoning-model era of 2024–2026.
Three things distinguish this survey:
- Four-axis taxonomy — informal text, multimodal, formal, discovery — covered as co-equal axes rather than as a single LLM-centric thread.
- Supervision-ladder framework — a unified reading of the field's progression from hand-coded schemata through formal proof-assistant kernels as a sequence of increasingly informative external verifiers.
- Comprehension–Generation–Verification (CGV) triad — a refinement of recent two-stage LLM-centric framings, promoting verification from an optional post-hoc filter to a co-equal training-time component.
- TL;DR
- Visual Overview
- Benchmark Saturation, 2021–2026
- The Four Axes
- Section I — Math Word Problem Solving
- Section II — The LLM and Reasoning-Model Era
- Section III — Multimodal & Geometry
- Section IV — Formal Theorem Proving
- Section V — Mathematical Discovery & Open Problems
- Section VI — Datasets & Benchmarks
- Section VII — Failure Modes & Open Questions
- Section VIII — Mathematicians' Perspectives
- Related Surveys
- Citation
- Contributing
- License
Paradigm chronology — seven swimlanes from STUDENT (1965) to verified discovery (2026):
Figure: Chronology of AI systems for mathematical reasoning, organised as paradigm swimlanes. Early symbolic and statistical MWP solvers (1965–2016, compressed axis) yield to neural expression generators (2017–2020), prompted and tool-augmented LLMs (2021–2023), and the 2024–2026 convergence of RLVR-trained reasoning models, Lean-based proof assistants, multimodal geometers, and verified-discovery systems. Stars mark IMO-medal milestones.
Methodology funnel — 1,847 candidate records → 238 included:
Three patterns: (i) GSM8K and MATH have effectively saturated — both clear 95% by 2025; (ii) AIME 2024, FrontierMath, and PutnamBench show a discontinuity at the onset of the reasoning-model era; (iii) formal proving has moved fastest in relative terms — MiniF2F-test rose from ~25% (2022) to 93% (2026); PutnamBench from near-zero to over 60% in just over a year.
| Benchmark | Pre-DL | Seq2Tree | LLM era | Reasoning era |
|---|---|---|---|---|
| MAWPS | ~60% | 92.0% | >95% | >99% |
| GSM8K | — | 55% | 94.6% (GPT-4o) | 97.3% (Kimi K2) |
| MATH-500 | — | 6.9% (GPT-3) | 52% (GPT-4) | 99.2% (LongCat) |
| AIME 2024 | — | — | 12% (GPT-4o) | 91.6% (o3) |
| AIME 2025 | — | — | — | 94.6% (SU-01 / GPT-5.5 H) |
| GPQA-D | — | — | 53.6% (GPT-4o) | 94.6% (Claude Mythos) |
| FrontierMath | — | — | <2% (GPT-4o) | 48% *(co-math. T4)*ᶧ |
| MiniF2F-test | — | — | 50% (DSP-V1) | 99.6% (AlphaProof + TTRL) |
| PutnamBench | — | — | — | 86/658 (Goedel-V2-32B) |
| IMO-ProofBench | — | — | — | 80.7% (GPT-5.5 H) |
ᶧ Tier-4 (research grade); other rows are Tier 1–3 average.
| Axis | Defining task | Representative artefact | Verifier |
|---|---|---|---|
| Informal text-only | MWPs, competition QA | Numeric answer, CoT trace | Exact-match / answer parser |
| Multimodal & geometry | Diagram + text problems | Construction, proof sketch | Symbolic geometry solver, MLLM grader |
| Formal proving | Theorems in Lean / Coq / Isabelle | Tactic script, proof term | Proof-assistant kernel |
| Mathematical discovery | Open problems, improved bounds | Program, construction, verified theorem | Domain evaluator + Lean kernel + expert audit |
- STUDENT — Natural language input for a computer problem solving system — Bobrow. MIT AI Memo, 1964.
- Computers and Thought — Feigenbaum & Feldman (eds.). McGraw-Hill, 1963.
- Understanding and solving arithmetic word problems: A computer simulation (WordPro) — Fletcher. Behaviour Research Methods, 1985.
- Robust understanding of word problems with extraneous information — Bakman. arXiv:math/0701393, 2007.
- Frame-based calculus of solving arithmetic multi-step addition and subtraction word problems — Yuhui et al. ETCS, 2010.
- A review of methods for automatic understanding of natural-language mathematical problems — Mukherjee & Garain. Artificial Intelligence Review, 2008.
- Learning to automatically solve algebra word problems — Kushman et al. ACL, 2014.
- Reasoning about quantities in natural language — Roy, Vieira, Roth. TACL, 2015.
- ARIS — Learning to solve arithmetic problems with verb categorisation — Hosseini et al. EMNLP, 2014.
- Learn to solve algebra word problems using quadratic programming — Zhou, Dai, Chen. EMNLP, 2015.
- Learning to use formulae to solve simple arithmetic problems — Mitra & Baral. ACL, 2016.
- A tag-based English math word problem solver — Liang et al. NAACL Demos, 2016.
- Unit dependency graph and its application to arithmetic word problem solving — Roy & Roth. AAAI, 2017.
- Solving general arithmetic word problems — Roy & Roth. arXiv:1608.01413, 2016.
- Equation parsing: Mapping sentences to grounded equations — Roy, Upadhyay, Roth. arXiv:1609.08824, 2016.
- Parsing algebraic word problems into equations — Koncel-Kedziorski et al. TACL, 2015.
- A goal-driven tree-structured neural model for math word problems (GTS) — Xie & Sun. IJCAI, 2019.
- Tree-structured decoding for solving math word problems — Liu et al. EMNLP-IJCNLP, 2019.
- Graph2Tree — Graph-to-tree learning for solving math word problems — Zhang et al. ACL, 2020.
- A knowledge-aware sequence-to-tree network — Wu et al. EMNLP, 2020.
- Teacher–student networks with multiple decoders — Zhang et al. IJCAI, 2020.
- Improving MWPs with pre-trained knowledge and hierarchical reasoning — Yu et al. EMNLP, 2021.
- HMS — A hierarchical solver with dependency-enhanced understanding — Lin et al. AAAI, 2021.
- Neural-symbolic solver for MWPs with auxiliary tasks — Qin et al. arXiv:2107.01431, 2021.
- MWP-BERT — A strong baseline for MWPs — Liang et al. arXiv:2107.13435, 2021.
- Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction — Jie, Li, Lu. ACL, 2022.
- Deep neural solver for math word problems — Wang, Liu, Shi. EMNLP, 2017. — first large-scale Seq2Seq on Math23K.
- Translating a math word problem to an expression tree — Wang et al. arXiv:1811.05632, 2018.
- Semantically-aligned equation generation — Chiang & Chen. NAACL, 2019.
- Modeling intra-relation in MWPs with multi-head attention — Li et al. ACL, 2019.
- Template-based MWP solvers with recursive neural networks — Wang et al. AAAI, 2019.
- Neural MWP solver with reinforcement learning — Huang et al. COLING, 2018.
- Reverse-operation-based data augmentation — Liu et al. arXiv:2010.01556, 2020.
- Point to the expression — Expression-pointer transformer — Kim et al. EMNLP, 2020.
- Solving MWPs with multi-encoders and multi-decoders — Shen & Jin. COLING, 2020.
- SMART — Situation model for algebra story problems via attributed grammar — Hong et al. AAAI, 2021.
- Learning by fixing — Weakly supervised MWPs — Hong et al. AAAI, 2021.
- Recall and Learn — Memory-augmented MWP solver — Huang et al. arXiv:2109.13112, 2021.
- Are NLP models really able to solve simple math word problems? (SVAMP) — Patel et al. NAACL, 2021.
- Generate & Rank — Multi-task framework for MWPs — Shen et al. arXiv:2109.03034, 2021.
- SVAMP — Patel, Bhattamishra, Goyal. NAACL, 2021. — paraphrase / distractor / reordering perturbations.
- ParaMAWPS — Math Word Problem Solving by Generating Linguistic Variants — Raiyan et al. ACL SRW, 2023.
- GSM-Symbolic — Mirzadeh et al. arXiv:2410.05229, 2024. — name/number substitution; ≤65% drop with one irrelevant clause.
- Functional benchmarks for robust evaluation — Srivastava et al. arXiv:2402.19450, 2024.
- Shortcut learning in deep neural networks — Geirhos et al. Nature Machine Intelligence, 2020.
This section corresponds to the LLMs & Agents and Reasoning & Verification bands of the chronology figure.
- Chain-of-Thought prompting elicits reasoning in large language models — Wei et al. NeurIPS, 2022.
- Self-Consistency Improves Chain of Thought Reasoning — Wang et al. ICLR, 2023.
- Large language models are zero-shot reasoners — Kojima et al. NeurIPS, 2022.
- Least-to-Most Prompting Enables Complex Reasoning — Zhou et al. ICLR, 2023.
- Tree of Thoughts — Yao et al. NeurIPS, 2023.
- Graph of Thoughts — Besta et al. AAAI, 2024.
- DUP — Deeply Understanding the Problems — Zhong et al. arXiv:2404.14963, 2024. — 97.1% on GSM8K via semantic decomposition.
- PAL — Program-Aided Language Models — Gao et al. ICML, 2023.
- PoT — Program of Thoughts — Chen et al. TMLR, 2023.
- ToRA — Tool-Integrated Reasoning Agent — Gou et al. ICLR, 2024.
- Code to think, think to code — Yang et al. arXiv:2502.19411, 2025. — survey of code-enhanced reasoning.
- STaR — Bootstrapping reasoning with reasoning — Zelikman et al. NeurIPS, 2022.
- Quiet-STaR — Zelikman et al. arXiv:2403.09629, 2024.
- V-STaR — Training verifiers for self-taught reasoners — Hosseini et al. COLM, 2024.
- ReFT — Reasoning with reinforced fine-tuning — Luong et al. ACL, 2024.
- LIMO — Less is more for reasoning — Ye et al. arXiv:2502.03387, 2025.
- SCoRe — Training LMs to self-correct via RL — Kumar et al. arXiv:2409.12917, 2024.
- rStar-Math — Small LLMs master math via self-evolved deep thinking — Guan et al. arXiv:2404.07846, 2025.
- Improving factuality and reasoning through multi-agent debate — Du et al. arXiv:2305.14325, 2023.
- MAD — Encouraging divergent thinking through multi-agent debate — Liang et al. EMNLP, 2024.
- ReConcile — Round-table conference among diverse LLMs — Chen et al. ACL, 2024.
- DyLAN — Dynamic LLM-powered agent network — Liu et al. COLM, 2024.
- Mixture-of-Agents (MoA) — Wang et al. arXiv:2406.04692, 2024.
- Graph-of-Agents (GoA) — Yun et al. ICLR, 2026.
- MAgICoRe — Iterative coarse-to-fine refinement — Chen et al. EMNLP, 2025.
- Mars-PO — Multi-agent reasoning system preference optimisation — Lou et al. arXiv:2411.19039, 2024.
- MALT — Multi-agent LLM training — Motwani et al. COLM, 2025.
- MATTRL — Collaborative test-time RL for reasoning — Hu et al. arXiv:2601.09667, 2026.
- LLM-based Multi-Agents: A Survey — Guo et al. arXiv:2402.01680, 2024.
- Minerva — Solving quantitative reasoning problems with language models — Lewkowycz et al. NeurIPS, 2022.
- Llemma — Open language model for mathematics — Azerbayev et al. ICLR, 2024.
- DeepSeekMath — Shao et al. arXiv:2402.03300, 2024. — introduces GRPO.
- Qwen2.5-Math — Yang et al. arXiv:2409.12122, 2024.
- InternLM2-Math (Plus) — Ying et al. arXiv:2402.06332, 2024.
- MetaMath — Yu et al. ICLR, 2024.
- MAmmoTH — Yue et al. ICLR, 2024.
- WizardMath — Luo et al. arXiv:2308.09583, 2023.
- Solving MWPs with process- and outcome-based feedback — Uesato et al. arXiv:2211.14275, 2022.
- Let's verify step by step (PRM800K) — Lightman et al. ICLR, 2024.
- Math-Shepherd — Verify and reinforce without human annotations — Wang et al. ACL, 2024.
- OmegaPRM — Automated process supervision — Luo et al. arXiv:2406.06592, 2024.
- OpenAI o1 — Learning to Reason with LLMs — OpenAI tech report, 2024.
- DeepSeek-R1 — Incentivizing reasoning via RL — Guo et al. arXiv:2501.12948, 2025.
- Kimi k1.5 — Scaling RL with LLMs — Kimi Team. arXiv:2501.12599, 2025.
- s1 — Simple test-time scaling — Muennighoff et al. arXiv:2501.19393, 2025.
- Scaling LLM test-time compute optimally — Snell et al. arXiv:2408.03314, 2024.
- SU-01 — Gold-medal-level olympiad reasoning via simple and unified scaling — Li, Zhan, Zhang et al. arXiv:2605.13301, 2026. — 30B-A3B model, gold on IMO 2025 (35/42) and USAMO 2026 (35/42).
- PPO — Proximal Policy Optimisation — Schulman et al. arXiv:1707.06347, 2017.
- ReMax — REINFORCE-style baseline for LLM alignment — Li et al. ICML, 2024.
- RLOO — Back to basics: REINFORCE-style optimisation for RLHF — Ahmadian et al. ACL, 2024.
- REINFORCE++ — Hu et al. arXiv:2501.03262, 2025.
- GRPO (introduced with DeepSeekMath) — Shao et al. arXiv:2402.03300, 2024. — critic-free group-normalised advantages.
- DAPO — Open-source LLM RL system at scale — Yu et al. NeurIPS, 2025.
- DPO — Direct Preference Optimisation — Rafailov et al. NeurIPS, 2023.
- Step-DPO — Lai et al. arXiv:2406.18629, 2024.
- GEOS — Solving geometry problems: combining text and diagram interpretation — Seo et al. EMNLP, 2015.
- GEOS++ — From textbooks to knowledge — Sachan, Dubey, Xing. EMNLP, 2017.
- GeoShader — Synthesis of solutions for shaded-area problems — Alvin et al. FLAIRS, 2017.
- GEOS-OS — Learning to solve geometry problems from textbook demonstrations — Sachan & Xing. *SEM, 2017.
- Wu's method — On the decision problem and mechanisation of theorem-proving in elementary geometry — Wu. Scientia Sinica, 1978.
- Wu's method revisited (Sinha et al.) — NeurIPS, 2024. — careful JGEX reimplementation solves 15/30 IMO-AG-30; Wu + AG = 27/30.
- InterGPS — Interpretable geometry problem solving — Lu et al. arXiv:2105.04165, 2021.
- GeoQA — Multimodal numerical reasoning benchmark — Chen et al. arXiv:2105.14517, 2021.
- PGPSNet — Multi-modal neural geometric solver — Zhang et al. IJCAI, 2023.
- GeoDRL — Self-learning framework using RL in deductive reasoning — Peng et al. Findings of ACL, 2023.
- LANS — Layout-aware neural solver for plane geometry — Li et al. Findings of ACL, 2024.
- Pi-GPS — Unleashing the power of diagrammatic information — Zhao et al. ICCV, 2025.
- FGeo-HyperGNet — FormalGeo + hypergraph neural network — Zhang et al. IJCAI, 2025.
- MARS-GPS — Multi-CoT voting for geometric reasoning — Siddique et al. arXiv:2604.00890, 2026. — 88.8/77.48 on Geo3K/PGPS9K.
- AlphaGeometry — Solving olympiad geometry without human demonstrations — Trinh et al. Nature 625, 2024. — 25/30 on IMO-AG-30.
- AlphaGeometry 2 — Gold-medalist performance — Chervonyi et al. arXiv:2502.03544, 2025. — 30/30 (IMO-AG-30 full setup), 42/50 (IMO-AG-50).
- TongGeometry — Guided tree search — Zhang et al. Nature Machine Intelligence, 2026. — 30/30 on IMO-AG-30 with consumer hardware.
- G-LLaVA — Solving geometric problems with a multimodal LLM — Gao et al. arXiv:2312.11370, 2023.
- MAVIS — Mathematical visual instruction tuning — Zhang et al. arXiv:2407.08739, 2024.
- MINT-CoT — Interleaved visual tokens in mathematical CoT — Chen et al. NeurIPS, 2025.
- R1-V — Reinforcing super-generalisation in VLMs — Chen et al. arXiv:2501.12015, 2025.
- MM-Eureka — Visual aha moment via rule-based RL — Meng et al. arXiv:2503.07365, 2025.
- GeoGRPO — Stepwise-GRPO in RLHF for geometry — Jiang et al. LNCS Springer, 2025. — 80.6 GeoQA / 73.4 MathVista.
- Qwen2-VL — Wang et al. arXiv:2409.12191, 2024.
- Qwen2.5-VL — Bai et al. arXiv:2502.13923, 2025.
- GPT-f — Generative language modeling for automated theorem proving — Polu & Sutskever. arXiv:2009.03393, 2020. — 700M parameters.
- HyperTree Proof Search — Lample et al. NeurIPS, 2022. — AlphaZero-style MCTS over proof states.
- Draft, Sketch and Prove (DSP) — Jiang et al. ICLR, 2023. — informal-proof-guided formalisation.
- Baldur — Whole-proof generation and repair — First et al. FSE, 2023.
- LeanDojo — Theorem proving with retrieval-augmented LMs — Yang et al. NeurIPS D&B, 2023. — 26.5% Pass@1 on miniF2F-test.
- Lean Copilot — LLMs as copilots for Lean — Song, Yang, Anandkumar. arXiv:2404.12534, 2024.
- Lean-STaR — Learning to interleave thinking and proving — Lin et al. ICLR, 2025. — 46.3% Pass@64 on miniF2F.
- InternLM2-Math-Plus — Ying et al. 2024. — 43.4% Pass@32 on miniF2F.
- DeepTheorem — Theorem proving via NL + RL — Zhang et al. arXiv:2505.23754, 2025.
- DeepSeek-Prover V1 — Xin et al. arXiv:2405.14333, 2024. — 50.0% miniF2F (Pass@65k).
- DeepSeek-Prover V1.5 — Xin et al. ICLR, 2025. — 63.5% miniF2F via RMaxTS.
- DeepSeek-Prover V2 — Ren et al. arXiv:2504.21801, 2025. — 88.9% miniF2F, 49/658 PutnamBench (671B).
- Kimina-Prover Preview — Wang et al. arXiv:2504.11354, 2025. — 80.7% miniF2F (72B).
- Goedel-Prover V1 — Lin et al. arXiv:2502.07640, 2025. — 57.6% miniF2F, 7/658 PutnamBench.
- Goedel-Prover V2 — Lin et al. arXiv:2508.03613, 2025. — 90.4% miniF2F, 86/658 PutnamBench (32B).
- AlphaProof — Olympiad-level formal reasoning with RL — Hubert et al. Nature 651, 2026. — silver-tier IMO 2024 (28/42, including P6).
- Aristotle — IMO-level automated theorem proving — Achim et al. arXiv:2510.01346, 2025. — used in Erdős solves.
- Autoformalisation with large language models — Wu et al. NeurIPS, 2022.
- Improving autoformalisation using type checking — Poiroux et al. arXiv:2406.07222, 2024.
- PDA — Process-Driven Autoformalisation in Lean 4 — Jiang et al. arXiv:2406.01940, 2024.
- MathlibLemma — Folklore lemma generation — Liu et al. arXiv:2602.02561, 2026. — multi-agent Discovery/Judge/Formaliser.
- The mathlib library — Lean mathlib community.
- APOLLO — Automated LLM + Lean collaboration for proof repair — Sun et al. arXiv:2505.05758, 2025. — 84.9% miniF2F at ≤8B.
- FunSearch — Mathematical discoveries from program search — Romera-Paredes et al. Nature 625, 2024. — cap-set + bin-packing.
- AlphaEvolve — A coding agent for scientific and algorithmic discovery — Novikov et al. arXiv:2506.13131, 2025. — improved Strassen on 4×4 complex matrices (48 mult.).
- Mathematical exploration and discovery at scale — Georgiev, Gómez-Serrano, Tao, Wagner. arXiv:2511.02864, 2025. — 67 problems across analysis, combinatorics, geometry, number theory.
- The Erdős Problems Website — Bloom. 2024. — 1,133 catalogued conjectures.
- AI Contributions to Erdős Problems — Alexeev et al. community wiki, 2025–2026. — solves of #1026, #728, #729, #397.
- Aristotle (used by B. Alexeev for #1026) — Achim et al. 2025.
- AI co-mathematician — Accelerating mathematicians with agentic AI — Zheng et al. arXiv:2605.06651, 2026. — Google's hierarchical agent workbench; 48% on FrontierMath Tier 4; solves Kourovka 21.10, Stirling log-concavity, Hamiltonian perturbation lemma.
- Aletheia — Towards autonomous mathematics research — Feng et al. arXiv:2602.10177, 2026.
| Dataset | Size | Lang. | Year | Paper |
|---|---|---|---|---|
| AI2 (AddSub) | 395 | EN | 2014 | |
| SingleEQ | 508 | EN | 2015 | |
| Alg514 | 514 | EN | 2014 | |
| Dolphin18K | 18,460 | EN | 2016 | |
| MAWPS | 3,320 | EN | 2016 | |
| ASDiv | 2,305 | EN | 2020 | |
| AQuA | 97,975 | EN | 2017 | |
| Math23K | 23,162 | ZH | 2017 | |
| Ape210K | 210,488 | ZH | 2020 | |
| MathQA | 37,297 | EN | 2019 | |
| SVAMP | 1,000 | EN | 2021 | |
| ParaMAWPS | 16,278 | EN | 2023 | |
| GSM-Symbolic | 5,000 | EN | 2024 |
| Dataset | Size | Year | Paper |
|---|---|---|---|
| GSM8K | 8,792 | 2021 | |
| MATH | 12,500 | 2021 | |
| OlympiadBench | 8,476 | 2024 | |
| Omni-MATH | 4,428 | 2024 | |
| FrontierMath | ~350 | 2024 | |
| MathArena | 149+/yr (live) | 2025 | |
| IMO-ProofBench | 60 | 2025 |
| Dataset | Coverage | Year | Paper |
|---|---|---|---|
| MGSM | 10 langs · 2.5K | 2022 | |
| HAWP | Hindi · 2.3K | 2022 | |
| ArMATH | Arabic · 6K | 2022 | |
| CMATH | Chinese · 1.7K | 2023 | |
| MathOctopus | 10 langs · 73.6K | 2023 | |
| HRM8K | Korean–English · 8K | 2025 | |
| PatiGonit | Bengali · 10K | 2025 | |
| BMWP | Bengali · 8.7K | 2025 | |
| PolyMath | 18 langs · ~9K | 2025 | paper |
| MathMist | 13 langs · ~30K | 2026 | |
| M3Kang | 108 langs · 108K variants | 2026 |
- TabMWP — Lu et al. ICLR, 2023. — 38,431 grade-school problems with tables.
- FinQA — Chen et al. EMNLP, 2021. — 8,281 financial QA.
- TAT-QA — Zhu et al. ACL, 2021. — 16,552 hybrid table+text questions.
- MultiHiertt — Zhao et al. ACL, 2022. — 10,440 multi-hierarchy tables.
- MultiTabQA — Pal et al. ACL, 2023.
- Chameleon — Lu et al. NeurIPS, 2024.
| Dataset | Size | Year | Paper |
|---|---|---|---|
| GEOS | 186 | 2015 | |
| GEOS++ | 1,406 | 2017 | |
| GeoShader | 102 | 2017 | |
| GEOS-OS | 2,235 | 2017 | |
| Geometry3K | 3,002 | 2021 | |
| GeoQA | 4,998 | 2021 | |
| GeoQA+ | 12,054 | 2022 | paper |
| UniGeo | 14,541 | 2022 | |
| PGPS9K | 9,022 | 2023 | |
| MathVista | 6,141 | 2024 | |
| MathVerse | 2,612 (×6 variants) | 2024 | |
| MATH-Vision | 3,040 | 2024 | |
| MV-MATH | 2,009 | 2025 | |
| We-Math | 6,500 | 2024 |
- MiniF2F — Zheng, Han, Polu. 2022. — 488 problems, 4 proof systems.
- ProofNet — Azerbayev et al. 2023. — 371 undergrad theorems with informal/formal pairs.
- PutnamBench — Tsoukalas et al. NeurIPS D&B, 2024. — 658 Putnam competition problems.
- Lean Workbook — Ying et al. NeurIPS D&B, 2024. — 57,231 autoformalised problems.
- DeepTheorem — Zhang et al. 2025. — 121K NL+formal proofs.
- FrontierMath — Glazer et al. arXiv:2411.04872, 2024. — research-grade, private solutions.
- HLE — Humanity's Last Exam — CAIS, Scale AI, HLE consortium. Nature 649, 2026. — 2,500 expert-authored items.
- MathArena — Balunović et al. arXiv:2505.23281, 2025. — live contamination-free leaderboard.
- MMLU — Hendrycks et al. ICLR, 2021.
- MMLU-Pro — Wang et al. NeurIPS D&B, 2024.
- GPQA — Rein et al. arXiv:2311.12022, 2023.
- SuperGPQA — M-A-P Team. arXiv:2502.14739, 2025.
- LiveBench — White et al. ICLR, 2025.
- PRM800K — Lightman et al. ICLR, 2024. — 800K step-level labels.
- Math-Shepherd — Wang et al. ACL, 2024. — annotation-free MC step estimation.
- OmegaPRM — Luo et al. arXiv:2406.06592, 2024. — 1.5M+ labels.
- MetaMathQA — Yu et al. 2024. — 395K backward-rewritten questions.
- NuminaMath — 2024. — 860K+ multi-source competition SFT data.
- SciBench — Wang et al. ICML, 2024. — 869 college-level physics/chemistry/calculus problems.
- SciEval — Sun et al. AAAI, 2024. — 18K research-paper comprehension + experimental reasoning items.
- ScienceQA — Lu et al. NeurIPS, 2022. — 21K multimodal grade-school science items.
Topics treated in the survey's failure-mode section:
- Robustness and spurious correlations — Patel 2021 (SVAMP), Mirzadeh 2024 (GSM-Symbolic).
- Metric mismatch and path optimality — Pass@1 vs Pass@k vs proof compilation; recommend reporting a process-sensitive measure alongside answer accuracy.
- Benchmark contamination and the race to saturation — MathArena evidence of 10–20pp inflation on AIME 2024 vs AIME 2025.
- Reward hacking in RLVR training.
- Multimodal-specific failure modes — MLLMs improving when the diagram is removed (MathVerse).
- Hallucination in mathematical derivations and proofs.
- Language transfer and localisation — HRM8K, PolyMath gaps.
- Multi-agent coordination and correlated errors — reviewer-pleasing bias / false consensus, non-termination death spirals, LaTeX-typesetting-creates-false-rigor (from the AI co-mathematician deployment).
- Energy, carbon, and access.
- Machine-Assisted Proof — Tao. 2024.
- AI Will Become Mathematicians' Co-Pilot — Tao. Scientific American interview, 2024.
- AI Is Ready for Primetime in Math and Theoretical Physics — Tao. IPAM talk, OpenAI Academy, 2026.
- Mathematical methods and human thought in the age of AI — Klowden & Tao. Blackwell Companion to the Philosophy of Mathematics (to appear), 2026. — blog post
| Authors | Title | Scope | Year | Venue | Paper |
|---|---|---|---|---|---|
| Zhang et al. | The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers | MWP solvers (statistical → neural) | 2018 | arXiv | |
| Mukherjee & Garain | A Review of Methods for Automatic Understanding of Natural-Language Mathematical Problems | Pre-DL MWP understanding | 2008 | AI Review | Springer |
| Sundaram et al. | Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers | Neural MWP solvers | 2022 | arXiv | |
| Lu et al. | A Survey of Deep Learning for Mathematical Reasoning | Deep-learning era MWP/proving/geometry | 2023 | ACL | |
| Ahn et al. | Large Language Models for Mathematical Reasoning: Progresses and Challenges | LLM-era reasoning | 2024 | EACL SRW | |
| Liu et al. | Mathematical Language Models: A Survey | Math-specialised LLMs and pre-training | 2025 | ACM Comp. Surveys | DOI |
| Wang et al. | A Survey on Large Language Models for Mathematical Reasoning | LLM reasoning techniques (CoT, RL, tools) | 2026 | ACM Comp. Surveys | DOI |
| Yang et al. | Formal Mathematical Reasoning: A New Frontier in AI | Formal proving with LMs | 2024 | arXiv | |
| Li et al. | A Survey on Deep Learning for Theorem Proving | Neural / RL-based theorem provers | 2024 | COLM | |
| Yan et al. | A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model | Multimodal math (MLLM) | 2025 | Findings of ACL | DOI |
| Guo et al. | Large Language Model based Multi-Agents: A Survey of Progress and Challenges | Multi-agent LLM systems (incl. math) | 2024 | arXiv |
If this survey or this companion list helps your research, please cite:
@article{raiyan2026ai4mathsurvey,
title = {Artificial Intelligence for Mathematical Reasoning:
An Integrated Survey of Language Models, Neuro-symbolic Systems,
and Verified Discovery},
author = {Raiyan, Syed Rifat and Kabir, Mohsinul and Mahmud, Hasan and Hasan, Md Kamrul},
journal = {arXiv preprint arXiv:2606.08728},
year = {2026}
}Contributions are welcome — please open a PR that:
- Adds the paper under the most specific category it fits.
- Uses the format
**[Title](pdf-url)** — Authors. *Venue Year.* - Keeps the table-of-contents anchors in sync.
- For new datasets, adds a row to the appropriate dataset table (with a
[PDF]link). - For new model results that update an SOTA cell in the Benchmark Saturation table, also note the protocol (Pass@1 / Pass@k / tree search / TTRL).
If you'd like to suggest a structural change (new section, re-categorisation, alternative ordering), open an issue first so we can discuss.
This repository's content (the curated list, summaries, and figures) is released under the MIT License. Linked papers retain their original copyrights.
