This repository contains the data-generation and LLM-testing code associated with the paper "ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark".
ASyMOB_Generation.py generates a diverse set of mathematical question variants from a seed CSV file. It leverages the sympy library for symbolic mathematics to create various perturbations of original questions, including symbolic, numeric, and equivalence-based transformations. The generated questions are then saved to a JSON file.
- Python 3.7+
- sympy library (pip install sympy)
- csv, json, re, random, itertools
- Prepare your seed data: Ensure you have a CSV file named Seed and Symbolic Questions.csv in the same directory as the script. This CSV should contain the seed mathematical questions, their maximal symbolic perturbations, and answers as SymPy expressions.
  The expected fields in Seed and Symbolic Questions.csv are:
  - Challenge: The mathematical question in LaTeX format, including assumptions regarding variables or other mathematical details.
  - Answer in LaTeX (optional): The answer to the question, represented as a LaTeX string.
  - Answer in Sympy: The answer to the question, represented as a SymPy expression string.
  - Variation: "Original" or "Symbolic".
  - Source: Identifies the origin of the question.
  - Category: Question type (e.g. Integrals, Limits, etc.).
- Run the script:
  python ASyMOB_Generation.py
- Output: The script will generate a JSON file named Full_ASyMOB_Dataset.json in the same directory. This file will contain all the original seed and symbolic questions along with their newly generated symbolic, numeric, and equivalence-based variants.
  The resulting fields in Full_ASyMOB_Dataset.json are:
  - Index: Sequential ID.
  - Challenge: The mathematical question in LaTeX format, including assumptions regarding variables or other mathematical details.
  - Answer in Sympy: The answer to the question, represented as a SymPy expression string.
  - Variation: e.g., Equivalence-All-Hard, Numeric-One-3, etc.
  - Source: Same as the seed question from which this variation originated.
  - Category: Same as the seed question from which this variation originated.
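Before running the generator, it can help to sanity-check the seed file against the field list above. The following is a minimal sketch, not part of the repository: the function name check_seed_rows and the two sample rows are hypothetical, and the checks only cover what the field descriptions state (required columns present, a non-empty SymPy answer, and Variation being "Original" or "Symbolic").

```python
import csv
import io

# Column names as listed above; "Answer in LaTeX" is optional and not required here.
REQUIRED_FIELDS = ["Challenge", "Answer in Sympy", "Variation", "Source", "Category"]
ALLOWED_VARIATIONS = {"Original", "Symbolic"}


def check_seed_rows(csv_file):
    """Return a list of human-readable problems found in a seed CSV stream."""
    reader = csv.DictReader(csv_file)
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for row_no, row in enumerate(reader, start=1):
        if not row["Answer in Sympy"].strip():
            problems.append(f"row {row_no}: empty 'Answer in Sympy'")
        if row["Variation"] not in ALLOWED_VARIATIONS:
            problems.append(f"row {row_no}: unexpected Variation {row['Variation']!r}")
    return problems


# Hypothetical two-row sample standing in for the real seed file.
sample = io.StringIO(
    "Challenge,Answer in Sympy,Variation,Source,Category\n"
    '"\\int x \\, dx",x**2/2,Original,example,Integrals\n'
    '"\\int A x \\, dx",A*x**2/2,Symbolic,example,Integrals\n'
)
print(check_seed_rows(sample))  # an empty list means no problems were found
```

To run the same check on the real file, pass an open file handle instead of the in-memory sample, for example open("Seed and Symbolic Questions.csv", newline=""); the file's text encoding is not stated here, so you may need to supply one.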
- Seed and Symbolic Questions.csv: Modify this CSV to add new seed questions or adjust existing ones.
- symnoise_char_list, symnoise_sym_list: Adjust the lists of symbolic characters and their SymPy representations if your questions use different symbols for perturbation (ASyMOB uses 'A', 'B', 'F', 'G', 'H' by default).
- equivalent_forms_easy, equivalent_forms_hard: Add or modify the equivalent forms to introduce different types of mathematical equivalences.
- noise_digits and reps_num: In generate_NA2S, you can change noise_digits to control the range of random numbers used for numeric perturbations and reps_num to control the number of repetitions for each item.
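To illustrate how noise_digits and reps_num might interact, here is a rough sketch of a numeric perturbation. It is not the code in generate_NA2S: the function numeric_variants is hypothetical, it operates on the LaTeX string only, and it assumes that noise_digits means the number of digits of each random integer and that reps_num is the number of variants produced per question.

```python
import random
import re


def numeric_variants(challenge, symbols, noise_digits, reps_num, seed=0):
    """Replace each standalone symbol with a random integer of noise_digits
    digits, producing reps_num variants of the challenge string."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    low, high = 10 ** (noise_digits - 1), 10 ** noise_digits - 1
    variants = []
    for _ in range(reps_num):
        text = challenge
        for sym in symbols:
            value = rng.randint(low, high)
            # Match the symbol only when it is not part of a longer word
            # or a LaTeX command such as \Gamma.
            pattern = rf"(?<![A-Za-z\\]){re.escape(sym)}(?![A-Za-z])"
            text = re.sub(pattern, str(value), text)
        variants.append(text)
    return variants


# Example with two of the default perturbation symbols.
for variant in numeric_variants(r"\int A x^2 + B \, dx", ["A", "B"],
                                noise_digits=3, reps_num=2):
    print(variant)
```

A real implementation would also need to apply the same substitution to the SymPy answer so that question and answer stay consistent; that step is omitted here.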
The ASyMOB evaluation pipeline is a set of Python scripts that query different LLMs to solve the generated questions, then validate the answers both symbolically and numerically. The pipeline consists of the following scripts:
- Answer collection scripts: These scripts collect answers from the LLMs for the questions generated in the previous step, using different APIs.
  - collect_llm_answers.py: Uses the Responses or Completions API to query LLMs one question at a time. Implementations for different LLMs are provided in the interface files.
  - openai_send_batch_query.py: Uses the Batch API to query OpenAI's models in bulk. This implementation reduces execution time and costs.
  - openai_assistants.py: Uses the OpenAI Assistants API for querying. This allows forcing the LLM to spawn a Python interpreter (server side) and use it for question solving.
- Answer validation script:
  validate_answers_rowwise.py validates the collected answers using symbolic and numerical validation. Numeric validation relies on numerical substitution of the variables in the answers. A set of valid substitutions is generated using the create_subs.py script. Symbolic validation relies on the SymPy library to validate the answers symbolically.
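The two validation modes can be sketched with SymPy as follows. This is an illustrative outline under stated assumptions, not the logic of validate_answers_rowwise.py or create_subs.py: the function names, the sampling interval (1, 2), the number of trials, and the tolerance are all hypothetical choices.

```python
import random

import sympy as sp


def symbolic_match(candidate, reference):
    """True if SymPy can simplify the difference to exactly zero."""
    return sp.simplify(candidate - reference) == 0


def numeric_match(candidate, reference, trials=5, tol=1e-9, seed=0):
    """Compare both expressions at random points for every free symbol."""
    rng = random.Random(seed)
    symbols = sorted(candidate.free_symbols | reference.free_symbols, key=str)
    for _ in range(trials):
        # Assumed sampling range; a real pipeline must pick substitutions
        # that respect each question's assumptions on its variables.
        subs = {s: rng.uniform(1, 2) for s in symbols}
        ref_val = complex(reference.subs(subs).evalf())
        cand_val = complex(candidate.subs(subs).evalf())
        if abs(cand_val - ref_val) > tol * max(1.0, abs(ref_val)):
            return False
    return True


x, A = sp.symbols("x A")
reference = A * sp.sin(x) ** 2 + A * sp.cos(x) ** 2
candidate = A          # equivalent to the reference
wrong = A * x          # not equivalent

print(symbolic_match(candidate, reference), numeric_match(candidate, reference))
print(symbolic_match(wrong, reference), numeric_match(wrong, reference))
```

Running both checks is useful because they fail in different ways: simplification can miss a true equivalence that numeric sampling confirms, while numeric agreement at a few points is evidence rather than proof.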
The evaluation pipeline maintains a database of the answers collected and their validations. The database's SQL definition is provided in the db_schema.sql file.
Sample tables are provided in the sample_data folder. You can use these samples by uncommenting the relevant lines in the respective files.
To collect the answers, run:
python collect_llm_answers.py
To validate the answers, run:
python validate_answers_rowwise.py
The validation pipeline writes its output to the database. Its results are
exposed through the view pipeline_results, which has the following columns:
- challenge_id - The ID of the challenge in the benchmark dataset.
- variation - The variation type of the challenge, e.g., "Numeric-One-3", "Symbolic-All".
- source - The source of the challenge.
- true_answer_sympy - The correct answer to the challenge in SymPy format.
- model - The model used to answer the challenge, e.g., "o4-mini", "gpt-4o".
- code_execution - Indicates whether the prompt used incentivized the model to execute code or disallowed it. None indicates neither.
- full_answer - The full answer provided by the model, including any code execution output.
- final_answer_latex - The final answer provided by the model in LaTeX format, extracted from the full answer.
- tokens_used - The number of tokens used by the model to answer the query.
- symbolic_correct - Whether the answer was successfully validated symbolically.
- numeric_correct - Whether the answer was successfully validated numerically.
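Once results are in the view, per-model success rates can be computed with a simple aggregate query. The sketch below is a stand-in, not the repository's setup: it uses an in-memory SQLite table that mimics a few columns of pipeline_results, the rows are made-up placeholders rather than benchmark results, and it assumes symbolic_correct is stored as 0 or 1. The actual database engine and column types are defined in db_schema.sql and may differ.

```python
import sqlite3

# In-memory stand-in for the real database; only a subset of columns is modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pipeline_results (
        challenge_id INTEGER,
        variation TEXT,
        model TEXT,
        symbolic_correct INTEGER
    )
    """
)

# Placeholder rows for illustration only.
rows = [
    (1, "Original", "gpt-4o", 1),
    (2, "Symbolic-All", "gpt-4o", 0),
    (3, "Symbolic-All", "gpt-4o", 1),
    (1, "Original", "o4-mini", 1),
]
conn.executemany("INSERT INTO pipeline_results VALUES (?, ?, ?, ?)", rows)


def success_rates(connection):
    """Return (model, variation, row count, symbolic success rate) tuples."""
    query = """
        SELECT model, variation, COUNT(*) AS n, AVG(symbolic_correct) AS rate
        FROM pipeline_results
        GROUP BY model, variation
        ORDER BY model, variation
    """
    return connection.execute(query).fetchall()


for model, variation, n, rate in success_rates(conn):
    print(f"{model:8s} {variation:14s} n={n} rate={rate:.2f}")
```

The same GROUP BY can be extended with the code_execution column to compare prompts that encourage code execution against those that disallow it.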
If you use ASyMOB in your research, please cite the paper:
@misc{ASyMOB,
title={ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark},
author={Michael Shalyt and Rotem Elimelech and Ido Kaminer},
year={2025},
eprint={2505.23851},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.23851},
}