
Can We Predict Before Executing Machine Learning Agents?

License: MIT


This repository contains the official implementation and data for the paper "Can We Predict Before Executing Machine Learning Agents?".

Traditional machine learning agents rely on an iterative "Generate-Execute-Feedback" loop and therefore suffer from a severe Execution Bottleneck. This raises our central question: Can we compress hours of physical execution into seconds of logical inference?

To answer this, we introduce Data-centric Solution Preference: predicting solution performance without physical execution by leveraging "Implicit Execution Priors". We operationalize this idea in FOREAGENT, an agent that employs a Predict-then-Verify loop to decouple exploration from execution.

Our work demonstrates that LLMs exhibit significant predictive capabilities. FOREAGENT achieves a 6× acceleration, expands the search space by 3.2×, and yields a +6% performance gain over baselines. We also release a foundational dataset of verified execution trajectories.

📰 News

📖 Contents

🔍 Overview

To answer the titular question, we structured our research into three key phases, corresponding to the modules in this repository:

  1. Corpus Construction (prepare_bench_subset/) 🏗️

    • We defined the Data-centric Solution Preference task and constructed a large-scale corpus of 18,438 pairwise comparisons derived from real-world execution trajectories (AIDE & AutoMind) on MLE-bench.
    • This module handles solution sampling, cleaning, execution, and ground-truth generation.
    • 👉 prepare_bench_subset/README.md
  2. Predictive Evaluation (grade/) 📊

    • We extensively evaluated LLMs (e.g., DeepSeek-V3.2, GPT-5.1) on this corpus to determine if they possess "Implicit Execution Priors".
    • We found that reasoning-optimized models significantly outperform random guessing and complexity heuristics.
    • 👉 grade/README.md
  3. FOREAGENT Application (mle-bench/) 🤖

    • Leveraging these findings, we developed FOREAGENT, an autonomous agent that integrates the "Predict-then-Verify" loop.
    • By using the LLM as an implicit world model to prune the search space, it decouples exploration from expensive physical execution (a minimal sketch of this loop follows this list).
    • 👉 mle-bench/README.md
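
To make the Predict-then-Verify idea concrete, the sketch below shows one minimal way such a loop could be wired together, assuming an OpenAI-compatible endpoint configured as in the Environment Setup section below. `generate_candidates` and `execute` are hypothetical callables, and the whole snippet is illustrative only; the actual FOREAGENT implementation lives in mle-bench/ and differs in detail.

```python
# Minimal, illustrative Predict-then-Verify sketch (NOT the actual FOREAGENT code).
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
                base_url=os.environ.get("OPENAI_BASE_URL"))
MODEL = os.environ.get("OPENAI_MODEL", "DeepSeek-V3.2")

def predict_preference(task_desc: str, solution_a: str, solution_b: str) -> str:
    """Ask the LLM which solution it expects to score higher, without running either."""
    prompt = (
        f"Task description:\n{task_desc}\n\n"
        f"Solution A:\n{solution_a}\n\nSolution B:\n{solution_b}\n\n"
        "Which solution will achieve the better evaluation score? Answer 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return "A" if "A" in resp.choices[0].message.content.strip().upper()[:3] else "B"

def predict_then_verify(task_desc, generate_candidates, execute, rounds=5):
    """Explore many candidates cheaply via prediction; execute only the predicted best."""
    best_solution, best_score = None, float("-inf")
    for _ in range(rounds):
        candidates = generate_candidates(best_solution)   # cheap: no execution
        champion = candidates[0]
        for challenger in candidates[1:]:                  # predicted pairwise tournament
            if predict_preference(task_desc, champion, challenger) == "B":
                champion = challenger
        score = execute(champion)                          # expensive: run only the winner
        if score > best_score:                             # verify against the real score
            best_solution, best_score = champion, score
    return best_solution, best_score
```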

🚀 Results

1. Feasibility of Run-Free Preference

Our experiments on the curated corpus demonstrate that LLMs exhibit significant predictive capabilities, transcending simple heuristics.

| Model | Accuracy | vs. Random (50%) | vs. Heuristic (50.8%) |
| --- | --- | --- | --- |
| DeepSeek-V3.2-Thinking | 61.5% | ✅ Significant | ✅ Significant |
| GPT-5.1 | 58.8% | ✅ Significant | ✅ Significant |

Stratified pairwise accuracy results showing that LLMs derive valid signals from static inputs through genuine reasoning.
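
For clarity, pairwise accuracy here is the fraction of solution pairs for which the predicted winner matches the winner determined by actually executing both solutions; a predictor with no signal lands near the 50% random baseline. A minimal sketch, with hypothetical field names (the evaluation scripts in grade/ define the actual format):

```python
def pairwise_accuracy(pairs):
    """pairs: iterable of dicts holding the model's predicted winner and the
    winner determined by actually executing both solutions."""
    pairs = list(pairs)
    correct = sum(p["predicted_winner"] == p["executed_winner"] for p in pairs)
    return correct / len(pairs)

# A predictor with no signal sits near 0.5; values such as 0.615 therefore
# reflect a real predictive signal rather than chance.
print(pairwise_accuracy([
    {"predicted_winner": "A", "executed_winner": "A"},
    {"predicted_winner": "B", "executed_winner": "A"},
]))  # 0.5
```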

2. FOREAGENT Performance

Integrating this predictive mechanism into the FOREAGENT application yields substantial efficiency and performance gains:

  • 6× Speedup: Converges to peak performance using only 1/6 of the execution time compared to baselines.
  • 🔍 3.2× Broader Search: Explores significantly more solution candidates within the same time budget.
  • 📈 +6% Beat Ratio: Achieves a higher win rate against human leaderboard contestants across diverse AI4Science tasks (e.g., Biology, Physics, Geoscience).

📦 Data & Runtime

We provide the curated solution corpora, analysis subsets, agent-run trajectories, and task resources used in our experiments.

📥 Download Data (Google Drive)

The data package includes (but is not limited to):

  • solutions_all/: The full unfiltered solution corpus.
  • solutions_subset_50/: The main experiment subset (capped at 50 solutions per task).
  • agent_runs/: Full trajectories and logs from AIDE and ForeAgent runs.
  • tasks/: Shared data hub containing competition data, descriptions, and analysis reports.

For a comprehensive description, please refer to the data/README.md file included in the downloaded package.

🛠️ Environment Setup

1. Corpus Construction & Prediction Benchmark

To build the corpus or run the prediction benchmark, please set up the environment as follows:

```bash
conda create -n pbe python=3.10
conda activate pbe
pip install -r requirement.txt

# Set environment variables
export OPENAI_API_KEY="<your_openai_api_key>"
export OPENAI_BASE_URL="<your_openai_base_url>"
export OPENAI_MODEL="DeepSeek-V3.2"
```
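
As a quick sanity check that these variables are picked up, a one-shot call along the following lines should succeed before running the benchmark scripts (this assumes the standard `openai` Python client, which `requirement.txt` is expected to install; adjust if your setup differs):

```python
import os
from openai import OpenAI

# The client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment by default.
client = OpenAI()
resp = client.chat.completions.create(
    model=os.environ["OPENAI_MODEL"],
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
)
print(resp.choices[0].message.content)
```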

2. ForeAgent (Agent Execution)

Our agent implementation is based on MLE-bench. Please refer to mle-bench/README.md for detailed installation and execution instructions.

🙏 Acknowledgement

This project builds upon the following open-source projects, to whose authors we are grateful: MLE-bench, AIDE, and AutoMind.

👥 Contributors

We will provide long-term maintenance and bug fixes for this project. Issues and PRs are welcome!

📝 Citation

If you find this work useful in your research, please cite our paper:

```bibtex
@misc{zheng2026predictexecutingmachinelearning,
      title={Can We Predict Before Executing Machine Learning Agents?},
      author={Jingsheng Zheng and Jintian Zhang and Yujie Luo and Yuren Mao and Yunjun Gao and Lun Du and Huajun Chen and Ningyu Zhang},
      year={2026},
      eprint={2601.05930},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.05930},
}
```
