📄 Paper • 🤗 HF Paper • 📦 Data & Runtime • 𝕏 Blog • 📕 Xiaohongshu
This repository contains the official implementation and data for the paper "Can We Predict Before Executing Machine Learning Agents?".
Traditional machine learning agents rely on an iterative "Generate-Execute-Feedback" loop and therefore suffer from a severe Execution Bottleneck. This raises a natural question: can we compress hours of physical execution into seconds of logical inference?
To answer this, we introduce Data-centric Solution Preference—predicting solution performance without physical execution by leveraging "Implicit Execution Priors". We operationalize this in FOREAGENT, an agent employing a Predict-then-Verify loop to decouple exploration from execution.
Our work demonstrates that LLMs exhibit significant predictive capabilities. FOREAGENT achieves a 6× acceleration, expands the search space by 3.2×, and yields a +6% performance gain over baselines. We also release a foundational dataset of verified execution trajectories.
- [2026-02-08] We have released all code and data.
- [2026-01-09] We released our paper "Can We Predict Before Executing Machine Learning Agents?".
- 🔍 Overview
- 🚀 Results
- 📦 Data & Runtime
- 🛠️ Environment Setup
- 🙏 Acknowledgement
- 👥 Contributors
- 📝 Citation
## 🔍 Overview

To answer the titular question, we structured our research into three key phases, corresponding to the modules in this repository:
- **Corpus Construction** (`prepare_bench_subset/`) 🏗️
  - We defined the Data-centric Solution Preference task and constructed a large-scale corpus of 18,438 pairwise comparisons derived from real-world execution trajectories (AIDE & AutoMind) on MLE-bench.
  - This module handles solution sampling, cleaning, execution, and ground-truth generation.
  - 👉 [prepare_bench_subset/README.md](prepare_bench_subset/README.md)
- **Predictive Evaluation** (`grade/`) 📊
  - We extensively evaluated LLMs (e.g., DeepSeek-V3.2, GPT-5.1) on this corpus to determine if they possess "Implicit Execution Priors".
  - We found that reasoning-optimized models significantly outperform random guessing and complexity heuristics.
  - 👉 [grade/README.md](grade/README.md)
- **FOREAGENT Application** (`mle-bench/`) 🤖
  - Leveraging these findings, we developed FOREAGENT, an autonomous agent that integrates the "Predict-then-Verify" loop (see the sketch after this list).
  - By using the LLM as an implicit world model to prune the search space, it decouples exploration from expensive physical execution.
  - 👉 [mle-bench/README.md](mle-bench/README.md)
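To make the Predict-then-Verify idea concrete, here is a minimal Python sketch of such a loop. It is illustrative only: `budget` and the function arguments (`propose_solutions`, `predict_preference`, `execute`) are hypothetical placeholders, not the actual FOREAGENT API.

```python
# Illustrative sketch of a Predict-then-Verify loop.
# `budget`, `propose_solutions`, `predict_preference`, and `execute` are
# hypothetical placeholders, not the actual FOREAGENT interfaces.

def predict_then_verify(task, budget, propose_solutions, predict_preference, execute):
    """Explore candidates via cheap LLM prediction; execute only predicted winners."""
    best_solution, best_score = None, float("-inf")
    while budget.remaining() > 0:
        # Exploration (seconds): rank candidates by the LLM's predicted
        # pairwise preference instead of physically running each one.
        candidates = propose_solutions(task, n=8)
        favorite = candidates[0]
        for challenger in candidates[1:]:
            if predict_preference(task, challenger, favorite) == "challenger":
                favorite = challenger
        # Verification (hours): execute only the predicted winner.
        score = execute(favorite, budget)
        if score > best_score:
            best_solution, best_score = favorite, score
    return best_solution, best_score
```

Because the expensive `execute` step is reached only once per round of prediction, the agent can compare far more candidates per unit of wall-clock time, which is what drives the broader search reported below.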
## 🚀 Results

Our experiments on the curated corpus demonstrate that LLMs exhibit significant predictive capabilities, transcending simple heuristics.
| Model | Accuracy | vs. Random (50%) | vs. Heuristic (50.8%) |
|---|---|---|---|
| DeepSeek-V3.2-Thinking | 61.5% | ✅ Significant | ✅ Significant |
| GPT-5.1 | 58.8% | ✅ Significant | ✅ Significant |
*Stratified pairwise accuracy results, showing that LLMs derive valid signals from static inputs through genuine reasoning.*
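To see why these margins are statistically meaningful, note that even a few percentage points above chance is decisive at this sample size. A quick check with `scipy`, assuming (for illustration only) that the 61.5% accuracy is measured over the full corpus of 18,438 pairs; the paper's stratified splits may differ:

```python
from scipy.stats import binomtest

# Assumption for illustration: accuracy computed over all 18,438 pairwise
# comparisons (the actual stratified evaluation splits may be smaller).
n_pairs = 18438
n_correct = round(0.615 * n_pairs)  # DeepSeek-V3.2-Thinking at 61.5%

result = binomtest(n_correct, n_pairs, p=0.5, alternative="greater")
print(f"one-sided p-value vs. 50% random baseline: {result.pvalue:.3e}")
# Prints an astronomically small p-value: at this scale, 61.5% vs. 50%
# cannot be explained by chance.
```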
Integrating this predictive mechanism into the FOREAGENT application yields substantial efficiency and performance gains:
- ⚡ 6× Speedup: Converges to peak performance using only 1/6 of the execution time compared to baselines.
- 🔍 3.2× Broader Search: Explores significantly more solution candidates within the same time budget.
- 📈 +6% Beat Ratio: Achieves a higher win rate against human leaderboard contestants across diverse AI4Science tasks (e.g., Biology, Physics, Geoscience).
## 📦 Data & Runtime

We provide the curated solution corpora, analysis subsets, agent-run trajectories, and task resources used in our experiments.
📥 Download Data (Google Drive)
The data package includes (but is not limited to):
- `solutions_all/`: The full unfiltered solution corpus.
- `solutions_subset_50/`: The main experiment subset (capped at 50 solutions per task).
- `agent_runs/`: Full trajectories and logs from AIDE and ForeAgent runs.
- `tasks/`: Shared data hub containing competition data, descriptions, and analysis reports.
For a comprehensive description, please refer to the `data/README.md` file included in the downloaded package.
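As a quick orientation after downloading, you can inventory the package with a few lines of Python. The directory names follow the list above, but the per-task file layout is an assumption here; treat this as a sketch and consult `data/README.md` for the authoritative structure.

```python
from pathlib import Path

# Adjust to wherever you extracted the downloaded package.
DATA_ROOT = Path("data")

# Assumption: solutions_subset_50/ contains one subdirectory per task
# (see data/README.md for the actual layout). Count files per task in the
# main experiment subset (capped at 50 solutions per task, per the list above).
subset = DATA_ROOT / "solutions_subset_50"
for task_dir in sorted(p for p in subset.iterdir() if p.is_dir()):
    n_files = sum(1 for f in task_dir.rglob("*") if f.is_file())
    print(f"{task_dir.name}: {n_files} files")
```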
## 🛠️ Environment Setup

To build the corpus or run the prediction benchmark, please set up the environment as follows:
```bash
conda create -n pbe python=3.10
conda activate pbe
pip install -r requirement.txt

# Set environment variables
export OPENAI_API_KEY="<your_openai_api_key>"
export OPENAI_BASE_URL="<your_openai_base_url>"
export OPENAI_MODEL="DeepSeek-V3.2"
```

- Corpus Construction: Navigate to `prepare_bench_subset/`.
- Prediction Benchmark: Navigate to `grade/`.
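With the environment configured, a single pairwise preference query (the core operation of the prediction benchmark in `grade/`, and a concrete version of the `predict_preference` placeholder sketched in the Overview) might look like the following. This is a minimal sketch using the standard `openai` Python client against the endpoint configured above; the prompt wording is an illustrative assumption, not the exact prompt used in `grade/`.

```python
import os
from openai import OpenAI

# Reuses the environment variables set above; any OpenAI-compatible
# endpoint configured via OPENAI_BASE_URL should work.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_BASE_URL"],
)

def predict_preference(task_description: str, solution_a: str, solution_b: str) -> str:
    """Ask the LLM which solution will score higher, without executing either.

    Illustrative prompt only; the actual prompts live in grade/.
    """
    prompt = (
        f"Task:\n{task_description}\n\n"
        f"Solution A:\n{solution_a}\n\n"
        f"Solution B:\n{solution_b}\n\n"
        "Which solution will achieve the better score on this task? "
        "Answer with exactly 'A' or 'B'."
    )
    response = client.chat.completions.create(
        model=os.environ.get("OPENAI_MODEL", "DeepSeek-V3.2"),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Running this once also serves as a quick connectivity check that the API key, base URL, and model name are configured correctly.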
Our agent implementation is based on MLE-bench. Please refer to [mle-bench/README.md](mle-bench/README.md) for detailed installation and execution instructions.
## 🙏 Acknowledgement

This project builds upon the following open-source projects, and we are grateful to their authors:
- AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
- AIDE: AI-Driven Exploration in the Space of Code
- MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
## 👥 Contributors

We will provide long-term maintenance and bug fixes for this project. Issues and PRs are welcome!
## 📝 Citation

If you find this work useful in your research, please cite our paper:
```bibtex
@misc{zheng2026predictexecutingmachinelearning,
      title={Can We Predict Before Executing Machine Learning Agents?},
      author={Jingsheng Zheng and Jintian Zhang and Yujie Luo and Yuren Mao and Yunjun Gao and Lun Du and Huajun Chen and Ningyu Zhang},
      year={2026},
      eprint={2601.05930},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.05930},
}
```

