prepare_bench_subset is a toolkit for sampling a subset of solutions from the original unfiltered pool (solutions_all) and then running a series of preparation steps (cleaning, execution, grading, and grouping) to construct the Data-centric Solution Preference corpus.
It exposes several CLI utilities that together form a pipeline:

- 🍒 Subset extraction
  - `extract_subset.py`: select solutions according to a semantic keyword specification in `tasks.json`.
  - `extract_random_subset.py`: randomly sample up to N `.py` solutions per task.
- 🧹 (Optional) LLM-based cleanup / fixing
  - `clean.py`: compile-check and optionally auto-fix Python solutions using LLM-based tools.
- 🏃 Run solutions
  - `run.py`: execute solutions (single run or batched in Docker) to produce ground-truth `submission.csv` files.
- 📝 Grade submissions
  - `grade.py`: grade generated `submission.csv` files, either for a single file, a JSONL batch, or auto-discovered runs/solutions.
- ⚖️ Build evaluation groups
  - `group.py`: read `eval_output.json` scores and build n-way comparison groups, with optional balancing of the best-index positions.
The typical workflow is:
extract subset → (optionally) clean / fix solutions → run them → grade results (using a grader adapted from MLE-bench) → group solutions for evaluation.
Structure is key! A minimal expected layout looks like:
solutions_root/
├── <task_name>/
│ ├── annotation/
│ │ ├── annotations_semantic.json # Semantic tags
│ │ └── keywords_by_rank.json # Stats
│ ├── code/
│ │ ├── solution_*.py # Source code
│ │ └── submission_solution_*/ # Execution artifacts
│ │ ├── submission.csv
│ │ ├── exec_output.txt
│ │ └── eval_output.json # Scores
│ ├── ground_truth/
│ │ └── groups_<task_name>_n.json # Comparison pairs (The output)
│ ├── output/output_*.txt
│ └── report/
│ ├── alignment_*.json
│ └── grade_report_*.txt
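Before running any stage, it can help to confirm each task directory matches this layout. The following is a minimal sketch (the exact set of required entries is an assumption based on the tree above; adjust it for your stage of the pipeline):

```python
from pathlib import Path

# Entries expected under each <task_name> directory, per the layout above.
# Illustrative only: later stages add submission/eval artifacts.
EXPECTED = [
    "annotation/annotations_semantic.json",
    "annotation/keywords_by_rank.json",
    "code",
]

def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent under task_dir."""
    return [rel for rel in EXPECTED if not (task_dir / rel).exists()]

def check_layout(solutions_root: Path) -> dict[str, list[str]]:
    """Map each task directory name to its list of missing entries."""
    return {
        task_dir.name: missing_entries(task_dir)
        for task_dir in sorted(solutions_root.iterdir())
        if task_dir.is_dir()
    }
```

An empty list for a task means its layout looks complete; anything else names what is missing.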
This section explains what each stage in the pipeline does, which inputs it expects, and what artifacts it produces.
extract_subset.py selects solutions semantically according to the specification in tasks.json and the per-solution annotations, then copies the selected code, outputs, and (optionally) rich evaluation artifacts.
Semantic selection (driven by tasks.json):
python -m prepare_bench_subset.extract_subset \
--solutions-root /path/to/full_solutions \
--subset-root /path/to/subset_solutions \
--tasks-json /path/to/tasks.json

Random selection (up to N `.py` files per task):
python -m prepare_bench_subset.extract_random_subset \
--solutions_root /path/to/full_solutions \
--tasks_file /path/to/task_name.txt \
--out_root /path/to/random_subset_solutions \
--per_task 10 \
--seed 42

clean.py runs a multi-phase LLM-assisted pipeline over all Python files. It includes:
- Eval-time fixing (`eval_fix.py`): fixes solutions rejected during grading.
- GPU boosting rewrite (`gpu_rewrite.py`): rewrites applicable algorithms (e.g., LightGBM) to use the GPU.
- Runtime-error fixing (`runtime_fix.py`): fixes bugs based on execution logs.
- Compile & syntax checking (`compile_fix.py`): fixes syntax errors.
python -m prepare_bench_subset.clean \
--root /path/to/subset_solutions \
--workers 32 \
--max-depth 3 \
--eval-error-json /path/to/error_eval.json \
--gpu-boosting-kw-file /path/to/boosting_kw_algo.txt \
--runtime-log /path/to/runtime_logs.log \
--data-root tasks/data \
--verbose-log /path/to/verbose_logs.log

(Note: flags like --runtime-log are optional depending on which fixers you want to run.)
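The compile & syntax checking phase can be approximated with the standard library. A minimal sketch (not the actual `compile_fix.py` implementation) that flags solution files which fail to parse:

```python
import ast
from pathlib import Path

def find_broken_solutions(root: Path) -> list[tuple[Path, str]]:
    """Return (path, error message) pairs for solution files with syntax errors."""
    broken = []
    for py_file in sorted(root.rglob("solution_*.py")):
        try:
            # ast.parse raises SyntaxError without executing the code.
            ast.parse(py_file.read_text(encoding="utf-8"), filename=str(py_file))
        except SyntaxError as exc:
            broken.append((py_file, f"line {exc.lineno}: {exc.msg}"))
    return broken
```

Files reported here are the ones an LLM-based fixer would then attempt to repair.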
run.py executes solutions in Docker to produce submission.csv.
- Batch mode (`--batch`): for mass execution; requires `docker pull johnsonzheng03/predict-before-execute`.
- Single-run mode: for debugging one file.
python -m prepare_bench_subset.run \
--batch \
--solutions-root /path/to/subset_solutions \
--task-file /path/to/task_name.txt \
--dockerfile prepare_bench_subset/env/Dockerfile \
--data-dir tasks/data \
--max-parallel 8 \
--clean-links \
--clean-working \
--clean-workspace

grade.py evaluates submission.csv files and writes eval_output.json. It is adapted from MLE-bench.
auto-grade: Automatically finds submissions under a solution root or agent run directory.
python -m prepare_bench_subset.grade auto-grade \
--task-list /path/to/task_name.txt \
--solutions-dir /path/to/subset_solutions \
--data-dir tasks/data \
--competitions-dir tasks/competitions \
--workers 64 \
--error-report /path/to/error_report.json \
--allow-zero-score

group.py converts scores into n-way comparison groups (the ground truth) for the preference task. It filters out invalid scores and, with --balanced, evens out where the "winner" appears across groups.
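The grouping and balancing logic can be sketched roughly as follows. This is a simplified illustration, not `group.py` itself: it chunks shuffled, validly scored solutions into fixed-size groups and rotates each group so the winner's position cycles evenly.

```python
import random

def build_groups(scored, group_size=2, is_lower_better=False, seed=42):
    """Build n-way comparison groups with best-index positions balanced.

    scored: list of (path, score) pairs; entries with score None are dropped.
    """
    rng = random.Random(seed)
    items = [it for it in scored if it[1] is not None]  # filter invalid scores
    rng.shuffle(items)
    # Sort key: smaller is better when is_lower_better, else larger is better.
    key = (lambda s: s) if is_lower_better else (lambda s: -s)
    groups = []
    for i in range(0, len(items) - group_size + 1, group_size):
        chunk = items[i:i + group_size]
        best = min(range(group_size), key=lambda j: key(chunk[j][1]))
        # Cycle the winner's target position so each slot hosts it equally often.
        target = len(groups) % group_size
        chunk = chunk[best:] + chunk[:best]  # winner now at index 0
        if target:
            chunk = chunk[-target:] + chunk[:-target]  # move winner to target
        ranking = sorted(range(group_size), key=lambda j: key(chunk[j][1]))
        groups.append({
            "paths": [p for p, _ in chunk],
            "scores": [s for _, s in chunk],
            "best_index": target,
            "full_ranking": ranking,
            "is_lower_better": is_lower_better,
        })
    return groups
```

Each emitted dict mirrors the group format shown below, with `full_ranking` listing indices from best to worst.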
Example Group Format:
[
{
"paths": ["path/to/sol_A.py", "path/to/sol_B.py"],
"scores": [0.85, 0.92],
"best_index": 1,
"full_ranking": [1, 0],
"is_lower_better": false
}
]

Build Command:
python -m prepare_bench_subset.group \
--task-file /path/to/task_name.txt \
--solutions-root /path/to/subset_solutions \
--group-size 2 \
--balanced \
--seed 42

Below is a complete example of the pipeline:
# 1) Extract a semantic subset
python -m prepare_bench_subset.extract_subset \
--solutions-root /path/to/source_solutions \
--subset-root /path/to/subset_solutions \
--tasks-json /path/to/tasks.json
# 2) (Optional) Clean and auto-fix subset solutions
python -m prepare_bench_subset.clean \
--root /path/to/subset_solutions \
--data-root tasks/data
# 3) Run solutions in batch mode
python -m prepare_bench_subset.run \
--batch \
--solutions-root /path/to/subset_solutions \
--task-file /path/to/task_name.txt \
--dockerfile prepare_bench_subset/env/Dockerfile \
--data-dir tasks/data
# 4) Grade all submissions
python -m prepare_bench_subset.grade auto-grade \
--task-list /path/to/task_name.txt \
--solutions-dir /path/to/subset_solutions \
--data-dir tasks/data \
--competitions-dir tasks/competitions
# 5) Build A/B comparison groups (n=2)
python -m prepare_bench_subset.group \
--task-file /path/to/task_name.txt \
--solutions-root /path/to/subset_solutions \
--group-size 2 \
--balanced

This repository relies on shared config files:
- `/prepare_bench_subset/config/tasks.json`:
  - Defines semantic sampling quotas per task (e.g., "Sample 5 PyTorch solutions").
  - Consumed by `extract_subset.py`.
- `/prepare_bench_subset/config/task_name.txt`:
  - Plain-text list of tasks to process.
- `/tasks/data/`:
  - Stores prepared competition data (input for execution and grading).
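As an illustration only (the real schema is whatever `extract_subset.py` parses; the task and keyword names below are made up), a quota-style `tasks.json` could look like:

```json
{
  "some_task_name": {
    "pytorch": 5,
    "lightgbm": 3,
    "sklearn": 2
  }
}
```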
- Path issues: use absolute paths. Verify that `solutions_root` and `tasks/data` exist.
- Task name mismatch: ensure names in `task_name.txt` match directory names exactly.
- Docker not prepared: `run.py --batch` requires Docker. Pre-pull the image if needed.
- No `submission.csv`: check `exec_output.txt` or use `run.py --summary-buggy` to debug failed runs.
- Grading errors: verify that `--competitions-dir` points to the correct metadata location.
- I have a complete `solutions_root` (including `annotation/` and `code/`).
- My `tasks/data/<task>` directories exist.
- Task names in `task_name_subset.txt` match the directories.
- I can see `solution_*.py` under `out_root/<task>/code/`.
- `submission.csv` appears after running `run.py`.
- `eval_output.json` appears after running `grade.py`.
- `ground_truth/groups_<task>_n2.json` appears after running `group.py`.
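The pipeline-output items on this checklist can be verified mechanically. A minimal sketch, with file names taken from the layout above and `group-size 2` assumed:

```python
from pathlib import Path

def pipeline_artifacts(task_dir: Path) -> dict[str, bool]:
    """Check which per-task pipeline artifacts exist, per the layout above."""
    task = task_dir.name
    return {
        "solutions": any(task_dir.glob("code/solution_*.py")),
        "submissions": any(task_dir.glob("code/submission_solution_*/submission.csv")),
        "scores": any(task_dir.glob("code/submission_solution_*/eval_output.json")),
        "groups": (task_dir / "ground_truth" / f"groups_{task}_n2.json").exists(),
    }
```

A `False` entry points to the pipeline stage (`run.py`, `grade.py`, or `group.py`) that has not yet produced its artifact for that task.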