This repository provides the source code for the paper "Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification". The code is a minimalist implementation of sampling-based search using random sampling and self-verification. It is implemented for the Gemini model family through the Vertex AI service and includes utilities for running many batch inferences in parallel (e.g., 5 million+ conversations with 30,000 output tokens each).
⚠️ WARNING: Running this code will incur significant API costs due to the large-scale nature of the experiments.
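As a rough illustration of sampling-based search with self-verification, the sketch below samples several candidate answers, scores each one by averaging repeated verification scores, and returns the highest-scoring candidate. It is not the repository's implementation: `sampling_based_search`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy functions stand in for Gemini calls.

```python
from typing import Callable


def sampling_based_search(
    question: str,
    generate: Callable[[str, int], str],
    verify: Callable[[str, str, int], float],
    num_inf_seeds: int = 4,
    num_verif_seeds: int = 3,
) -> str:
    """Sample candidates, score each by averaged verification, return the best."""
    # Step 1: random sampling, one candidate per inference seed.
    candidates = [generate(question, seed) for seed in range(num_inf_seeds)]

    # Step 2: self-verification, averaging several scores per candidate.
    best_answer, best_score = candidates[0], float("-inf")
    for answer in candidates:
        scores = [verify(question, answer, seed) for seed in range(num_verif_seeds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_answer, best_score = answer, mean_score
    return best_answer


# Hypothetical stand-ins for model calls; a real run would query a model.
def toy_generate(question: str, seed: int) -> str:
    return ["41", "42", "42", "43"][seed % 4]


def toy_verify(question: str, answer: str, seed: int) -> float:
    return 1.0 if answer == "42" else 0.0


result = sampling_based_search("What is 6 * 7?", toy_generate, toy_verify)
print(result)
```

The parameter names mirror the `num_inf_seeds` and `num_verif_seeds` flags used by the scripts, but the mapping to the scripts' internal behavior is an assumption.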
- Python 3.9+
- Google Cloud project with Vertex AI API enabled
- Service account with Vertex AI permissions
- Google Cloud SDK
- Clone the repository:
git clone https://github.com/google-research/google-research/sampling-based-search.git
cd sampling-based-search
- Install dependencies:
pip install -r requirements.txt
- Set up Google Cloud authentication:
# Configure gcloud with your project
gcloud config set project YOUR_PROJECT_ID
# Authenticate application default credentials
gcloud auth application-default login
- Environment Configuration:
# Create .env file in project root
touch .env
# Add the following required environment variables to your .env file:
OPENAI_API_KEY="your-openai-api-key"
HUGGING_FACE_TOKEN="your-huggingface-token"
GCP_PROJECT_ID="your-gcp-project-id"
GCP_PROJECT_NUM="your-gcp-project-number"
REGION="your-gcp-region" # e.g., us-central1
BUCKET_NAME="your-gcp-bucket-name" # e.g., "batch_resources"

The codebase provides two main scripts for running experiments, available in both Python script and Jupyter notebook formats. The notebooks contain the same functionality with configuration options available in the setup cells.
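Before launching either script, it can help to confirm that the environment variables listed above are actually set. The sketch below is a minimal check, assuming the values from `.env` have already been loaded into the process environment (how the repository loads them is not shown here); `missing_env_vars` is a hypothetical helper, not part of the codebase.

```python
import os
from typing import List, Mapping

# Variable names taken from the .env listing in this README.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "HUGGING_FACE_TOKEN",
    "GCP_PROJECT_ID",
    "GCP_PROJECT_NUM",
    "REGION",
    "BUCKET_NAME",
]


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "GCP_PROJECT_ID": "my-project",
    "REGION": "us-central1",
    "BUCKET_NAME": "",  # blank values count as missing
}
missing = missing_env_vars(example_env)
print("Missing variables:", missing)
```

Calling `missing_env_vars()` with no argument checks the current process environment instead.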
To generate predictions, run the script version:
python create_predictions_minimal.py \
--model=publishers/google/models/gemini-1.5-pro-002 \
--num_inf_seeds=200 \
--version=publicv1 \
--dataset_name=aime \
--inf_temp=1.5 \
--minibatch_size=4000 \
--scoring_model=publishers/google/models/gemini-1.5-pro-002 \
--num_scoring_seeds=5

Key flags:
- --model: Model to use for generating predictions
- --problem_cutoff: Maximum number of problems to evaluate
- --num_inf_seeds: Number of inference seeds for sampling
- --version: Version identifier for the run
- --dataset_name: Dataset to use (aime/livebench/math)
- --inf_temp: Temperature for inference sampling
- --minibatch_size: Size of batches for processing
- --scoring_model: Model to use for scoring
- --num_scoring_seeds: Number of seeds for scoring
Advanced flags:
- --split: Dataset split to use (default: "test")
- --benchmark_seed: Random seed for benchmark (default: 0)
- --run_fast: Enable fast mode with reduced computation
- --roundrobin_enabled: Enable round-robin for API calls
- --allow_recovery: Allow recovery calls for missing cache queries
The notebook version (create_predictions.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
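To illustrate what a batch-size setting such as `--minibatch_size` typically controls, the sketch below splits a list of queries into consecutive minibatches. This assumes the flag caps the number of queries handled per batch; the scripts' actual batching logic may differ, and `iter_minibatches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_minibatches(items: Sequence[T], minibatch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most minibatch_size items."""
    if minibatch_size <= 0:
        raise ValueError("minibatch_size must be positive")
    for start in range(0, len(items), minibatch_size):
        yield list(items[start:start + minibatch_size])


# Ten placeholder queries split into batches of at most four.
queries = [f"query-{i}" for i in range(10)]
batches = list(iter_minibatches(queries, minibatch_size=4))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

The last batch is simply shorter when the number of queries is not a multiple of the batch size.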
To verify the generated predictions, run the script version:
python create_verifications_minimal.py \
--model=publishers/google/models/gemini-1.5-flash-002 \
--dataset_name=aime \
--loaded_num_inf_seeds=200 \
--num_inf_seeds=200 \
--num_verif_seeds=50 \
--inf_temp=1.5 \
--verif_temp=1 \
--use_comparisons=True \
--comparison_targets=100 \
--minibatch_size=1000 \
--use_baby=True \
--use_extended_rewrite=True \
--scoring_model=publishers/google/models/gemini-1.5-pro-002

Key flags:
- --model: Model to use for verification
- --dataset_name: Dataset to verify
- --loaded_problem_cutoff: Problem cutoff from loaded predictions
- --loaded_num_inf_seeds: Number of inference seeds from loaded predictions
- --num_inf_seeds: Number of inference seeds for verification
- --num_verif_seeds: Number of verification seeds per sample
- --inf_temp: Temperature for inference
- --verif_temp: Temperature for verification
- --use_comparisons: Enable comparison between answers
- --comparison_targets: Number of comparison targets
- --use_baby: Enable preliminary filtering
- --use_extended_rewrite: Enable extended answer rewriting
- --baby_threshold: Threshold for discarding model answers (default: 0.2)
- --max_answers_per_final: Max answers to keep per final answer (default: 15)
Advanced flags:
- --loaded_version: Version of loaded predictions (default: "publicv1")
- --loaded_benchmark_seed: Benchmark seed of loaded predictions
- --split: Dataset split to use (default: "test")
- --region: GCP region (default: "us-central1")
- --roundrobin_enabled: Enable round-robin for API calls
- --allow_recovery: Allow recovery from failures
- --run_fast: Enable fast mode
- --minibatch_size: Size of processing batches
The notebook version (create_verifications.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
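The preliminary filtering controlled by `--use_baby`, `--baby_threshold`, and `--max_answers_per_final` can be pictured with the sketch below. It rests on two assumptions that this README does not confirm: that each sampled answer carries a preliminary score compared against the threshold, and that the highest-scoring answers are the ones kept within each group of identical final answers. `Candidate` and `filter_candidates` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    final_answer: str    # extracted final answer of a sampled response
    prelim_score: float  # assumed preliminary score in [0, 1]


def filter_candidates(
    candidates: List[Candidate],
    baby_threshold: float = 0.2,
    max_answers_per_final: int = 15,
) -> List[Candidate]:
    """Drop low-scoring candidates, then cap how many share a final answer."""
    kept_by_answer: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        # Assumption: scores below the threshold are discarded.
        if cand.prelim_score < baby_threshold:
            continue
        kept_by_answer[cand.final_answer].append(cand)

    kept: List[Candidate] = []
    for group in kept_by_answer.values():
        # Assumption: the highest preliminary scores are kept first.
        group.sort(key=lambda c: c.prelim_score, reverse=True)
        kept.extend(group[:max_answers_per_final])
    return kept


samples = [
    Candidate("42", 0.9),
    Candidate("42", 0.5),
    Candidate("42", 0.3),
    Candidate("7", 0.1),
    Candidate("13", 0.25),
]
kept = filter_candidates(samples, baby_threshold=0.2, max_answers_per_final=2)
print(kept)
```

Capping the number of responses per final answer limits how many verification queries are spent on a single popular answer.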
Replicating the paper's results from scratch requires re-running a large number of costly queries. We provide pre-computed artifacts for the results presented in Section 1 and Section 2. To use the pre-computed artifacts, follow these steps.
- Configure your GCP bucket:
  - Create a new GCS bucket in your project
  - Set the BUCKET_NAME environment variable in your .env file to point to your bucket
- Use the notebook versions for replication:
  - notebooks/create_predictions.ipynb: Use these parameters:
    PROBLEM_CUTOFF = 500
    NUM_INF_SEEDS = 200
    VERSION = "publicv1"
    DATASET_NAME = "livebench"
    INF_TEMP = 1.5
    MINIBATCH_SIZE = 4000
    NUM_SCORING_SEEDS = 5
    MODEL = "publishers/google/models/gemini-1.5-pro-002"
    SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"
    where you can also set DATASET_NAME to be "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning).
  - notebooks/create_verifications.ipynb: Use these parameters:
    DATASET_NAME = "aime"
    LOADED_PROBLEM_CUTOFF = 500
    LOADED_NUM_INF_SEEDS = 200
    NUM_INF_SEEDS = 200
    NUM_VERIF_SEEDS = 50
    INF_TEMP = 1.5
    VERIF_TEMP = 1
    USE_COMPARISONS = True
    COMPARISON_TARGETS = 100
    MINIBATCH_SIZE = 1000
    USE_BABY = True
    USE_EXTENDED_REWRITE = True
    MODEL = "publishers/google/models/gemini-1.5-flash-002"
    SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"
    where you can also set DATASET_NAME to be "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning). If you set DATASET_NAME = "aime", you should also set USE_BABY = False, as preliminary scoring is not used on AIME.
Note: The notebook versions include additional statistics and visualizations not available in the script versions. The Python scripts will not re-use the precomputed artifacts in the bucket.
⚠️ WARNING: If the notebooks cannot find the precomputed artifacts in your bucket, they will send fresh queries to the Vertex AI service. Disable inference or set a small budget to avoid unexpected costs.
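To get a feel for the scale involved before enabling inference, the sketch below estimates query counts from the replication parameters. It assumes one generation query per (problem, inference seed) and one verification query per (response, verification seed), and it ignores scoring, comparison, and rewriting queries, so the real total is likely higher. `estimate_query_counts` is a hypothetical helper.

```python
from typing import Dict


def estimate_query_counts(
    num_problems: int, num_inf_seeds: int, num_verif_seeds: int
) -> Dict[str, int]:
    """Lower-bound estimate of queries for generation plus verification."""
    # One sampled response per problem and inference seed.
    generation = num_problems * num_inf_seeds
    # Each sampled response is verified once per verification seed.
    verification = generation * num_verif_seeds
    return {
        "generation": generation,
        "verification": verification,
        "total": generation + verification,
    }


# Values taken from the notebook parameters listed above.
counts = estimate_query_counts(num_problems=500, num_inf_seeds=200, num_verif_seeds=50)
print(counts)
```

Under these assumptions, 500 problems with 200 inference seeds and 50 verification seeds already amount to about 5.1 million queries for a single dataset.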
If you use this code in your research, please cite:
@article{SSSGoogle2025,
author = {Eric Zhao and Pranjal Awasthi and Sreenivas Gollapudi},
title = {Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification},
year = {2025},
url = {https://github.com/google-research/google-research/sampling-based-search}
}

This project is licensed under the MIT License - see the LICENSE file for details.