Sample, Scrutinize, and Scale: Code for Sampling-Based Search

This repository provides the source code for the paper Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification. The code is a minimalist implementation of sampling-based search using random sampling and self-verification. It targets the Gemini model family through the Vertex AI service and includes utilities for running many parallel batch inferences (e.g., 5 million+ conversations with 30,000 output tokens each).
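To make the idea concrete, here is a minimal sketch of sampling-based search with self-verification. It is not the repository's code: the `generate` and `verify` callables are hypothetical stand-ins for model calls, and the selection rule (highest fraction of passing verifications, earliest sample wins ties) is an assumption for illustration. Pairwise comparisons between candidates are left out.

```python
from typing import Callable, Tuple


def sampling_based_search(
    problem: str,
    generate: Callable[[str, int], str],      # hypothetical: (problem, seed) -> candidate answer
    verify: Callable[[str, str, int], bool],  # hypothetical: (problem, candidate, seed) -> verdict
    num_inf_seeds: int = 4,
    num_verif_seeds: int = 3,
) -> Tuple[str, float]:
    """Sample candidates, score each by repeated verification, return the best."""
    best_candidate, best_score = "", -1.0
    for inf_seed in range(num_inf_seeds):
        candidate = generate(problem, inf_seed)
        # Score = fraction of verification attempts that accept the candidate.
        votes = [verify(problem, candidate, s) for s in range(num_verif_seeds)]
        score = sum(votes) / num_verif_seeds
        if score > best_score:  # strict ">" keeps the earliest sample on ties
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


def toy_generate(problem: str, seed: int) -> str:
    # Toy stand-in: answers "42" on even seeds and "41" on odd seeds.
    return "42" if seed % 2 == 0 else "41"


def toy_verify(problem: str, candidate: str, seed: int) -> bool:
    # Toy stand-in: a verifier that accepts only "42".
    return candidate == "42"


print(sampling_based_search("toy problem", toy_generate, toy_verify))  # ('42', 1.0)
```

In a real run, both callables would wrap model queries, and the two seed counts would correspond to the --num_inf_seeds and --num_verif_seeds flags described under Usage.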

⚠️ WARNING: Running this code will incur significant API costs due to the large-scale nature of the experiments.

Requirements

Python 3.9+
Google Cloud project with Vertex AI API enabled
Service account with Vertex AI permissions
Google Cloud SDK

Installation

Clone the repository:

git clone https://github.com/google-research/google-research/sampling-based-search.git
cd sampling-based-search

Install dependencies:

pip install -r requirements.txt

Configuration

Set up Google Cloud authentication:

# Configure gcloud with your project
gcloud config set project YOUR_PROJECT_ID

# Authenticate application default credentials
gcloud auth application-default login

Configure environment variables:

# Create .env file in project root
touch .env

# Add the following required environment variables to your .env file:
OPENAI_API_KEY="your-openai-api-key"
HUGGING_FACE_TOKEN="your-huggingface-token"
GCP_PROJECT_ID="your-gcp-project-id"
GCP_PROJECT_NUM="your-gcp-project-number"
REGION="your-gcp-region"  # e.g., us-central1
BUCKET_NAME="your-gcp-bucket-name"  # e.g., "batch_resources"
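Before launching a run, it can help to confirm that the .env file defines every variable listed above. The following is a small standard-library sketch, not part of the repository: the helper names `parse_env_text`, `missing_vars`, and `check_env_file` are hypothetical, and the parser only handles the simple KEY="value" lines shown here. The repository itself may load the file differently.

```python
from pathlib import Path
from typing import Dict, List

# Variable names taken from the list above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "HUGGING_FACE_TOKEN",
    "GCP_PROJECT_ID",
    "GCP_PROJECT_NUM",
    "REGION",
    "BUCKET_NAME",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value or KEY="value" lines, skipping comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        rest = rest.strip()
        if rest.startswith('"'):
            # Quoted value: keep everything up to the closing quote.
            end = rest.find('"', 1)
            value = rest[1:end] if end != -1 else rest[1:]
        else:
            # Unquoted value: drop a trailing inline comment.
            value = rest.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_vars(values: Dict[str, str]) -> List[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


def check_env_file(path: str = ".env") -> List[str]:
    """Read a .env file and report which required variables are missing."""
    return missing_vars(parse_env_text(Path(path).read_text()))
```

Calling `check_env_file()` from the project root returns an empty list when all six variables are set.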

Usage

The codebase provides two main scripts for running experiments, available in both Python script and Jupyter notebook formats. The notebooks contain the same functionality with configuration options available in the setup cells.

1. Generate Predictions

Run the script version:

python create_predictions_minimal.py \
  --model=publishers/google/models/gemini-1.5-pro-002 \
  --num_inf_seeds=200 \
  --version=publicv1 \
  --dataset_name=aime \
  --inf_temp=1.5 \
  --minibatch_size=4000 \
  --scoring_model=publishers/google/models/gemini-1.5-pro-002 \
  --num_scoring_seeds=5

Key flags:

--model: Model to use for generating predictions
--problem_cutoff: Maximum number of problems to evaluate
--num_inf_seeds: Number of inference seeds for sampling
--version: Version identifier for the run
--dataset_name: Dataset to use (aime/livebench/math)
--inf_temp: Temperature for inference sampling
--minibatch_size: Size of batches for processing
--scoring_model: Model to use for scoring
--num_scoring_seeds: Number of seeds for scoring

Advanced flags:

--split: Dataset split to use (default: "test")
--benchmark_seed: Random seed for benchmark (default: 0)
--run_fast: Enable fast mode with reduced computation
--roundrobin_enabled: Enable round-robin for API calls
--allow_recovery: Allow recovery calls for missing cache queries

The notebook version (create_predictions.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
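As a rough illustration of how --num_inf_seeds and --minibatch_size interact, the sketch below builds one query per (problem, seed) pair and splits the queries into minibatches. This is an assumption about the batching scheme rather than the repository's implementation; `build_queries` and `minibatches` are hypothetical helpers, and the 30-problem count is only an example.

```python
from typing import Iterator, List, Sequence, Tuple


def build_queries(num_problems: int, num_inf_seeds: int) -> List[Tuple[int, int]]:
    """One query per (problem index, inference seed) pair."""
    return [(p, s) for p in range(num_problems) for s in range(num_inf_seeds)]


def minibatches(items: Sequence, minibatch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most minibatch_size items."""
    if minibatch_size <= 0:
        raise ValueError("minibatch_size must be positive")
    for start in range(0, len(items), minibatch_size):
        yield items[start:start + minibatch_size]


# Example: 30 problems with 200 seeds each, batched 4000 at a time.
queries = build_queries(num_problems=30, num_inf_seeds=200)
sizes = [len(batch) for batch in minibatches(queries, 4000)]
print(sizes)  # [4000, 2000]
```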

2. Verify Predictions

Run the script version:

python create_verifications_minimal.py \
  --model=publishers/google/models/gemini-1.5-flash-002 \
  --dataset_name=aime \
  --loaded_num_inf_seeds=200 \
  --num_inf_seeds=200 \
  --num_verif_seeds=50 \
  --inf_temp=1.5 \
  --verif_temp=1 \
  --use_comparisons=True \
  --comparison_targets=100 \
  --minibatch_size=1000 \
  --use_baby=True \
  --use_extended_rewrite=True \
  --scoring_model=publishers/google/models/gemini-1.5-pro-002

Key flags:

--model: Model to use for verification
--dataset_name: Dataset to verify
--loaded_problem_cutoff: Problem cutoff from loaded predictions
--loaded_num_inf_seeds: Number of inference seeds from loaded predictions
--num_inf_seeds: Number of inference seeds for verification
--num_verif_seeds: Number of verification seeds per sample
--inf_temp: Temperature for inference
--verif_temp: Temperature for verification
--use_comparisons: Enable comparison between answers
--comparison_targets: Number of comparison targets
--use_baby: Enable preliminary filtering
--use_extended_rewrite: Enable extended answer rewriting
--baby_threshold: Threshold for discarding model answers (default: 0.2)
--max_answers_per_final: Max answers to keep per final answer (default: 15)

Advanced flags:

--loaded_version: Version of loaded predictions (default: "publicv1")
--loaded_benchmark_seed: Benchmark seed of loaded predictions
--split: Dataset split to use (default: "test")
--region: GCP region (default: "us-central1")
--roundrobin_enabled: Enable round-robin for API calls
--allow_recovery: Allow recovery from failures
--run_fast: Enable fast mode
--minibatch_size: Size of processing batches

The notebook version (create_verifications.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
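The interaction between --baby_threshold and --max_answers_per_final can be pictured with the sketch below. It rests on assumptions not confirmed by this README: that each candidate carries a preliminary score in [0, 1], that candidates scoring below the threshold are discarded, and that the cap keeps the highest-scoring candidates within each group of identical final answers. The `Candidate` type and `filter_candidates` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    final_answer: str    # the answer the response settles on
    prelim_score: float  # assumed preliminary score in [0, 1]


def filter_candidates(
    candidates: Iterable[Candidate],
    baby_threshold: float = 0.2,
    max_answers_per_final: int = 15,
) -> Dict[str, List[Candidate]]:
    """Drop low-scoring candidates, then cap each final-answer group."""
    groups: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        if cand.prelim_score < baby_threshold:
            continue  # assumed rule: discard anything below the threshold
        groups[cand.final_answer].append(cand)
    return {
        answer: sorted(group, key=lambda c: c.prelim_score, reverse=True)[:max_answers_per_final]
        for answer, group in groups.items()
    }
```

Under these assumptions, a final answer whose candidates all score below the threshold disappears from the result entirely, so it receives no further verification queries.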

Replication

Replicating the paper's results from scratch requires re-running a large and costly number of queries. We provide pre-computed artifacts for the results presented in Section 1 and Section 2. To use them, follow these steps.

Configure your GCP bucket:
- Create a new GCS bucket in your project
- Set the BUCKET_NAME environment variable in your .env file to point to your bucket

Use the notebook versions for replication:

notebooks/create_predictions.ipynb: Use these parameters:

PROBLEM_CUTOFF = 500
NUM_INF_SEEDS = 200
VERSION = "publicv1"
DATASET_NAME = "livebench"
INF_TEMP = 1.5
MINIBATCH_SIZE = 4000
NUM_SCORING_SEEDS = 5
MODEL = "publishers/google/models/gemini-1.5-pro-002"
SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"

where you can also set DATASET_NAME to be "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning).

notebooks/create_verifications.ipynb: Use these parameters:

DATASET_NAME = "aime"
LOADED_PROBLEM_CUTOFF = 500
LOADED_NUM_INF_SEEDS = 200
NUM_INF_SEEDS = 200
NUM_VERIF_SEEDS = 50
INF_TEMP = 1.5
VERIF_TEMP = 1
USE_COMPARISONS = True
COMPARISON_TARGETS = 100
MINIBATCH_SIZE = 1000
USE_BABY = True
USE_EXTENDED_REWRITE = True
MODEL = "publishers/google/models/gemini-1.5-flash-002"
SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"

DATASET_NAME can be set to "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning). When DATASET_NAME = "aime", also set USE_BABY = False, since preliminary scoring is not used on AIME.

Note: The notebook versions include additional statistics and visualizations not available in the script versions. The Python scripts will not re-use the precomputed artifacts in the bucket.

⚠️ WARNING: If the notebooks cannot find the precomputed artifacts in your bucket, they will send fresh queries to the Vertex AI service. Disable inference or set a small budget to avoid unexpected costs.
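To get a feel for the scale before enabling inference, the sketch below counts queries for a given configuration. It assumes one generation query per (problem, inference seed) pair and one verification query per (candidate, verification seed) pair, and it ignores scoring, comparison, and filtering queries, so treat the result as a rough estimate only. The function `estimate_query_counts` is hypothetical, and the 30-problem count is an example.

```python
from typing import Dict


def estimate_query_counts(
    num_problems: int, num_inf_seeds: int, num_verif_seeds: int
) -> Dict[str, int]:
    """Rough query counts under the assumptions stated above."""
    generation = num_problems * num_inf_seeds
    verification = generation * num_verif_seeds
    return {
        "generation": generation,
        "verification": verification,
        "total": generation + verification,
    }


# Example: 30 problems, NUM_INF_SEEDS = 200, NUM_VERIF_SEEDS = 50.
print(estimate_query_counts(30, 200, 50))
# {'generation': 6000, 'verification': 300000, 'total': 306000}
```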

Citation

If you use this code in your research, please cite:

@article{SSSGoogle2025,
  author    = {Eric Zhao and Pranjal Awasthi and Sreenivas Gollapudi},
  title     = {Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification},
  year      = {2025},
  url       = {https://github.com/google-research/google-research/sampling-based-search}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
