ericzhao28/sampling_based_search

Sample, Scrutinize, and Scale: Code for Sampling-Based Search

This repository provides the source code for the paper Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification. It is a minimalist implementation of sampling-based search using random sampling and self-verification, written for the Gemini model family through the Vertex AI service. It also includes utilities for running many batch inferences in parallel (e.g., 5 million+ conversations with 30,000 output tokens each).
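
To make the idea concrete, the following is a minimal sketch of sampling-based search with self-verification. It is not the repository's implementation: the function names, the mean-score selection rule, and the toy generate and verify stand-ins are assumptions made for illustration, and a real run would call a language model in their place.

```python
import random
from typing import Callable, List


def sampling_based_search(
    problem: str,
    generate: Callable[[str, int], str],
    verify: Callable[[str, str, int], float],
    num_samples: int,
    num_verifications: int,
) -> str:
    """Sample candidates, score each by repeated verification, return the best.

    Assumption: the best candidate is the one with the highest mean
    verification score. Other selection rules are possible.
    """
    if num_samples < 1 or num_verifications < 1:
        raise ValueError("num_samples and num_verifications must be >= 1")

    # Step 1 (sample): draw independent candidate responses.
    candidates: List[str] = [generate(problem, seed) for seed in range(num_samples)]

    # Step 2 (scrutinize): verify every candidate several times.
    best_candidate, best_score = candidates[0], float("-inf")
    for candidate in candidates:
        scores = [
            verify(problem, candidate, seed) for seed in range(num_verifications)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_candidate, best_score = candidate, mean_score
    return best_candidate


# Hypothetical stand-ins so the sketch runs without any API access.
def toy_generate(problem: str, seed: int) -> str:
    # Cycles deterministically through four candidate answers: 40, 41, 42, 43.
    return str(40 + seed % 4)


def toy_verify(problem: str, candidate: str, seed: int) -> float:
    # Noisy verifier that tends to score the answer "42" higher than the rest.
    rng = random.Random(f"{candidate}-{seed}")
    base = 0.9 if candidate == "42" else 0.2
    return base + rng.uniform(-0.05, 0.05)


answer = sampling_based_search(
    "toy problem", toy_generate, toy_verify, num_samples=8, num_verifications=5
)
```

Raising num_samples widens the pool of candidates, and raising num_verifications reduces the noise in each candidate's score. Both multiply the number of model calls.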

⚠️ WARNING: Running this code will incur significant API costs due to the large-scale nature of the experiments.

Requirements

  • Python 3.9+
  • Google Cloud project with Vertex AI API enabled
  • Service account with Vertex AI permissions
  • Google Cloud SDK

Installation

  1. Clone the repository:
git clone https://github.com/google-research/google-research/sampling-based-search.git
cd sampling-based-search
  2. Install dependencies:
pip install -r requirements.txt

Configuration

  1. Set up Google Cloud authentication:
# Configure gcloud with your project
gcloud config set project YOUR_PROJECT_ID

# Authenticate application default credentials
gcloud auth application-default login
  2. Environment Configuration:
# Create .env file in project root
touch .env

# Add the following required environment variables to your .env file:
OPENAI_API_KEY="your-openai-api-key"
HUGGING_FACE_TOKEN="your-huggingface-token"
GCP_PROJECT_ID="your-gcp-project-id"
GCP_PROJECT_NUM="your-gcp-project-number"
REGION="your-gcp-region"  # e.g., us-central1
BUCKET_NAME="your-gcp-bucket-name"  # e.g., "batch_resources"
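
Before starting a costly run, it can help to confirm that every variable listed above is set. The sketch below is one way to do that with the standard library only. The helper names parse_env_text and missing_vars are hypothetical and are not part of this repository, and the parser handles only simple KEY="value" lines like those shown above.

```python
from typing import Dict, List

# Variable names taken from the .env listing above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "HUGGING_FACE_TOKEN",
    "GCP_PROJECT_ID",
    "GCP_PROJECT_NUM",
    "REGION",
    "BUCKET_NAME",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY="value" lines, skipping blank lines and comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value.startswith('"'):
            # Quoted value: keep the text between the first pair of quotes,
            # which also drops any trailing inline comment.
            end = value.find('"', 1)
            if end != -1:
                value = value[1:end]
        else:
            # Unquoted value: drop a trailing inline comment, if any.
            value = value.split("#", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_vars(values: Dict[str, str], required: List[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


sample = 'REGION="us-central1"  # e.g., us-central1\nBUCKET_NAME="batch_resources"\n'
parsed = parse_env_text(sample)
still_missing = missing_vars(parsed)
```

To check a real file, pass the contents of .env to parse_env_text (for example, pathlib.Path(".env").read_text()) and stop if missing_vars returns a non-empty list.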

Usage

The codebase provides two main scripts for running experiments, available in both Python script and Jupyter notebook formats. The notebooks contain the same functionality with configuration options available in the setup cells.

1. Generate Predictions

Run the script version:

python create_predictions_minimal.py \
  --model=publishers/google/models/gemini-1.5-pro-002 \
  --num_inf_seeds=200 \
  --version=publicv1 \
  --dataset_name=aime \
  --inf_temp=1.5 \
  --minibatch_size=4000 \
  --scoring_model=publishers/google/models/gemini-1.5-pro-002 \
  --num_scoring_seeds=5

Key flags:

  • --model: Model to use for generating predictions
  • --problem_cutoff: Maximum number of problems to evaluate
  • --num_inf_seeds: Number of inference seeds for sampling
  • --version: Version identifier for the run
  • --dataset_name: Dataset to use (aime/livebench/math)
  • --inf_temp: Temperature for inference sampling
  • --minibatch_size: Size of batches for processing
  • --scoring_model: Model to use for scoring
  • --num_scoring_seeds: Number of seeds for scoring

Advanced flags:

  • --split: Dataset split to use (default: "test")
  • --benchmark_seed: Random seed for benchmark (default: 0)
  • --run_fast: Enable fast mode with reduced computation
  • --roundrobin_enabled: Enable round-robin for API calls
  • --allow_recovery: Allow recovery calls for missing cache queries

The notebook version (create_predictions.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
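
When the same script is run over several datasets or seed counts, it can be convenient to assemble the command line from a dictionary of flags. The sketch below shows one way to do so. The build_command helper is hypothetical; the flag names are the ones documented above, and the --name=value form matches the example command.

```python
import shlex
from typing import Dict, List, Union

FlagValue = Union[str, int, float, bool]


def build_command(script: str, flags: Dict[str, FlagValue]) -> List[str]:
    """Build an argv list of the form: python SCRIPT --name=value ..."""
    argv = ["python", script]
    for name, value in flags.items():
        # Booleans render as True/False, matching the --flag=True style
        # used in the example commands.
        argv.append(f"--{name}={value}")
    return argv


argv = build_command(
    "create_predictions_minimal.py",
    {"dataset_name": "aime", "num_inf_seeds": 200, "inf_temp": 1.5},
)
printable = shlex.join(argv)  # shell-safe string for logging or review
```

The resulting list can be passed to subprocess.run(argv, check=True). Printing the command and reviewing it first is a cheap safeguard, since each run sends paid queries.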

2. Verify Predictions

Run the script version:

python create_verifications_minimal.py \
  --model=publishers/google/models/gemini-1.5-flash-002 \
  --dataset_name=aime \
  --loaded_num_inf_seeds=200 \
  --num_inf_seeds=200 \
  --num_verif_seeds=50 \
  --inf_temp=1.5 \
  --verif_temp=1 \
  --use_comparisons=True \
  --comparison_targets=100 \
  --minibatch_size=1000 \
  --use_baby=True \
  --use_extended_rewrite=True \
  --scoring_model=publishers/google/models/gemini-1.5-pro-002

Key flags:

  • --model: Model to use for verification
  • --dataset_name: Dataset to verify
  • --loaded_problem_cutoff: Problem cutoff from loaded predictions
  • --loaded_num_inf_seeds: Number of inference seeds from loaded predictions
  • --num_inf_seeds: Number of inference seeds for verification
  • --num_verif_seeds: Number of verification seeds per sample
  • --inf_temp: Temperature for inference
  • --verif_temp: Temperature for verification
  • --use_comparisons: Enable comparison between answers
  • --comparison_targets: Number of comparison targets
  • --use_baby: Enable preliminary filtering
  • --use_extended_rewrite: Enable extended answer rewriting
  • --baby_threshold: Threshold for discarding model answers (default: 0.2)
  • --max_answers_per_final: Max answers to keep per final answer (default: 15)
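
As an illustration of how --baby_threshold and --max_answers_per_final might interact, the sketch below filters a list of sampled responses. It is an assumption about the semantics, not the repository's code. It supposes that each response carries a preliminary score in [0, 1], that responses scoring below the threshold are discarded, and that the highest-scoring survivors are then kept up to a cap per distinct final answer. The Candidate type and prefilter function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    response_id: int
    final_answer: str
    prelim_score: float  # hypothetical preliminary score in [0, 1]


def prefilter(
    candidates: List[Candidate],
    baby_threshold: float = 0.2,
    max_answers_per_final: int = 15,
) -> List[Candidate]:
    """Drop low-scoring candidates, then cap the count per final answer."""
    # Assumed rule 1: discard candidates scoring below the threshold.
    kept = [c for c in candidates if c.prelim_score >= baby_threshold]

    # Assumed rule 2: group by final answer and keep the top-scoring few.
    by_answer: Dict[str, List[Candidate]] = defaultdict(list)
    for c in kept:
        by_answer[c.final_answer].append(c)

    result: List[Candidate] = []
    for group in by_answer.values():
        group.sort(key=lambda c: c.prelim_score, reverse=True)
        result.extend(group[:max_answers_per_final])
    return result


pool = [
    Candidate(0, "42", 0.9),
    Candidate(1, "42", 0.5),
    Candidate(2, "42", 0.3),
    Candidate(3, "7", 0.1),
    Candidate(4, "7", 0.25),
]
survivors = prefilter(pool, baby_threshold=0.2, max_answers_per_final=2)
```

Under these assumptions, filtering reduces the number of responses that reach the more expensive verification stage while still keeping some representatives of each distinct final answer.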

Advanced flags:

  • --loaded_version: Version of loaded predictions (default: "publicv1")
  • --loaded_benchmark_seed: Benchmark seed of loaded predictions
  • --split: Dataset split to use (default: "test")
  • --region: GCP region (default: "us-central1")
  • --roundrobin_enabled: Enable round-robin for API calls
  • --allow_recovery: Allow recovery from failures
  • --run_fast: Enable fast mode
  • --minibatch_size: Size of processing batches

The notebook version (create_verifications.ipynb) provides the same functionality with these parameters configurable in the setup cells. The notebook also provides additional reporting on intermediate results.
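
Because the number of queries grows multiplicatively with the seed counts, a rough estimate before launching a run can prevent surprises. The sketch below assumes one generation query per (problem, inference seed) pair and one verification query per (response, verification seed) pair. It ignores comparisons, rewriting, scoring, and any savings from preliminary filtering, so treat the result as an order-of-magnitude figure only. The estimate_queries helper and the 30-problem dataset size are hypothetical.

```python
from typing import Dict


def estimate_queries(
    num_problems: int, num_inf_seeds: int, num_verif_seeds: int
) -> Dict[str, int]:
    """Rough query counts under the simplifying assumptions stated above."""
    generation = num_problems * num_inf_seeds
    verification = generation * num_verif_seeds
    return {
        "generation": generation,
        "verification": verification,
        "total": generation + verification,
    }


# Hypothetical 30-problem dataset with the seed counts from the example command.
estimate = estimate_queries(num_problems=30, num_inf_seeds=200, num_verif_seeds=50)
# estimate == {"generation": 6000, "verification": 300000, "total": 306000}
```

Multiplying the total by an expected token count per query and the current per-token price gives a first-pass cost figure.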

Replication

Replicating the paper's results from scratch requires a large and costly number of queries. We provide pre-computed artifacts for the results presented in Section 1 and Section 2. To use them, follow these steps.

  1. Configure your GCP bucket:

    • Create a new GCS bucket in your project
    • Set the BUCKET_NAME environment variable in your .env file to point to your bucket
  2. Use the notebook versions for replication:

    • notebooks/create_predictions.ipynb: Use these parameters:

      PROBLEM_CUTOFF = 500
      NUM_INF_SEEDS = 200
      VERSION = "publicv1"
      DATASET_NAME = "livebench"
      INF_TEMP = 1.5
      MINIBATCH_SIZE = 4000
      NUM_SCORING_SEEDS = 5
      MODEL = "publishers/google/models/gemini-1.5-pro-002"
      SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"

      where you can also set DATASET_NAME to be "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning).

    • notebooks/create_verifications.ipynb: Use these parameters:

      DATASET_NAME = "aime"
      LOADED_PROBLEM_CUTOFF = 500
      LOADED_NUM_INF_SEEDS = 200
      NUM_INF_SEEDS = 200
      NUM_VERIF_SEEDS = 50
      INF_TEMP = 1.5
      VERIF_TEMP = 1
      USE_COMPARISONS = True
      COMPARISON_TARGETS = 100
      MINIBATCH_SIZE = 1000
      USE_BABY = True
      USE_EXTENDED_REWRITE = True
      MODEL = "publishers/google/models/gemini-1.5-flash-002"
      SCORING_MODEL = "publishers/google/models/gemini-1.5-pro-002"

      where DATASET_NAME can be set to "aime" (AIME 2024), "math" (Berkeley MATH), or "livebench_spatial" (LiveBench Reasoning). If you set DATASET_NAME = "aime", also set USE_BABY = False, since preliminary scoring is not used on AIME.

Note: The notebook versions include additional statistics and visualizations not available in the script versions. The Python scripts will not re-use the precomputed artifacts in the bucket.

⚠️ WARNING: If the notebooks cannot find the precomputed artifacts in your bucket, they will send fresh queries to the Vertex AI service. Disable inference or set a small budget to avoid unexpected costs.

Citation

If you use this code in your research, please cite:

@article{SSSGoogle2025,
  author    = {Eric Zhao and Pranjal Awasthi and Sreenivas Gollapudi},
  title     = {Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification},
  year      = {2025},
  url       = {https://github.com/google-research/google-research/sampling-based-search}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
