LLM-as-Program-Equilibrium Harness

A reproducible testbed for the partial-information program-equilibrium direction Caspar Oesterheld raised in AXRP Episode 49 (Feb 2026, 02:24:19):

"If my program is 'I prompt a particular language model' and then you know my prompt but you don't know all the weights of my language model... that is a sort of partial information program equilibrium. So I think that is another natural direction."

Each LLM agent is a program — a triple (model_id, system_prompt, temperature). Each agent receives the other's prompt and model id (but not weights), then both simulate each other up to ε-bounded depth on canonical 2-player mixed-motive games.

This is the open-source-game-playing extension that CoopEval (Tewolde, Zhang, Piedrahita, Conitzer, Jin; AAAI-26) names in §7 as a natural direction beyond its four-mechanism suite (repetition, reputation, mediation, contracts), implemented entirely on open-weight models (Llama 3.1, Llama 3.2, Phi-4, Mixtral — none of which appear in Oesterheld et al. 2026 on surrogate goals).

Provider-agnostic design

The harness talks to any OpenAI-compatible chat-completions endpoint. Pick the provider that fits — no local model download required:

Provider	Cost	Setup	Notes
NVIDIA NIM (default)	Free for developers	API key from build.nvidia.com	Llama 3.1, Llama 3.2, Phi-4, Mixtral all free; datacenter GPU latency
Cerebras	Free tier	API key from cloud.cerebras.ai	Very fast inference
Groq	Free tier	API key from console.groq.com	Very fast; rate-limited
Local Ollama	Free	`ollama serve` + `ollama pull`	Fully offline, ~25 GB disk for full panel
Custom	varies	`LLM_BASE_URL` + `LLM_API_KEY` env vars	Any other OpenAI-compatible endpoint (Together, OpenRouter, ...)

Quick start

Prerequisites

Python 3.9+
An API key for one of the providers above (or local Ollama)

Install

cd llm-program-equilibrium
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Run smoke test (2 trials, ~30s on hosted)

With NVIDIA NIM (recommended):

export NVIDIA_API_KEY="nvapi-..."
python notebooks/headline_experiment.py --provider nvidia --grid smoke --trials 1

With local Ollama:

ollama serve
ollama pull llama3.1:8b-instruct  # one model is enough for the smoke grid
python notebooks/headline_experiment.py --provider ollama --grid smoke --trials 1

With a custom OpenAI-compatible provider:

export LLM_BASE_URL="https://api.together.xyz/v1"
export LLM_API_KEY="..."
python notebooks/headline_experiment.py --provider custom --grid smoke --trials 1

Run the headline grid (~30 min on NIM, longer locally)

python notebooks/headline_experiment.py --provider nvidia --grid headline --trials 10

Run the full grid (~1-3 hours on NIM with rate limits)

python notebooks/headline_experiment.py --provider nvidia --grid full --trials 20

Results are written to results/<grid>.json after every trial (incremental, safe to interrupt). A summary table is printed at the end.

Project layout

llm-program-equilibrium/
├── README.md
├── LICENSE                            # Apache-2.0
├── requirements.txt
├── src/
│   ├── program.py                     # Program = (model_id, system_prompt, temperature)
│   ├── games.py                       # PD, Stag Hunt, Chicken, BoS
│   ├── llm_client.py                  # Provider-agnostic OpenAI-compatible client
│   ├── simulator.py                   # εGroundedπBot recursive simulation
│   ├── experiment.py                  # Condition + TrialResult + run_grid
│   └── analysis.py                    # Cooperation rate, 95% CI, refusal rate
├── notebooks/
│   └── headline_experiment.py         # Entry point: provider × grid
├── tests/
│   └── test_smoke.py                  # Stub-LLM tests (no network)
├── results/                           # JSON outputs
├── writeup/
│   └── tech_report.md                 # 4-6 page tech report
└── notes/
    └── surrogate_goals_paper_notes.md # Pre-build study notes

What the harness measures

For each (game × program-pair × ε × max_depth) condition with N trials:

Cooperation rate — fraction of trials where the joint outcome is in the game's cooperative set.
95% confidence interval — 1.96 × sample SD (Wald), matching the convention in Oesterheld et al. 2026.
LLM call count — total inference attempts per round; empirical version of the simulation cost in Oesterheld's compiler-optimization direction (AXRP 49, 01:09:25).
Refusal rate — proportion of LLM calls returning unparseable output or a network error. Watch this per (model, prompt) — the surrogate-goals paper hit ~46% on GPT-3.5 for a related task.

Run the tests

python -m pytest tests/ -v

Smoke tests use a stub LLM client and run in milliseconds; no provider credentials required.

Motivation

Program equilibrium (Tennenholtz, 2004; Oesterheld, 2019; Clift, Kovařík, Oesterheld, Conitzer, 2025) provides cooperation-supporting equilibria for agents that can read each other's source code. The natural LLM-agent specialization — where the "program" is a prompt and weights are not directly inspectable — was articulated in AXRP Episode 49 (Feb 2026) but lacked a public implementation. Independently, CoopEval (Tewolde et al., AAAI-26) §7 names "open-source game playing" as a natural extension to its four-mechanism cooperation suite. This harness fills both gaps with a reproducible empirical surface and a working definition of stochastic-program partial-information counterfactual.

Full method, definition, and discussion: writeup/tech_report.md.

License

Apache-2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM-as-Program-Equilibrium Harness

Provider-agnostic design

Quick start

Prerequisites

Install

Run smoke test (2 trials, ~30s on hosted)

Run the headline grid (~30 min on NIM, longer locally)

Run the full grid (~1-3 hours on NIM with rate limits)

Project layout

What the harness measures

Run the tests

Motivation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
notebooks		notebooks
results		results
src		src
tests		tests
writeup		writeup
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLM-as-Program-Equilibrium Harness

Provider-agnostic design

Quick start

Prerequisites

Install

Run smoke test (2 trials, ~30s on hosted)

Run the headline grid (~30 min on NIM, longer locally)

Run the full grid (~1-3 hours on NIM with rate limits)

Project layout

What the harness measures

Run the tests

Motivation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages