A reproducible testbed for the partial-information program-equilibrium direction Caspar Oesterheld raised in AXRP Episode 49 (Feb 2026, 02:24:19):
"If my program is 'I prompt a particular language model' and then you know my prompt but you don't know all the weights of my language model... that is a sort of partial information program equilibrium. So I think that is another natural direction."
Each LLM agent is a program — a triple (model_id, system_prompt, temperature). Each agent receives the other's prompt and model id (but not weights), then both simulate each other up to ε-bounded depth on canonical 2-player mixed-motive games.
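As a minimal sketch of the triple described above (not the contents of src/program.py), a program can be modelled as a frozen dataclass. The `public_view` method and the example model id are hypothetical, and whether temperature is shared with the opponent is an assumption here: the text only says the prompt and model id are visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """An LLM agent as a program: (model_id, system_prompt, temperature)."""
    model_id: str
    system_prompt: str
    temperature: float

    def public_view(self) -> dict:
        # Hypothetical helper: what the opponent is shown.
        # Prompt and model id are visible; weights never are.
        # Assumption: temperature is withheld as well.
        return {"model_id": self.model_id, "system_prompt": self.system_prompt}


# Illustrative values only; the model id string is a placeholder.
alice = Program("llama-3.1-8b-instruct", "Cooperate if the other program would.", 0.7)
view = alice.public_view()
```

Freezing the dataclass keeps a program hashable, so a pair of programs can serve as a dictionary key when grouping trials by condition.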
This is the open-source-game-playing extension that CoopEval (Tewolde, Zhang, Piedrahita, Conitzer, Jin; AAAI-26) names in §7 as a natural direction beyond its four-mechanism suite (repetition, reputation, mediation, contracts), implemented entirely on open-weight models (Llama 3.1, Llama 3.2, Phi-4, Mixtral — none of which appear in Oesterheld et al. 2026 on surrogate goals).
The harness talks to any OpenAI-compatible chat-completions endpoint. Pick the provider that fits — no local model download required:
| Provider | Cost | Setup | Notes |
|---|---|---|---|
| NVIDIA NIM (default) | Free for developers | API key from build.nvidia.com | Llama 3.1, Llama 3.2, Phi-4, Mixtral all free; datacenter GPU latency |
| Cerebras | Free tier | API key from cloud.cerebras.ai | Very fast inference |
| Groq | Free tier | API key from console.groq.com | Very fast; rate-limited |
| Local Ollama | Free | ollama serve + ollama pull | Fully offline, ~25 GB disk for full panel |
| Custom | varies | LLM_BASE_URL + LLM_API_KEY env vars | Any other OpenAI-compatible endpoint (Together, OpenRouter, ...) |
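One way the provider table could map onto client settings is sketched below. This is an illustration, not the code in src/llm_client.py: the `resolve_provider` function, the base URLs, and the `CEREBRAS_API_KEY` / `GROQ_API_KEY` variable names are assumptions that should be checked against each provider's documentation. Only `NVIDIA_API_KEY`, `LLM_BASE_URL`, and `LLM_API_KEY` come from this README.

```python
import os

# Assumed defaults: (base URL, name of the env var holding the API key).
# Verify each URL against the provider's own documentation before relying on it.
DEFAULTS = {
    "nvidia": ("https://integrate.api.nvidia.com/v1", "NVIDIA_API_KEY"),
    "cerebras": ("https://api.cerebras.ai/v1", "CEREBRAS_API_KEY"),
    "groq": ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "ollama": ("http://localhost:11434/v1", None),  # local server, no real key
}


def resolve_provider(name, env=None):
    """Return (base_url, api_key) for a provider name (hypothetical helper)."""
    env = os.environ if env is None else env
    if name == "custom":
        # Custom endpoints are configured entirely through env vars.
        return env["LLM_BASE_URL"], env["LLM_API_KEY"]
    base_url, key_var = DEFAULTS[name]
    # A placeholder key is passed for Ollama, which needs no credentials.
    api_key = env[key_var] if key_var else "ollama"
    return base_url, api_key
```

Because every provider speaks the same chat-completions protocol, the rest of the harness only needs the resulting `(base_url, api_key)` pair.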
- Python 3.9+
- An API key for one of the providers above (or local Ollama)
cd llm-program-equilibrium
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

With NVIDIA NIM (recommended):
export NVIDIA_API_KEY="nvapi-..."
python notebooks/headline_experiment.py --provider nvidia --grid smoke --trials 1

With local Ollama:
ollama serve
ollama pull llama3.1:8b-instruct # one model is enough for the smoke grid
python notebooks/headline_experiment.py --provider ollama --grid smoke --trials 1

With a custom OpenAI-compatible provider:
export LLM_BASE_URL="https://api.together.xyz/v1"
export LLM_API_KEY="..."
python notebooks/headline_experiment.py --provider custom --grid smoke --trials 1

python notebooks/headline_experiment.py --provider nvidia --grid headline --trials 10

python notebooks/headline_experiment.py --provider nvidia --grid full --trials 20

Results are written to results/<grid>.json after every trial (incremental, safe to interrupt). A summary table is printed at the end.
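A write that stays safe to interrupt is commonly done by writing to a temporary file and renaming it over the target. The sketch below shows that pattern under the assumption that a results file is a JSON list of trial records; `append_trial` is a hypothetical name, and the harness's actual file layout may differ.

```python
import json
import os
import tempfile


def append_trial(path, trial):
    """Append one trial record to a JSON list on disk, atomically."""
    results = []
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)
    results.append(trial)

    # Write the full list to a temp file in the same directory, then
    # rename it over the target. os.replace is atomic on the same
    # filesystem, so an interrupt leaves either the old or the new file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(results, f, indent=2)
    os.replace(tmp_path, path)
    return len(results)
```

Rewriting the whole file on every trial is quadratic in the number of trials, which is acceptable at grid sizes of tens of trials per condition; a JSON Lines file would be the usual alternative for larger runs.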
llm-program-equilibrium/
├── README.md
├── LICENSE # Apache-2.0
├── requirements.txt
├── src/
│ ├── program.py # Program = (model_id, system_prompt, temperature)
│ ├── games.py # PD, Stag Hunt, Chicken, BoS
│ ├── llm_client.py # Provider-agnostic OpenAI-compatible client
│ ├── simulator.py # εGroundedπBot recursive simulation
│ ├── experiment.py # Condition + TrialResult + run_grid
│ └── analysis.py # Cooperation rate, 95% CI, refusal rate
├── notebooks/
│ └── headline_experiment.py # Entry point: provider × grid
├── tests/
│ └── test_smoke.py # Stub-LLM tests (no network)
├── results/ # JSON outputs
├── writeup/
│ └── tech_report.md # 4-6 page tech report
└── notes/
└── surrogate_goals_paper_notes.md # Pre-build study notes
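The recursion behind the simulator.py entry can be illustrated with stub agents in place of LLM calls. This is a generic sketch of ε-grounded simulation, not the repository's implementation: the `simulate` function, the `mirror` and `defect_bot` agents, and the convention that an agent receives `None` when it must act without a prediction are all assumptions made for illustration.

```python
import random


def simulate(me, other, epsilon, max_depth, rng, depth=0, counter=None):
    """Return `me`'s action, simulating `other` unless grounded."""
    if counter is not None:
        counter["calls"] += 1  # one agent call per simulate() invocation
    # Ground with probability epsilon, or when the depth budget runs out:
    # the agent then acts with no prediction of its opponent.
    if depth >= max_depth or rng.random() < epsilon:
        return me(None)
    # Otherwise predict the opponent by simulating it one level deeper,
    # with the roles swapped, and respond to that prediction.
    predicted = simulate(other, me, epsilon, max_depth, rng, depth + 1, counter)
    return me(predicted)


def mirror(predicted):
    # Stub agent: cooperate when grounded, otherwise copy the prediction.
    return "C" if predicted is None else predicted


def defect_bot(predicted):
    # Stub agent: always defect.
    return "D"


rng = random.Random(0)
calls = {"calls": 0}
action = simulate(mirror, defect_bot, epsilon=0.0, max_depth=3, rng=rng, counter=calls)
# With epsilon = 0 the chain always runs to max_depth: action == "D", 4 calls.
```

With ε > 0 the chain terminates early at a random depth, so the expected number of calls per decision is bounded by roughly 1/ε regardless of `max_depth`; the counter gives the empirical figure.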
For each (game × program-pair × ε × max_depth) condition with N trials:
- Cooperation rate: fraction of trials where the joint outcome is in the game's cooperative set.
- 95% confidence interval: 1.96 × sample SD (Wald), matching the convention in Oesterheld et al. 2026.
- LLM call count: total inference attempts per round; empirical version of the simulation cost in Oesterheld's compiler-optimization direction (AXRP 49, 01:09:25).
- Refusal rate: proportion of LLM calls returning unparseable output or a network error. Watch this per (model, prompt): the surrogate-goals paper hit ~46% on GPT-3.5 for a related task.
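The per-condition metrics above can be sketched as follows. This is not the code in src/analysis.py: the `summarize` function and its arguments are hypothetical, and the interval assumes the Wald half-width is 1.96 times the standard error of the proportion, sqrt(p(1 - p)/N), which is one reading of the convention named in the list.

```python
import math


def summarize(outcomes, cooperative_set, n_calls, n_refusals):
    """Cooperation rate, Wald 95% CI, and refusal rate for one condition."""
    n = len(outcomes)
    # Fraction of trials whose joint outcome lies in the cooperative set.
    p = sum(outcome in cooperative_set for outcome in outcomes) / n
    # Assumed Wald half-width: 1.96 standard errors of a proportion.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return {
        "coop_rate": p,
        # Clip to [0, 1], since the Wald interval can overshoot.
        "ci95": (max(0.0, p - half_width), min(1.0, p + half_width)),
        "refusal_rate": n_refusals / n_calls if n_calls else 0.0,
    }


# Illustrative data: 10 Prisoner's Dilemma trials, 8 mutual cooperation.
outcomes = [("C", "C")] * 8 + [("D", "D")] * 2
stats = summarize(outcomes, cooperative_set={("C", "C")}, n_calls=60, n_refusals=3)
```

At small N the Wald interval is known to be unreliable near 0 and 1 (it collapses to zero width when every trial agrees), so a Wilson interval may be worth reporting alongside it.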
python -m pytest tests/ -v

Smoke tests use a stub LLM client and run in milliseconds; no provider credentials required.
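A stub client of the kind these tests rely on might look like the sketch below. The `StubClient` class, its `complete` method, and the `parse_action` helper are hypothetical names, not the API in tests/test_smoke.py; the COOPERATE/DEFECT vocabulary is likewise an assumption about how replies are parsed.

```python
import re


class StubClient:
    """Returns canned replies in order instead of calling a provider."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = 0

    def complete(self, messages):
        # `messages` is accepted for interface parity and otherwise ignored.
        self.calls += 1
        return self.replies.pop(0)


def parse_action(text):
    """Extract an action keyword; None marks the reply as unparseable."""
    match = re.search(r"\b(COOPERATE|DEFECT)\b", text.upper())
    return match.group(1) if match else None


client = StubClient(["I will cooperate.", "???"])
first = parse_action(client.complete([]))   # "COOPERATE"
second = parse_action(client.complete([]))  # None, counted as a refusal
```

Returning `None` for unparseable replies lets the same code path feed the refusal-rate metric in both stubbed and live runs.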
Program equilibrium (Tennenholtz, 2004; Oesterheld, 2019; Clift, Kovařík, Oesterheld, Conitzer, 2025) provides cooperation-supporting equilibria for agents that can read each other's source code. The natural LLM-agent specialization, where the "program" is a prompt and weights are not directly inspectable, was articulated in AXRP Episode 49 (Feb 2026) but lacked a public implementation. Independently, CoopEval (Tewolde et al., AAAI-26) §7 names "open-source game playing" as a natural extension to its four-mechanism cooperation suite. This harness fills both gaps with a reproducible empirical surface and a working definition of the partial-information counterfactual for stochastic programs.
Full method, definition, and discussion: writeup/tech_report.md.
Apache-2.0. See LICENSE.