Code for running Reflective Prompted Policy Optimization (R2PO), baseline policy-search experiments, SB3 baselines, and analysis scripts.
This repository accompanies the paper "Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias".
- A two-stage LLM policy optimizer that separates global parameter search from local rollout-grounded revision.
- A trajectory-evidence design that summarizes policy behavior with aggregate rollout statistics, a median trajectory, and a conservative revision rule (a rough sketch follows this list).
- An analysis of salience bias, a failure mode where trajectory-grounded revision overreacts to vivid but unrepresentative failures.
- Matched-budget ablations that isolate the roles of trajectory evidence, two-stage reasoning, and keep-best selection.
- Using gpt-oss:20b-cloud, R2PO achieves the highest mean reward on 9 of 10 environments; on MountainCarContinuous it is statistically tied with SAC.
- R2PO matches or exceeds all baselines on mean best reward across all 10 environments, reaching maximum return on CartPole, InvertedPendulum, Pong, and Nim.
- Compared with ProPS+, R2PO performs better on all 10 environments and reaches strong policies earlier, including near-optimal CartPole performance within about 500 episodes.
- Ablations show that the gains do not come from extra LLM calls or richer prompts alone; they require trajectory-grounded revision, two-stage design, and keep-best selection.
- The salience-bias analysis explains 76.6% of ThreeTraj regressions on CartPole; R2PO mitigates this with aggregate statistics, median-trajectory selection, and the revision rule.
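To make the trajectory-evidence idea concrete, here is a minimal sketch of how aggregate rollout statistics and a median trajectory could be assembled. The function name, the choice of statistics, and the median-selection rule are illustrative assumptions, not the repository's actual implementation.

```python
import numpy as np

def summarize_rollouts(returns, trajectories):
    """Illustrative sketch: condense a batch of rollouts into aggregate statistics
    plus a single trajectory whose return sits at the batch median."""
    returns = np.asarray(returns, dtype=float)
    # Index of the rollout with the median return (assumed selection rule).
    median_idx = int(np.argsort(returns)[len(returns) // 2])
    return {
        "mean_return": float(returns.mean()),
        "std_return": float(returns.std()),
        "min_return": float(returns.min()),
        "max_return": float(returns.max()),
        "median_trajectory": trajectories[median_idx],
    }
```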
We recommend Python 3.12 with uv. Python 3.10+ should also work, but the released experiments were checked with Python 3.12.
```bash
uv venv --python 3.12
uv pip install -r requirements.txt
uv pip install -e envs/gym-maze-master
```

If you prefer Conda for Python management, create the environment with Conda and still use uv for package installation:

```bash
conda create -n r2po python=3.12
conda activate r2po
uv pip install -r requirements.txt
uv pip install -e envs/gym-maze-master
```

Copy the example environment file and fill in the provider you will use:
```bash
cp .env.example .env
```

The configs use Ollama-format model names, for example:

```yaml
llm_model_name: ollama/gpt-oss:20b-cloud
```

For local Ollama, run the Ollama server locally and set:

```bash
OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_API_KEY is not needed for local Ollama
```

For Ollama cloud / gpt-oss:20b-cloud, set:

```bash
OLLAMA_BASE_URL=https://ollama.com
OLLAMA_API_KEY=your_ollama_api_key_here
```

The code sends Ollama requests to $OLLAMA_BASE_URL/api/chat and uses the model name after the ollama/ prefix. For example, ollama/gpt-oss:20b-cloud becomes gpt-oss:20b-cloud in the request payload.
Optional provider keys are also supported for configs that use those model names:

```bash
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GEMINI_API_KEY=your_gemini_key_here
```

Run a single R2PO experiment on CartPole:
```bash
uv run python main.py --config configs/cartpole/cartpole_reflective_prompted_policy_optimization.yaml
```

Run a single baseline or ablation by changing the config path, for example:

```bash
uv run python main.py --config configs/cartpole/cartpole_props.yaml
uv run python main.py --config configs/cartpole/cartpole_propsp.yaml
uv run python main.py --config configs/cartpole/cartpole_reptraj.yaml
uv run python main.py --config configs/cartpole/cartpole_three_traj.yaml
```

Each run writes to the logdir specified in the config and creates an auto-incremented run_N/ directory.
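A directory layout following that convention could be produced with a helper like the one below; this is a hypothetical sketch of the naming scheme, and the starting index and helper name are assumptions.

```python
from pathlib import Path

def next_run_dir(logdir: str) -> Path:
    """Hypothetical sketch: create the next unused run_N/ directory under the logdir."""
    root = Path(logdir)
    root.mkdir(parents=True, exist_ok=True)
    existing = [
        int(p.name.split("_", 1)[1])
        for p in root.glob("run_*")
        if p.is_dir() and p.name.split("_", 1)[1].isdigit()
    ]
    run_dir = root / f"run_{max(existing, default=-1) + 1}"  # first run assumed to be run_0
    run_dir.mkdir()
    return run_dir
```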
Evaluate an existing experiment log without rerunning training:
```bash
uv run python eval.py \
  --logdir logs/cartpole_reflective_prompted_policy_optimization \
  --labels R2PO
```

This repo does not vendor rl-baselines3-zoo. If you want to run Zoo-based baselines, install it beside the repo after setting up the main environment:
```bash
git clone https://github.com/DLR-RM/rl-baselines3-zoo.git
cd rl-baselines3-zoo
uv pip install -e .
cd ..
```

SB3 baselines use separate configs:

```bash
uv run python sb3_baselines.py --config configs/sb3_baselines/cartpole_sb3.yaml
```

The SB3 runner writes summaries under logs/<env>_sb3/<algorithm>/summary.json.
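To sanity-check those outputs, a small script along these lines could walk the summaries; this is a hypothetical sketch that assumes nothing about the summary schema beyond it being JSON.

```python
import json
from pathlib import Path

# Hypothetical sketch: list every SB3 summary written as logs/<env>_sb3/<algorithm>/summary.json.
for summary_path in sorted(Path("logs").glob("*_sb3/*/summary.json")):
    env_name = summary_path.parent.parent.name.removesuffix("_sb3")
    algorithm = summary_path.parent.name
    summary = json.loads(summary_path.read_text())
    print(env_name, algorithm, summary)
```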
Training logs are written under logs/. To regenerate evaluation summaries from existing logs, run:
```bash
uv run python eval_all_results.py --max-runs 10
```

Paper tables and figures can be regenerated from existing logs with:

```bash
uv run python paper_results_tables.py --preset core --formats md csv png --output-dir paper_results
uv run python paper_reward_curves.py --preset core --output paper_results/core_reward_curves_all.png
uv run python paper_significance_tests.py --comparison-mode all --metric best_reward
```

Experiment configs are grouped by environment under configs/. Common suffixes:
- *_props.yaml: ProPS
- *_propsp.yaml: ProPS+
- *_reptraj.yaml: RepTraj
- *_pure_search.yaml: PureSearch
- *_actor_second_pass.yaml: ActorSecondPass
- *_critic_only.yaml: CriticOnly
- *_always_critic.yaml: AlwaysCritic
- *_three_traj.yaml: ThreeTraj
- *_reflective_prompted_policy_optimization.yaml: R2PO
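When aggregating logs across methods, a mapping like the following can recover a method label from a config path. This is a hypothetical sketch; the repository's analysis scripts may resolve labels differently.

```python
# Hypothetical sketch: map a config filename suffix to the method label used in tables and plots.
SUFFIX_TO_METHOD = {
    "_props": "ProPS",
    "_propsp": "ProPS+",
    "_reptraj": "RepTraj",
    "_pure_search": "PureSearch",
    "_actor_second_pass": "ActorSecondPass",
    "_critic_only": "CriticOnly",
    "_always_critic": "AlwaysCritic",
    "_three_traj": "ThreeTraj",
    "_reflective_prompted_policy_optimization": "R2PO",
}

def method_from_config(config_path: str) -> str:
    """Return the method label for a config file, preferring the longest matching suffix."""
    stem = config_path.rsplit("/", 1)[-1].removesuffix(".yaml")
    for suffix, method in sorted(SUFFIX_TO_METHOD.items(), key=lambda kv: -len(kv[0])):
        if stem.endswith(suffix):
            return method
    raise ValueError(f"Unrecognized config name: {config_path}")
```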
If you use this code or build on the paper, please cite:

```bibtex
@misc{hara2026reflectivepromptedpolicyoptimization,
      title={Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias},
      author={Rahaf Abu Hara and Vaibbhav Murarri and Claudio Zito},
      year={2026},
      eprint={2605.08315},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08315},
}
```