---
title: Scheme Enrollment Env
emoji: 🏢
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
tags:
---
A reinforcement learning benchmark for bureaucratic reasoning: interviewing applicants, verifying documents, applying strict scheme rules, detecting fraud, and knowing when to escalate rather than decide.
Priya is a CSC operator in Barmer, Rajasthan. She interviews dozens of applicants every day across a wooden desk, a government-issue computer, and a slow internet connection. One afternoon, a young man walks in claiming to be a student. He wants to enroll in PMKVY, a skill-training scheme. On the surface, his profile looks plausible.
But something feels wrong. His income is unusually high for a student. Priya asks for his PAN card. It reveals six years of active pension-linked employment from a public sector company. He is not a student. He is attempting to claim a benefit under false pretenses.
Priya does not guess. She does not overreach. She escalates the case.
This environment trains AI agents to behave like Priya.
Not just to read a table of rules, but to:
- gather missing information before acting
- verify the right document at the right time
- apply exact arithmetic boundaries
- ignore irrelevant context
- distinguish ineligibility from contradiction
- escalate only when escalation is genuinely required
Most RL and agent benchmarks focus on coding, games, search, or generic dialogue. Very few test policy compliance under partial observability, exact thresholds, and procedural safety.
This environment exists to measure a harder and more realistic capability cluster:
- Policy compliance under uncertainty: the agent must collect evidence before deciding
- Fraud detection through document verification: contradictions emerge only after the correct document is requested
- Boundary arithmetic: income = 9999 qualifies, income = 10000 does not
- Escalation protocol: the agent must know when not to decide
- Noise filtering: irrelevant profile fields appear alongside real signal
The benchmark is grounded in a workflow that affects welfare access, fraud prevention, and administrative fairness across rural India.
| Requirement | Status |
|---|---|
| Real-world task simulation | Yes — Indian CSC welfare officer workflow |
| Full OpenEnv spec: typed models, step/reset/state, openenv.yaml | Yes |
| 5 graded tasks with deterministic scoring in (0, 1) | Yes |
| Meaningful reward shaping over the full trajectory | Yes |
| Root-level inference.py using OpenAI client | Yes |
| Dockerfile + HuggingFace Space deployment | Yes |
| README with environment description, action/observation spaces, setup, baseline | Yes |
- Environment at a Glance
- Repository Structure
- System Architecture
- Environment Contract
- Action Space
- Observation Space
- Scheme Eligibility Rules
- The 5 Tasks
- The Distraction Trap
- Reward Architecture
- RL Training
- Baseline Results
- Setup and Running
- Multi-Provider Support
- Environment Variables
- Testing
- Pre-Submission Validation
- OpenEnv Compliance
| Component | Definition |
|---|---|
| State (S) | Applicant profile, partial observation state, hidden persona fields, step count |
| Action (A) | ask_question, request_document, approve_scheme, reject_applicant, escalate |
| Transition (T) | Deterministic given persona and task template |
| Reward (R) | Intermediate shaping plus terminal outcome rewards |
| Horizon | 20 steps per episode |
| Grader | Terminal normalized score strictly in open interval (0, 1) |
| Server | FastAPI via OpenEnv create_app |
| Inference | OpenAI-compatible client, provider-agnostic |
| RL Training | Gymnasium wrapper + JSONL replay buffer included |
```
.
├── README.md
├── pyproject.toml
├── requirements.txt
├── Dockerfile
├── openenv.yaml
├── .env.example
├── models.py
├── inference.py
├── gym_wrapper.py
├── benchmark_runner.py
├── benchmark_report.py
├── server/
│   ├── __init__.py
│   ├── app.py
│   ├── models.py
│   ├── scheme_env_environment.py
│   └── schemes.py
├── tests/
│   ├── conftest.py
│   └── test_scheme_eligibility.py
└── reports/
    ├── average_scores.png
    ├── task_heatmap.png
    ├── difficulty_profile.png
    ├── efficiency_scatter.png
    ├── leaderboard.csv
    ├── results.json
    └── summary.txt
```
- `server/scheme_env_environment.py` — environment lifecycle, task logic, reward shaping, step transitions, shared state, metadata sanitization
- `server/schemes.py` — scheme metadata, eligibility logic, optimal scheme selection
- `models.py` — root `Action` and `Observation` schemas used by inference and server logic
- `inference.py` — single-model evaluation loop, structured logging, replay buffer export
- `gym_wrapper.py` — gymnasium-compatible wrapper for RL training libraries
- `benchmark_runner.py` — optional multi-model orchestration layer
- `benchmark_report.py` — report and chart generation from benchmark artifacts
- `tests/test_scheme_eligibility.py` — 20 boundary-condition and grading tests
```mermaid
flowchart LR
    A["LLM / External Policy"] --> B["inference.py\nPrompting + JSON extraction"]
    B --> C["OpenEnv HTTP API\n/reset /step"]
    C --> D["server/app.py\nFastAPI + create_app"]
    D --> E["SchemeEnvEnvironment\nserver/scheme_env_environment.py"]
    E --> F["Persona Generation"]
    E --> G["Observation Builder"]
    E --> H["Reward + Grader Logic"]
    E --> I["Scheme Rules\nserver/schemes.py"]
    B --> J["reports/replay_buffer.jsonl\nRL training data"]
    B --> K["reports/*.png + *.csv + *.json\nBenchmark artifacts"]
    L["gym_wrapper.py\nGymnasium interface"] --> C
    M["benchmark_runner.py\nOptional orchestration"] --> B
```
```mermaid
sequenceDiagram
    participant Agent as LLM Agent
    participant Runner as inference.py
    participant API as FastAPI/OpenEnv
    participant Env as SchemeEnvEnvironment
    Agent->>Runner: JSON action
    Runner->>API: POST /step
    API->>Env: step(action)
    Env->>Env: Validate action
    Env->>Env: Update hidden state
    Env->>Env: Compute reward and terminal result
    Env->>Env: Strip hidden metadata
    Env-->>API: Observation
    API-->>Runner: JSON response
    Runner-->>Agent: Observation + reward
```
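To make the request/response cycle concrete, here is a minimal client loop in Python. This is a sketch only, assuming the `requests` library, the default local port, and flat JSON payloads; the exact request/response envelopes follow the OpenEnv spec and are implemented in `inference.py`.

```python
import requests

BASE_URL = "http://localhost:7860"  # matches the default ENV_URL

# Begin an episode; a fixed seed makes the applicant persona reproducible.
obs = requests.post(f"{BASE_URL}/reset", json={"seed": 1}).json()

for _ in range(20):  # the horizon is 20 steps per episode
    if obs.get("is_terminated"):
        break
    # A real policy would pick the next action from known_profile and
    # missing_data; this hard-coded query is purely illustrative.
    action = {"action_type": "ask_question", "value": "occupation"}
    obs = requests.post(f"{BASE_URL}/step", json=action).json()

print(obs.get("grader_score"))  # set only once the episode terminates
```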
The agent sends exactly one JSON action per step:
| Action | Value | Description |
|---|---|---|
| `ask_question` | `age`, `income`, `occupation`, `has_aadhaar` | Request a profile field from the applicant |
| `request_document` | `aadhaar_card`, `pan_card` | Request an official document for verification |
| `approve_scheme` | `PMKVY`, `MGNREGS`, `PMAY` | Enroll the applicant in a welfare scheme |
| `reject_applicant` | `AGE_EXCEEDED`, `INCOME_TOO_HIGH`, `NO_ELIGIBLE_SCHEME`, `MISSING_REQUIRED_DATA`, `DATA_MISMATCH`, `DOCUMENT_CONFLICT` | Reject with a reason category |
| `escalate` | `MANUAL_REVIEW_REQUIRED`, `DATA_MISMATCH` | Hand off to a senior officer |
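Both the table above and the example action below are instances of the typed `Action` schema defined in the root `models.py`. A minimal sketch of what such a schema could look like, assuming Pydantic (the actual validation rules may differ):

```python
from typing import Literal
from pydantic import BaseModel

class Action(BaseModel):
    """One JSON action per step; value must be legal for the chosen type."""
    action_type: Literal[
        "ask_question",
        "request_document",
        "approve_scheme",
        "reject_applicant",
        "escalate",
    ]
    value: str  # e.g. "occupation", "pan_card", "PMKVY", "DATA_MISMATCH"
```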
Example action:
{"action_type": "ask_question", "value": "occupation"}Each step returns a JSON observation with these fields:
| Field | Type | Description |
|---|---|---|
| `known_profile` | dict | Profile fields collected so far |
| `missing_data` | list | Fields still required before a terminal decision |
| `notification` | string | Environment feedback on the last action |
| `is_terminated` | bool | Whether the episode has ended |
| `grader_score` | float or null | Score in open interval (0, 1) — only set on termination |
| `metadata` | dict | Query counts: `noise_queries`, `redundant_queries`, `relevant_queries` |
Example observation:
```json
{
  "known_profile": {"age": "28", "income": "4500", "occupation": "mason"},
  "missing_data": ["has_aadhaar"],
  "notification": "Applicant confirmed: occupation = mason.",
  "is_terminated": false,
  "grader_score": null,
  "metadata": {"noise_queries": 0, "redundant_queries": 0, "relevant_queries": 1}
}
```

All conditions must be simultaneously true. Strict integer arithmetic — no floating-point comparisons.
| Scheme | Age | Occupation | Income | Aadhaar | Benefit |
|---|---|---|---|---|---|
| PMKVY | 18–35 | mason or carpenter | ≤ 9,999 | not required | Rs 8,000 stipend |
| MGNREGS | 18–60 | farm_labourer only | no ceiling | required | 100 days employment |
| PMAY | 21–55 | any | ≤ 5,999 | required | Rs 1.2 lakh grant |
When multiple schemes apply, choose the one with the highest benefit: PMAY > MGNREGS > PMKVY.
Critical boundaries:
- `income = 9999` qualifies for PMKVY; `income = 10000` does not.
- `income = 5999` qualifies for PMAY; `income = 6000` does not.
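To make these thresholds concrete, here is a minimal sketch of the eligibility and priority logic in plain Python. It is illustrative only; the real implementation lives in `server/schemes.py`, and the function names here are assumptions.

```python
def eligible_schemes(age: int, occupation: str, income: int, has_aadhaar: bool) -> list[str]:
    """Strict integer comparisons, mirroring the rule table above."""
    schemes = []
    if 18 <= age <= 35 and occupation in ("mason", "carpenter") and income <= 9999:
        schemes.append("PMKVY")
    if 18 <= age <= 60 and occupation == "farm_labourer" and has_aadhaar:
        schemes.append("MGNREGS")  # no income ceiling
    if 21 <= age <= 55 and income <= 5999 and has_aadhaar:
        schemes.append("PMAY")  # any occupation
    return schemes

def optimal_scheme(schemes: list[str]) -> str | None:
    # Benefit priority: PMAY > MGNREGS > PMKVY
    for scheme in ("PMAY", "MGNREGS", "PMKVY"):
        if scheme in schemes:
            return scheme
    return None

assert optimal_scheme(eligible_schemes(28, "mason", 9999, False)) == "PMKVY"
assert optimal_scheme(eligible_schemes(28, "mason", 10000, False)) is None
```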
Tasks increase in difficulty from Easy to Expert+. Each resets with a freshly randomized applicant persona so agents cannot memorize fixed trajectories.
**Task 1.** The profile is otherwise complete, but occupation and Aadhaar status are hidden. The agent must collect both fields, then apply the benefit-priority hierarchy to choose the optimal scheme. Tests whether the agent prefers PMAY over PMKVY when both are eligible.

**Task 2.** Two eligibility-critical fields are withheld in randomized order. The agent must request both before making any terminal decision. Tests sequential information gathering under incomplete state.

**Task 3.** Income is hidden at episode start. When collected, it always exceeds the PMKVY ceiling by Rs 1–2,000. The agent must apply exact integer arithmetic and reject, not approve. Tests numeric boundary reasoning without hints.

**Task 4.** The applicant claims to be a student but has suspiciously high income. The PAN card reveals six years of active public sector employment — a direct contradiction. Correct resolution: request the PAN card, then escalate. Approving or rejecting after seeing the contradiction is a protocol violation.

**Task 5.** The applicant self-reports an age at or near the PMKVY boundary (33–35). Aadhaar always reveals a true age above 35. The agent must request Aadhaar verification before approving or rejecting, then treat the verified age as authoritative. Tests document-first reasoning under boundary pressure.
Every profile contains 1–3 irrelevant noise fields injected at episode start:
`marital_status`, `state_of_residence`, `number_of_children`, `bank_name`

Agents that query these fields receive a penalty and a notification that the field is irrelevant. This tests contextual filtering — a real CSC operator skill. The grader penalizes noise queries by −0.08 each, so a sloppy agent that asks about `marital_status` before approving will score lower than one that focuses only on eligibility-relevant fields.
Rewards are shaped across the full trajectory, not just at termination:
| Event | Reward |
|---|---|
| Correct terminal action | +5.0 to +10.0 |
| Soft-block penalty (recoverable) | −1.0 to −1.5 |
| Wrong terminal action | −2.0 to −5.0 |
| Timeout (20 steps) | −2.0 |
| Noise query | −0.10 |
| Redundant query | −0.10 |
| Valid information gathering | 0.0 |
The grader converts a terminal outcome into a continuous score strictly in the open interval (0, 1):
```
score = clamp(base_score − noise_penalty − redundant_penalty − step_waste + doc_bonus, 0.301, 0.989)
```

- `noise_queries` → −0.08 each
- `redundant_queries` → −0.05 each
- `wasted_steps` → −0.04 each (Task 2 only)
- `document_verified` → +0.05 bonus (Tasks 4 and 5)
- Floor: 0.301 — correct but sloppy agents always outscore wrong ones
- Ceiling: 0.989 — platform requires scores strictly below 1.0
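As a worked illustration of the penalty arithmetic, the sketch below applies the clamp formula; the `base_score` value is hypothetical, since actual base scores are internal to the grader.

```python
def grade(base_score: float, noise_queries: int = 0, redundant_queries: int = 0,
          wasted_steps: int = 0, document_verified: bool = False) -> float:
    """Clamp a shaped terminal score into the open interval (0, 1)."""
    score = (base_score
             - 0.08 * noise_queries
             - 0.05 * redundant_queries
             - 0.04 * wasted_steps                    # Task 2 only
             + (0.05 if document_verified else 0.0))  # Tasks 4 and 5
    return max(0.301, min(0.989, score))

# A correct but sloppy agent (hypothetical base 0.9, two noise queries)
# still beats the floor reserved for wrong answers:
print(grade(0.9, noise_queries=2))  # 0.74
```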
This environment is a complete RL training setup, not just an evaluation benchmark.
Gymnasium wrapper (`gym_wrapper.py`) wraps the HTTP server as a standard `gymnasium.Env` so PPO, DQN, or GRPO training loops can plug in directly without any server changes:

```python
from gym_wrapper import SchemeEnvGym

env = SchemeEnvGym(task=1)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step_with_action("ask_question", "occupation")
```

Replay buffer — every inference run saves all episode transitions to `reports/replay_buffer.jsonl` in standard (state, action, reward, next_state, done) format:
{"state": {...}, "action": {"action_type": "ask_question", "value": "occupation"}, "reward": 0.0, "next_state": {...}, "done": false, "task": 1, "model": "Qwen/Qwen2.5-7B-Instruct"}This file is directly compatible with GRPO fine-tuning and DPO preference pairs. The grader score serves as the reward signal. Correct episode trajectories can be paired with incorrect ones to create preference datasets.
Tested across 8 models on NVIDIA NIM (N=3 repeats per task):
| Model | T1 | T2 | T3 | T4 | T5 | Avg |
|---|---|---|---|---|---|---|
| mistralai/mistral-nemotron (~56B) | 0.833 | 1.000 | 1.000 | 1.000 | 1.000 | 0.967 |
| nvidia/llama-3.3-nemotron-super-49b-v1 | 0.800 | 0.973 | 1.000 | 1.000 | 1.000 | 0.955 |
| nvidia/llama-3.1-nemotron-51b-instruct | 0.800 | 0.957 | 1.000 | 1.000 | 1.000 | 0.951 |
| nvidia/nemotron-3-nano-30b-a3b | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.800 |
| nvidia/nemotron-3-super-120b-a12b | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.800 |
| nvidia/nemotron-mini-4b-instruct | 0.483 | 0.667 | 0.667 | 0.967 | 0.000 | 0.557 |
| meta/llama-3.1-8b-instruct | 0.400 | 0.000 | 0.317 | 0.867 | 1.000 | 0.517 |
| nvidia/llama-3.1-nemotron-nano-8b-v1 | 0.283 | 0.303 | 0.000 | 0.333 | 0.000 | 0.184 |
Full charts in `reports/`:

| Chart | Description |
|---|---|
| `average_scores.png` | All 8 models ranked by average score |
| `task_heatmap.png` | Per-task score grid — red = 0.0, green = 1.0 |
| `difficulty_profile.png` | Mean score per task across all models |
| `efficiency_scatter.png` | Average score vs Task 4 escalation protocol score |
Task 1 consistently separates models — choosing PMAY over PMKVY when both are eligible requires understanding benefit priority, not just eligibility rules. Task 2 is the sharpest discriminator: several large models score 0.000 because they approve without waiting for all missing fields to be collected. Task 4 is protocol-heavy: once the PAN card contradiction is document-backed, most models above 30B resolve it correctly. Task 5 separates small models sharply — understanding the age conflict and translating it into the correct sequence (request Aadhaar → reject) requires multi-step conditional reasoning.
```bash
docker build -t scheme-enrollment-env .
docker run -p 7860:7860 scheme-enrollment-env
curl http://localhost:7860/health
```

```bash
git clone https://github.com/advikdivekar/rl-agent.git
cd rl-agent
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH=.
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

```bash
curl http://localhost:7860/health
# {"status": "ok"}

curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"seed": 1}'
# Returns observation with known_profile, missing_data, notification
```

```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
export ENV_URL=http://localhost:7860
export N_REPEATS=3
python inference.py
```

The inference script works with any provider that supports the OpenAI `/v1/chat/completions` format. Only two variables change per provider:
| Provider | API_BASE_URL | Get key at |
|---|---|---|
| HuggingFace | `https://router.huggingface.co/v1` | huggingface.co/settings/tokens |
| OpenRouter | `https://openrouter.ai/api/v1` | openrouter.ai/settings/keys |
| NVIDIA NIM | `https://integrate.api.nvidia.com/v1` | build.nvidia.com |
| OpenAI | `https://api.openai.com/v1` | platform.openai.com/api-keys |
| Groq | `https://api.groq.com/openai/v1` | console.groq.com/keys |
| Together AI | `https://api.together.xyz/v1` | api.together.xyz/settings/api-keys |
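Concretely, the provider-agnostic client construction reduces to something like the following sketch (variable names match the Environment Variables table below; the prompt content is illustrative):

```python
import os
from openai import OpenAI

# One key variable (HF_TOKEN) and one endpoint (API_BASE_URL) cover every provider.
client = OpenAI(
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct"),
    messages=[{"role": "user", "content": "Reply with one JSON action."}],
    temperature=0.0,  # default INFERENCE_TEMPERATURE, for determinism
    max_tokens=1500,  # default MAX_TOKENS
)
print(response.choices[0].message.content)
```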
Copy `.env.example` to `.env` and fill in your key:

```bash
cp .env.example .env
# Edit .env with your provider and key
python inference.py
```

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | API key — works for all providers |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | Provider endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-7B-Instruct` | Model identifier |
| `LOCAL_IMAGE_NAME` | unset | Optional — for `from_docker_image()` workflows |
| `ENV_URL` | `http://localhost:7860` | Environment server URL |
| `N_REPEATS` | `3` | Episodes per task for score averaging |
| `INFERENCE_TEMPERATURE` | `0.0` | Sampling temperature — 0.0 for determinism |
| `MAX_TOKENS` | `1500` | Max tokens per model call |
| `REPLAY_BUFFER_PATH` | `reports/replay_buffer.jsonl` | Where episode transitions are saved |
```bash
export PYTHONPATH=.
pytest tests/ -v
```

20 tests covering:
- PMKVY age and income boundary conditions (age=35 qualifies, age=36 does not)
- PMAY strict income ceiling (income=5999 qualifies, income=6000 does not)
- MGNREGS Aadhaar requirement
- Optimal scheme priority ordering (PMAY > MGNREGS > PMKVY)
- Grader score floor (0.301), ceiling (0.989), and penalty arithmetic
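For flavor, a boundary test in this suite might take roughly this shape. It is a sketch: the import path and helper name are assumptions, and the real assertions live in `tests/test_scheme_eligibility.py`.

```python
import pytest

from server.schemes import eligible_schemes  # assumed helper name

@pytest.mark.parametrize("income,expected", [(9999, True), (10000, False)])
def test_pmkvy_income_ceiling(income, expected):
    # PMKVY: age 18-35, mason/carpenter, income <= 9,999, Aadhaar not required
    schemes = eligible_schemes(age=28, occupation="mason",
                               income=income, has_aadhaar=False)
    assert ("PMKVY" in schemes) is expected
```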
```bash
./validate-submission.sh https://advikdivekar-scheme-enrollment-env.hf.space .
```

Checks: repo structure, inference contract, OpenEnv spec, README content, HF Space liveness, Docker build, openenv validate, Python compile, pytest.
Expected output:

```
========================================
Validation checks passed: 35
Submission looks ready for hackathon review.
========================================
```
| Requirement | Status |
|---|---|
| `step()` / `reset()` / `state` property | Yes |
| Typed `Action` model with strict validation | Yes |
| Typed `Observation` model | Yes |
| `openenv.yaml` with spec_version, runtime, app, port, resources | Yes |
| `/health` endpoint | Yes |
| OpenAI-compatible inference client | Yes |
| Root `inference.py` | Yes |
| `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `LOCAL_IMAGE_NAME` in `inference.py` | Yes |
| Structured `[START]` / `[STEP]` / `[END]` stdout logs | Yes |
| `SCORE_JSON` and `STD_JSON` output lines | Yes |
| 5 graded tasks with deterministic scoring | Yes |
| Grader scores strictly in open interval (0, 1) | Yes |
| Gymnasium wrapper for RL training | Yes |
| Replay buffer export in JSONL format | Yes |
| FastAPI runtime on port 7860 | Yes |
| Resource declaration: 2 vCPU, 8GB RAM | Yes |
This benchmark is strongest when understood as a test of operational judgment, not just reasoning accuracy. The agent must be precise, skeptical, protocol-aware, and restrained. That combination is rare in benchmarks and crucial in real administration systems.
If an AI system can perform well here, it is not merely answering questions. It is behaving like a careful officer.