This project implements a complete OpenEnv-compatible environment server for developer workflow automation. It covers three realistic task families that show up in support and engineering operations: spotting bad data in CSV files, triaging customer email, and reviewing Python pull requests for security, logic, and style defects. The goal is to give an autonomous agent a small but realistic benchmark where actions can be evaluated deterministically through `reset()`, `step()`, and `state()` endpoints.
The CSV data-triage action carries two fields:

- `nulls: list[list[int]]` identifies empty cells as `[row_idx, col_idx]` pairs.
- `duplicates: list[int]` identifies row indices that duplicate an earlier row.
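As a rough illustration of how an agent might fill these two fields from raw CSV text, here is a minimal sketch. The helper name `build_action` is hypothetical, and the assumption that indices are 0-based and relative to the data rows (header excluded) comes from the `[row_idx, col_idx]` shape above, not from the server code.

```python
import csv
import io

def build_action(csv_content: str) -> dict:
    """Derive `nulls` and `duplicates` from raw CSV text.
    Assumes 0-based indices relative to the data rows (header skipped)."""
    rows = list(csv.reader(io.StringIO(csv_content)))
    data = rows[1:]  # drop the header row
    nulls = [[r, c] for r, row in enumerate(data)
             for c, cell in enumerate(row) if cell.strip() == ""]
    seen = {}
    duplicates = []
    for r, row in enumerate(data):
        key = tuple(row)
        if key in seen:
            duplicates.append(r)  # duplicates an earlier row
        else:
            seen[key] = r
    return {"nulls": nulls, "duplicates": duplicates}
```

For example, `build_action("a,b\n1,\n1,\n1,2\n")` flags the two empty `b` cells and marks the second data row as a duplicate of the first.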
The email-triage action carries:

- `email_id: str` selects the email currently under review.
- `category: str` must be one of `billing`, `technical`, `general`, or `spam`.
- `priority: str` must be one of `urgent`, `normal`, or `low`.
- `note: str | None` is optional except during urgent routing work, where a useful routing note is expected.
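A client-side sanity check over these constraints could look like the sketch below. `validate_email_action` is a hypothetical helper, not part of the server API; only the allowed category and priority values come from the spec above.

```python
CATEGORIES = {"billing", "technical", "general", "spam"}
PRIORITIES = {"urgent", "normal", "low"}

def validate_email_action(action: dict) -> list[str]:
    """Return a list of constraint violations (empty when the action is valid)."""
    errors = []
    if action.get("category") not in CATEGORIES:
        errors.append("category must be one of billing/technical/general/spam")
    if action.get("priority") not in PRIORITIES:
        errors.append("priority must be one of urgent/normal/low")
    # The spec expects a useful routing note when handling urgent emails.
    if action.get("priority") == "urgent" and not action.get("note"):
        errors.append("urgent emails expect a routing note")
    return errors
```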
The code-review task uses three action types, one per phase:

**FindingAction**

- `phase: "identify"`
- `issue_type: "security" | "logic" | "style"`
- `line: int`
- `description: str`

**FixAction**

- `phase: "fix"`
- `line: int`
- `original: str`
- `fixed: str`
- `rationale: str`

**SummaryAction**

- `phase: "summarize"`
- `approved: bool`
- `risk_level: "high" | "medium" | "low"`
- `summary: str`
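Concretely, a single identify/fix/summarize pass might emit payloads shaped like the dict literals below. The field names match the spec above; the specific issue (a SQL-injection example) and its line number are invented for illustration.

```python
# Hypothetical payloads for one review pass; the issue itself is made up.
finding = {"phase": "identify", "issue_type": "security", "line": 12,
           "description": "SQL query built via string concatenation"}
fix = {"phase": "fix", "line": 12,
       "original": 'cur.execute("SELECT * FROM users WHERE id = " + uid)',
       "fixed": 'cur.execute("SELECT * FROM users WHERE id = %s", (uid,))',
       "rationale": "A parameterized query removes the injection vector"}
summary = {"phase": "summarize", "approved": False, "risk_level": "high",
           "summary": "One SQL injection found and fixed; re-review needed"}
```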
The data-triage observation exposes:

- `csv_content: str` - raw CSV text containing 20 data rows.
- `column_names: list[str]`
- `step: int`
The email-triage observation exposes:

- `emails: list[dict]` - the current email payload as `id`, `subject`, and `body`.
- `current_step_type: "classify" | "prioritize" | "route"`
- `step: int`
The code-review observation exposes:

- `diff: str` - unified diff of the Python file under review.
- `filename: str`
- `current_phase: "identify" | "fix" | "summarize"`
- `issues_found: list[dict]` - accumulated findings and whether each matched a true issue.
- `step: int`
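Since findings must carry a `line` number, an agent needs to map `+` lines in the observation's unified diff back to new-file line numbers. A minimal sketch, assuming standard unified-diff hunk headers; `added_line_numbers` is a hypothetical helper, not part of the environment API:

```python
import re

def added_line_numbers(diff: str) -> list[int]:
    """Return the new-file line numbers of every added ('+') line in a unified diff."""
    numbers, new_line = [], 0
    for raw in diff.splitlines():
        m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", raw)
        if m:
            new_line = int(m.group(1))  # start of the hunk in the new file
            continue
        if raw.startswith("+") and not raw.startswith("+++"):
            numbers.append(new_line)
            new_line += 1
        elif raw.startswith("-") and not raw.startswith("---"):
            continue  # removed lines do not advance the new-file counter
        elif not raw.startswith(("---", "+++")):
            new_line += 1  # context line
    return numbers
```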
The server ships three tasks:

- `data-triage-easy`: single-turn CSV inspection with deterministic null and duplicate labels.
- `email-triage-medium`: multi-step support-operations workflow over seeded email scenarios.
- `code-review-hard`: iterative PR review over a seeded Python diff with seven planted issues.
Rewards for `data-triage-easy`:

- `+0.17` per correct null cell
- `+0.25` per correct duplicate row
- `-0.10` per false positive
- Final grader score: normalized null accuracy, duplicate accuracy, and false-positive penalty clamped into the open interval `(0, 1)` for validator compatibility.
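To make the arithmetic concrete, here is a sketch of how the per-item rewards above could be summed and clamped into the open interval `(0, 1)`. This is an assumption about how the grader combines its terms (the real grader also normalizes accuracies); `grade_data_triage` and the epsilon value are hypothetical.

```python
def grade_data_triage(correct_nulls: int, correct_dups: int,
                      false_positives: int) -> float:
    """Sum the per-item rewards, then clamp strictly inside (0, 1).
    A sketch only; the server's grader may weight terms differently."""
    raw = 0.17 * correct_nulls + 0.25 * correct_dups - 0.10 * false_positives
    eps = 1e-6  # keeps the score strictly inside the open interval
    return min(max(raw, eps), 1.0 - eps)
```

For example, catching every planted issue saturates the clamp at just under 1, while a run of pure false positives bottoms out just above 0.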
Rewards for `email-triage-medium`:

- `+0.075` the first time an email receives the correct `(category, priority)` pair
- `+0.025` for an urgent-email routing note that includes a correct issue keyword
- `-0.05` whenever spam is marked as urgent
- Final grader score: weighted category accuracy, priority accuracy, and urgent-note keyword overlap clamped into `(0, 1)`.
Rewards for `code-review-hard`:

- `+0.057` for each newly identified true issue
- `+0.050` for a correct fix submitted after the issue has been identified
- `-0.040` per false-positive finding
- `-0.100` for approving the diff while any security issue remains undetected
- Final grader score: weighted detection, fix quality, and summary correctness clamped into `(0, 1)`.
Install dependencies:
```
pip install -r requirements.txt
```

Run the FastAPI environment server locally on the Hugging Face Spaces default port:
```
uvicorn server:app --host 0.0.0.0 --port 7860
```

Run inference against the local server:
```
set API_KEY=your_proxy_key_here
set API_BASE_URL=your_proxy_base_url_here
python inference.py --server-url http://127.0.0.1:7860
```

For local Hugging Face Router testing, `HF_TOKEN` is also supported as a fallback when `API_KEY` is not set.
By default, `inference.py` evaluates all three tasks in sequence. You can still target a single task with `--task data-triage-easy`, `--task email-triage-medium`, or `--task code-review-hard`.
Build the container image:
```
docker build .
```

Reference scores:

- `data-triage-easy`: 0.999
- `email-triage-medium`: 0.808
- `code-review-hard`: 0.933
https://huggingface.co/spaces/itzrick/developer-workflow-env