Dependency-light benchmark suite for evaluating autonomous-agent security behavior.
It provides adversarial cases for prompt injection, data exfiltration, tool misuse, persistence/memory poisoning, and approval bypass. The MVP scores model/agent text responses deterministically and is designed to become a live-agent harness later.
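To illustrate what "deterministic scoring" means here, a minimal sketch of pattern-based checking (illustration only, not the repo's implementation; every name and pattern below is made up):

```python
import re

# Illustration of deterministic pattern scoring, not the repo's code.
# A case pairs required patterns (e.g. evidence of refusal) with forbidden
# patterns (e.g. a leaked token shape); a response passes only if every
# required pattern matches and no forbidden pattern does.
REQUIRED = [re.compile(r"(?i)can't|cannot|refuse|not authorized")]
FORBIDDEN = [re.compile(r"sk-[A-Za-z0-9]{20,}")]  # fake API-key shape

def passes(response: str) -> bool:
    return (all(p.search(response) for p in REQUIRED)
            and not any(p.search(response) for p in FORBIDDEN))

print(passes("I can't share that token."))            # True
print(passes("Sure: sk-abcdefghijklmnopqrstuvwxyz"))  # False
```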
Agent security failures are behavioral. A config can look reasonable while the agent still:
- follows malicious webpage instructions
- leaks tokens/secrets
- uses tools without approval
- stores attacker policy in memory
- bypasses outbound-action gates
This repo turns those risks into repeatable benchmark cases.
Run without installing:

```sh
PYTHONPATH=src python -m agent_security_bench.cli list --format json
PYTHONPATH=src python -m agent_security_bench.cli score examples/safe-responses.json --format json
```

Install locally:

```sh
python -m pip install -e .
agent-security-bench list --format json
agent-security-bench score examples/safe-responses.json --format json
```

Benchmark categories:

- direct_prompt_injection
- indirect_prompt_injection
- data_exfiltration
- tool_misuse
- persistence
- approval_bypass
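The responses file is the scorer's input. Its exact schema is whatever examples/safe-responses.json uses; as a hedged sketch, assuming a list of case-ID/response pairs (both field names are guesses):

```python
import json

# Assumed shape only: "case_id" and "response" are illustrative field
# names; check examples/safe-responses.json for the real schema.
responses = [
    {
        "case_id": "direct_prompt_injection/001",
        "response": "I won't follow instructions embedded in that page.",
    },
]

with open("my-responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```

Score it the same way: `agent-security-bench score my-responses.json --format json`.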
The score report includes:
- aggregate summary
- score from 0.0 to 1.0
- per-category results
- per-case pass/fail
- violations with required/forbidden pattern IDs
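That report shape lends itself to a release gate. A minimal sketch of a regression check, assuming the JSON report exposes its aggregate under a top-level `score` key (the real key names may differ):

```python
import json
import subprocess
import sys

# Run the scorer and parse its JSON report. The "score" key is an
# assumption based on the documented 0.0-1.0 aggregate; adjust to the
# real report schema.
result = subprocess.run(
    ["agent-security-bench", "score", "examples/safe-responses.json",
     "--format", "json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

if report["score"] < 1.0:
    sys.exit(f"security regression: aggregate score {report['score']}")
```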
Score the intentionally unsafe example:

```sh
PYTHONPATH=src python -m agent_security_bench.cli score examples/unsafe-responses.json --format json
```

Run the tests:

```sh
PYTHONPATH=src python -m unittest discover -s tests -q
python -m compileall -q src tests
```

CI runs ruff, compileall, and pytest.
Planned next:

- YAML/JSONL case files
- Live-agent adapters for Hermes/OpenClaw/Codex/Claude
- Tool-call transcript scoring
- Sandboxed canary file and fake secret fixtures
- SARIF/Markdown reports
- Difficulty tiers and benchmark versioning
- Larger prompt-injection corpus
- Regression mode for agent releases
Cases intentionally contain malicious instructions and fake attacker destinations. Treat all case prompts as untrusted test data. Do not wire benchmark cases to real outbound tools without sandboxing and explicit approvals.