The cost-and-failure-mode benchmark for LLM agents. Methodology plus Python package for honest, reproducible cross-provider agent evaluation.
benchmark evaluation gemini openai procurement agents grok perplexity llm anthropic perplexity-ai grok-ai tcot
Updated May 16, 2026 - Python