APIEval-20
APIEval-20 is the first benchmark designed specifically to evaluate how well AI agents can generate API test suites that actually find bugs, using only a schema and a sample payload, with no access to source code or documentation. It measures black-box testing capability across 20 diverse API scenarios spanning e-commerce, payments, authentication, and more.
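To make the black-box setting concrete, here is a minimal sketch of how test inputs might be derived from nothing but a schema and one valid payload. It is not APIEval-20's harness or any agent's actual method. The function name `mutate_payload`, the label format, and the example schema are all hypothetical, and the sketch only reads the standard JSON Schema keywords `required`, `properties`, `type`, and `minimum`.

```python
import copy
from typing import Iterator


def mutate_payload(schema: dict, sample: dict) -> Iterator[tuple[str, dict]]:
    """Yield (label, payload) pairs derived from one valid sample payload.

    Hypothetical helper: covers only simple structural and constraint
    mutations, not multi-field semantic cases.
    """
    # Structural mutation: drop each required field in turn.
    for field in schema.get("required", []):
        payload = copy.deepcopy(sample)
        payload.pop(field, None)
        yield f"missing:{field}", payload

    for field, spec in schema.get("properties", {}).items():
        # Structural mutation: replace the value with one of the wrong type.
        payload = copy.deepcopy(sample)
        payload[field] = "not-an-array" if spec.get("type") == "array" else []
        yield f"wrong_type:{field}", payload

        # Constraint mutation: step just below a declared numeric minimum.
        if "minimum" in spec:
            payload = copy.deepcopy(sample)
            payload[field] = spec["minimum"] - 1
            yield f"below_min:{field}", payload


if __name__ == "__main__":
    # Illustrative schema and payload, not taken from the benchmark.
    example_schema = {
        "required": ["amount"],
        "properties": {
            "amount": {"type": "number", "minimum": 0},
            "currency": {"type": "string"},
        },
    }
    example_sample = {"amount": 10, "currency": "USD"}
    for label, payload in mutate_payload(example_schema, example_sample):
        print(label, payload)
```

Each yielded payload would then be sent to the API under test and the response compared with the expected rejection. Bugs that depend on several fields at once need scenario-specific reasoning that a mechanical mutator like this one does not provide.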
Product Highlights
- Black-Box Evaluation: Tests AI agents with only a JSON schema and a sample payload, mirroring how developers actually receive APIs in practice.
- Three-Tier Bug Complexity: Measures detection of simple structural bugs, moderate constraint violations, and complex multi-field semantic errors.
- Automated Live Testing: Every test case runs against real deployed API implementations with objective, reproducible scoring.
- Weighted Scoring System: Weights bug finding at 70%, thorough coverage at 20%, and efficiency at 10%, so that wasteful test suites score lower.
- Multi-Domain Coverage: 20 scenarios across 7 application domains including payments, user management, scheduling, and search.
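The 70/20/10 weighting in the highlights above can be illustrated with a small sketch. Only the three weights come from the description. How the benchmark actually computes each component is not stated here, so the sketch assumes each one is a fraction between 0 and 1 and that the final score is scaled to 0-100. The names `SuiteResult` and `weighted_score` are hypothetical.

```python
from dataclasses import dataclass

# Weights as stated in the highlights; the component definitions are assumed.
WEIGHTS = {"bug_detection": 0.70, "coverage": 0.20, "efficiency": 0.10}


@dataclass
class SuiteResult:
    bugs_found: int     # bugs the generated suite exposed
    bugs_planted: int   # bugs present in the scenario (assumed known to the scorer)
    coverage: float     # assumed fraction in [0, 1]
    efficiency: float   # assumed fraction in [0, 1]; higher means less waste


def weighted_score(result: SuiteResult) -> float:
    """Combine the three components into one score on an assumed 0-100 scale."""
    if result.bugs_planted <= 0:
        raise ValueError("bugs_planted must be positive")
    if not 0 <= result.bugs_found <= result.bugs_planted:
        raise ValueError("bugs_found must be between 0 and bugs_planted")
    for name in ("coverage", "efficiency"):
        value = getattr(result, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")

    detection = result.bugs_found / result.bugs_planted
    combined = (
        WEIGHTS["bug_detection"] * detection
        + WEIGHTS["coverage"] * result.coverage
        + WEIGHTS["efficiency"] * result.efficiency
    )
    return 100.0 * combined


if __name__ == "__main__":
    # 3 of 4 bugs found, half coverage, full efficiency.
    print(round(weighted_score(SuiteResult(3, 4, 0.5, 1.0)), 2))
```

Under these assumptions, a suite that finds every bug but is wasteful still scores at least 70, while an efficient suite that finds nothing is capped at 30. This matches the stated priority on bug finding.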
Use Cases
- AI Agent Evaluation: Compare LLM-based testing agents against a standardized, objective benchmark for API test generation.
- QA Automation Research: Develop and validate new approaches to automated test suite generation for REST APIs.
- Tool Selection: Make data-driven decisions when choosing between coding assistants and specialized testing agents.
Target Audience
APIEval-20 serves AI researchers building testing agents, engineering teams evaluating automation tools, and QA leaders seeking objective metrics to compare agent performance against human-level testing standards.