APIEval-20: Objective Benchmark for API Testing Agents

More About APIEval-20

APIEval-20 is the first benchmark designed specifically to evaluate how well AI agents can generate API test suites that actually find bugs, using only a schema and a sample payload, with no access to source code or documentation. It measures real-world black-box testing capability across 20 diverse API scenarios spanning e-commerce, payments, authentication, and other domains.

Product Highlights

Black-Box Evaluation: Tests AI agents with only a JSON schema and a sample payload, mirroring how developers often receive APIs in practice.
Three-Tier Bug Complexity: Measures detection of simple structural bugs, moderate constraint violations, and complex multi-field semantic errors.
Automated Live Testing: Every test case runs against real deployed API implementations with objective, reproducible scoring.
Weighted Scoring System: Weights bug finding at 70%, coverage at 20%, and efficiency at 10%, so finding bugs dominates the score while thorough, lean test suites are still rewarded.
Multi-Domain Coverage: 20 scenarios across 7 application domains including payments, user management, scheduling, and search.
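
The 70/20/10 weighting can be illustrated with a short Python sketch. Only the three weights come from the listing. The rest is assumed for illustration and is not APIEval-20's published formula: that the components are combined linearly, that each is normalized to the range 0 to 1, the `SuiteResult` fields, and the `test_budget` notion of efficiency are all hypothetical.

```python
from dataclasses import dataclass

# Weights as stated in the listing: bug finding 70%, coverage 20%, efficiency 10%.
WEIGHTS = {"bug_detection": 0.70, "coverage": 0.20, "efficiency": 0.10}


@dataclass
class SuiteResult:
    """Hypothetical summary of one agent's test suite on one scenario."""
    bugs_found: int
    bugs_planted: int
    fields_exercised: int
    fields_total: int
    tests_used: int
    test_budget: int  # assumed notion of a "reasonable" suite size


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator clamped to [0, 1]; 0 if denominator <= 0."""
    if denominator <= 0:
        return 0.0
    return min(1.0, max(0.0, numerator / denominator))


def weighted_score(result: SuiteResult) -> float:
    """Combine the three components linearly (an assumption, see lead-in)."""
    bug_detection = _ratio(result.bugs_found, result.bugs_planted)
    coverage = _ratio(result.fields_exercised, result.fields_total)
    # Assumed efficiency: full marks within budget, decaying as the suite grows.
    efficiency = _ratio(result.test_budget, result.tests_used)
    return (
        WEIGHTS["bug_detection"] * bug_detection
        + WEIGHTS["coverage"] * coverage
        + WEIGHTS["efficiency"] * efficiency
    )


example = SuiteResult(
    bugs_found=3, bugs_planted=4,
    fields_exercised=8, fields_total=10,
    tests_used=40, test_budget=20,
)
print(round(weighted_score(example), 3))
```

In this made-up example the suite finds 3 of 4 bugs (0.75), exercises 8 of 10 fields (0.8), and uses twice the assumed budget (0.5), giving 0.7 × 0.75 + 0.2 × 0.8 + 0.1 × 0.5 ≈ 0.735. Under such a weighting, a bloated suite loses at most a tenth of the score, while missed bugs cost far more.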

Use Cases

AI Agent Evaluation: Compare LLM-based testing agents against a standardized, objective benchmark for API test generation.
QA Automation Research: Develop and validate new approaches to automated test suite generation for REST APIs.
Tool Selection: Make data-driven decisions when choosing between coding assistants and specialized testing agents.

Target Audience

APIEval-20 serves AI researchers building testing agents, engineering teams evaluating automation tools, and QA leaders seeking objective metrics to compare agent performance against human-level testing standards.
