
APIEval-20

Finally, prove your agent actually finds bugs—not just writes pretty tests.

Open black-box benchmark for AI API testing agents. Objective scoring on bug detection, coverage & efficiency using live APIs with planted bugs.


More About APIEval-20

APIEval-20 is the first benchmark designed specifically to evaluate how well AI agents can generate API test suites that actually find bugs—using only a schema and sample payload, with no access to source code or documentation. It measures real-world black-box testing capability across 20 diverse API scenarios spanning e-commerce, payments, authentication, and more.
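To make the black-box setup concrete, here is a hypothetical illustration of the only inputs an agent receives per scenario: a JSON Schema and one sample payload. The field names and constraints below are invented for illustration and are not taken from any actual APIEval-20 scenario.

```python
import json

# Hypothetical scenario inputs: a JSON Schema plus one valid sample payload.
# No source code or documentation is available to the agent.
schema = {
    "type": "object",
    "required": ["item_id", "quantity"],
    "properties": {
        "item_id": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
    },
}
sample_payload = {"item_id": "sku-123", "quantity": 2}

# From these inputs alone, an agent might generate boundary-probing
# test cases, e.g. a payload that violates the "minimum" constraint:
bad_payload = {"item_id": "sku-123", "quantity": 0}

print(json.dumps(sample_payload))
print(json.dumps(bad_payload))
```

A live API that accepts `bad_payload` instead of rejecting it would reveal a planted constraint-validation bug.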

Product Highlights

  • Black-Box Evaluation: Tests AI agents with only a JSON schema and a sample payload—mirroring how developers actually receive APIs in practice.
  • Three-Tier Bug Complexity: Measures detection of simple structural bugs, moderate constraint violations, and complex multi-field semantic errors.
  • Automated Live Testing: Every test case runs against real deployed API implementations with objective, reproducible scoring.
  • Weighted Scoring System: Prioritizes bug finding (70%), rewards thorough coverage (20%), and penalizes inefficiency (10%) for realistic assessment.
  • Multi-Domain Coverage: 20 scenarios across 7 application domains including payments, user management, scheduling, and search.
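The weighted scoring described above can be sketched as a simple linear combination. Only the 70/20/10 weights come from the description; the assumption that each sub-score is normalized to [0, 1] (with efficiency defined so that inefficient suites score low) is mine.

```python
def apieval20_score(bug_detection: float, coverage: float, efficiency: float) -> float:
    """Combine three sub-scores into one overall score.

    Assumes each sub-score is normalized to [0, 1]; the 70/20/10
    weights are taken from the benchmark description, while the
    sub-score definitions themselves are illustrative assumptions.
    """
    for s in (bug_detection, coverage, efficiency):
        if not 0.0 <= s <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    return 0.7 * bug_detection + 0.2 * coverage + 0.1 * efficiency


# An agent that finds every bug but with sparse, inefficient tests
# still scores well, reflecting the bug-finding priority:
print(apieval20_score(1.0, 0.3, 0.2))  # → 0.78
```

The heavy weight on bug detection means a suite that finds bugs with mediocre coverage outscores a thorough suite that misses them.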

Use Cases

  • AI Agent Evaluation: Compare LLM-based testing agents against a standardized, objective benchmark for API test generation.
  • QA Automation Research: Develop and validate new approaches to automated test suite generation for REST APIs.
  • Tool Selection: Make data-driven decisions when choosing between coding assistants and specialized testing agents.

Target Audience

APIEval-20 serves AI researchers building testing agents, engineering teams evaluating automation tools, and QA leaders seeking objective metrics to compare agent performance against human-level testing standards.