A benchmark that judges AI coding tools on real software engineering tasks.
It is a garage test for AI. The bike is broken. The wrench is missing. The chain must still work.
It compares coding assistants on real code projects. Can they read the project and fix a real bug?
Agentic coding
The benchmark checks whether an agent can make real, working code changes on its own.
Leaderboard
Its scores can go on a leaderboard for easy model comparison.
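To make the comparison concrete, here is a minimal sketch of how per-task results could be turned into a ranked table. The function name `build_leaderboard`, the model names, and the pass/fail lists are all made up for illustration; the benchmark's real scoring rules may differ.

```python
def build_leaderboard(outcomes):
    """Rank models by the percentage of tasks they solved.

    `outcomes` maps a model name to a list of booleans, one per task
    (True = task solved). This input shape is an assumption.
    """
    rows = []
    for model, solved in outcomes.items():
        # Guard against an empty task list to avoid dividing by zero.
        rate = 100.0 * sum(solved) / len(solved) if solved else 0.0
        rows.append((model, round(rate, 1)))
    # Highest score first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical results for two made-up models on four tasks.
example = {
    "model_a": [True, False, True, True],
    "model_b": [True, False, False, False],
}
leaderboard = build_leaderboard(example)
# leaderboard == [("model_a", 75.0), ("model_b", 25.0)]
```

A single percentage is easy to compare, but it hides which tasks each model solved, so two models with the same score may have very different strengths.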
Benchmark contamination
If tasks leak into training data, the model may have already seen the answers, so the score can look higher than its real skill.
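One rough way to look for leaks is to check how much of a task's text also appears, word for word, in a sample of training text. The sketch below is only a crude heuristic under that assumption, not the method any particular benchmark uses. The names `ngrams` and `overlap_ratio` and the window size `n` are hypothetical.

```python
def ngrams(text, n=8):
    """Return the set of all runs of n consecutive words in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(task_text, corpus_text, n=8):
    """Fraction of the task's word runs that also appear in the corpus.

    A value near 1.0 suggests the task text may have leaked.
    A value near 0.0 suggests it probably has not, though paraphrased
    leaks would not be caught by this exact-match check.
    """
    task_runs = ngrams(task_text, n)
    if not task_runs:
        # Task is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = task_runs & ngrams(corpus_text, n)
    return len(shared) / len(task_runs)
```

For example, `overlap_ratio("a b c d", "x a b c d y", n=2)` returns `1.0` because every two-word run of the task appears in the corpus text.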
AI QA Testing
Test cases help prove that the fix works and does not break the rest of the project.
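The two checks in that sentence can be sketched as a small function: the tests for the bug must pass after the fix, and the tests that already passed must keep passing. The names `is_resolved`, `bug_tests`, and `regression_tests`, and the `"pass"`/`"fail"` result format, are assumptions for illustration only.

```python
def is_resolved(results_after, bug_tests, regression_tests):
    """Decide whether a fix counts as successful.

    `results_after` maps a test name to "pass" or "fail" after the fix
    is applied. A test missing from the results counts as not passing.
    """
    # The fix works: every test that targets the bug now passes.
    fix_works = all(results_after.get(t) == "pass" for t in bug_tests)
    # Nothing broke: every previously passing test still passes.
    nothing_broke = all(results_after.get(t) == "pass" for t in regression_tests)
    return fix_works and nothing_broke


# Hypothetical run: the bug test passes, but an old test now fails.
results = {"test_bug": "pass", "test_login": "pass", "test_export": "fail"}
ok = is_resolved(results, ["test_bug"], ["test_login", "test_export"])
# ok is False, because the fix broke test_export
```

Requiring both checks matters: a change that fixes the bug by breaking something else should not count as a success.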