SWE-bench Verified

Coding

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.

Vendor harness rows and standardized mini-SWE-agent rows are listed side by side; compare models within one regime, not across them.

SWE-bench Verified is a 500-instance subset of SWE-bench, screened by human annotators (an OpenAI collaboration with the Princeton authors) to remove underspecified or unsolvable tasks. Each task pairs a real GitHub issue with a repository snapshot, and the model must produce a patch that makes the issue's failing tests pass without breaking existing tests. Two score regimes coexist. Vendors self-report scores obtained with their own agent harnesses (the headline rows here). The official swebench.com leaderboard instead reruns models on a standardized mini-SWE-agent scaffold, which typically lands 10 to 20 points lower, and publishes an average cost per task (those rows power the score vs. cost chart). Tasks are Python-only and drawn from pre-2023 repository history, so models may have seen them during training: contamination is a standing concern, and top scores are nearing saturation.
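The pass criterion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the official evaluation harness: the `TaskResult` class, the test names, and the instance IDs are hypothetical, and the `fail_to_pass` / `pass_to_pass` fields only mirror the naming of the SWE-bench dataset's `FAIL_TO_PASS` and `PASS_TO_PASS` lists. It assumes a task counts as resolved only when every test in both lists passes after the model's patch is applied, and that the reported score is the percentage of resolved tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running the repo's tests after applying a model's patch.

    Hypothetical container for illustration; not part of any official tooling.
    """
    instance_id: str
    fail_to_pass: list[str]  # tests that failed before the fix and must now pass
    pass_to_pass: list[str]  # tests that passed before and must keep passing
    passed: set[str]         # tests observed to pass with the patch applied


def is_resolved(task: TaskResult) -> bool:
    """A task is resolved only if the issue is fixed and nothing regressed."""
    fixed = all(t in task.passed for t in task.fail_to_pass)
    no_regressions = all(t in task.passed for t in task.pass_to_pass)
    return fixed and no_regressions


def resolve_rate(tasks: list[TaskResult]) -> float:
    """Percentage of tasks resolved (higher is better)."""
    if not tasks:
        return 0.0
    return 100.0 * sum(is_resolved(t) for t in tasks) / len(tasks)


# Made-up results for four tasks (the real benchmark has 500).
results = [
    TaskResult("demo-1", ["test_a"], ["test_b"], {"test_a", "test_b"}),  # resolved
    TaskResult("demo-2", ["test_a"], ["test_b"], {"test_a"}),            # regression
    TaskResult("demo-3", ["test_a"], ["test_b"], {"test_b"}),            # issue not fixed
    TaskResult("demo-4", ["test_a"], [], {"test_a"}),                    # resolved
]

print(f"{resolve_rate(results):.1f}%")  # prints "50.0%"
```

Under this all-or-nothing rule, a patch that fixes the issue but breaks one existing test earns no credit, so the percentage measures complete fixes rather than partial progress.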

Score vs. cost
Leaderboard

Updated June 2026