SWE-bench Verified
Coding
The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.
SWE-bench Verified is a 500-instance subset of SWE-bench, screened by human annotators (an OpenAI collaboration with the Princeton authors) to remove underspecified or unsolvable tasks. Each task pairs a real GitHub issue with a repository snapshot, and the model must produce a patch that makes the issue's failing tests pass without breaking existing tests. Two score regimes coexist: vendors self-report scores with their own agent harnesses (the headline rows here), while the official swebench.com leaderboard reruns models on a standardized mini-SWE-agent scaffold that typically lands 10 to 20 points lower and publishes an average cost per task (those rows power the score vs cost chart). Tasks are Python-only issues from pre-2023 repos, so contamination is a standing concern and top scores are nearing saturation.
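The pass criterion described above can be sketched in a few lines. This is a minimal illustration, not the official harness: `TaskResult`, `is_resolved`, and `resolve_rate` are hypothetical names, and the sketch assumes test outcomes have already been collected as booleans after applying the model's patch.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration; not the official harness format.
    """
    fail_to_pass: dict[str, bool]  # issue tests (failing before the patch): id -> passes now?
    pass_to_pass: dict[str, bool]  # existing tests (passing before the patch): id -> still passes?


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if every issue test now passes
    # and no previously passing test has regressed.
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved (the headline score, higher is better)."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this all-or-nothing rule, a patch that fixes the issue but breaks one existing test scores the same as no patch at all, so each task contributes either 0 or 1/500 of the final percentage.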
| Rank | Model | Configuration | Score | Avg. cost per task |
|---|---|---|---|---|
| 1 | Claude Fable 5 | Vendor harness | 95% | n/a |
| 2 | GPT-5.5 | Vendor harness | 88.7% | n/a |
| 3 | Claude Opus 4.8 | Vendor harness | 88.6% | n/a |
| 4 | Claude Opus 4.7 | Vendor harness | 87.6% | n/a |
| 5 | DeepSeek V4 | Pro Max, vendor harness | 80.6% | n/a |
| 6 | Gemini 3.1 Pro Preview | Vendor harness | 80.6% | n/a |
| 7 | MiniMax-M3 | Vendor harness | 80.5% | n/a |
| 8 | Qwen3.7 Max | Vendor harness | 80.4% | n/a |
| 9 | Kimi K2.6 | Vendor harness | 80.2% | n/a |
| 10 | GPT-5.2 | Vendor harness | 80% | n/a |
| 11 | Claude Sonnet 4.6 | Vendor harness | 79.6% | n/a |
| 12 | Claude Opus 4.5 | live-SWE-agent | 79.2% | n/a |
| 13 | DeepSeek V4 Flash | Max, vendor harness | 79% | n/a |
| 14 | | | 78.9% | n/a |
| 15 | Mistral Medium 3.5 | Vendor harness | 77.6% | n/a |
| 16 | Claude Opus 4.5 | mini-SWE-agent, High | 76.8% | $0.75 |
| 17 | Gemini 3 Flash | mini-SWE-agent, High | 75.8% | $0.36 |
| 18 | MiniMax M2.5 | mini-SWE-agent, High | 75.8% | $0.07 |
| 19 | Claude Opus 4.6 | mini-SWE-agent | 75.6% | $0.55 |
| 20 | GLM 5 | mini-SWE-agent, High | 72.8% | $0.53 |
| 21 | GPT-5.2 Codex | mini-SWE-agent | 72.8% | $0.45 |
| 22 | GPT-5.2 | mini-SWE-agent, High | 72.8% | $0.47 |
| 23 | Claude Sonnet 4.5 | mini-SWE-agent, High | 71.4% | $0.66 |
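One way to read the score vs cost data is to find the configurations that no other row beats on both axes (a Pareto frontier). The sketch below uses only the mini-SWE-agent rows that report a cost; the `ROWS` layout and the `pareto_frontier` helper are assumptions made for illustration, since the page does not say how its chart is computed.

```python
# (model, score in %, average cost per task in USD), copied from the
# mini-SWE-agent rows in the ranking above.
ROWS = [
    ("Claude Opus 4.5", 76.8, 0.75),
    ("Gemini 3 Flash", 75.8, 0.36),
    ("MiniMax M2.5", 75.8, 0.07),
    ("Claude Opus 4.6", 75.6, 0.55),
    ("GLM 5", 72.8, 0.53),
    ("GPT-5.2 Codex", 72.8, 0.45),
    ("GPT-5.2", 72.8, 0.47),
    ("Claude Sonnet 4.5", 71.4, 0.66),
]


def pareto_frontier(rows):
    """Return rows that are not beaten on both score and cost by a cheaper row."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        # Paying more is only justified by a strictly higher score.
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


for name, score, cost in pareto_frontier(ROWS):
    print(f"{name}: {score}% at ${cost:.2f} per task")
```

With these eight rows, only MiniMax M2.5 and Claude Opus 4.5 remain on the frontier: every other configuration costs more than MiniMax M2.5 without scoring higher.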
Category: Coding
Created by: SWE-bench
Models tested: 21
Configs tested: 23
Leader: Claude Fable 5
Top score: 95%
Updated June 2026