SWE-bench Verified

Name: SWE-bench Verified leaderboard
Creator: SWE-bench

Coding

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.

Vendor harness rows and standardized mini-SWE-agent rows are listed side by side; compare models within one regime, not across them.

SWE-bench Verified is a 500-instance subset of SWE-bench, screened by human annotators (in an OpenAI collaboration with the Princeton authors) to remove underspecified or unsolvable tasks. Each task pairs a real GitHub issue with a repository snapshot, and the model must produce a patch that makes the issue's failing tests pass without breaking the tests that already passed.

Two score regimes coexist. Vendors self-report scores obtained with their own agent harnesses (the headline rows here), while the official swebench.com leaderboard reruns models on a standardized mini-SWE-agent scaffold, which typically lands 10 to 20 percentage points lower and publishes an average cost per task (those rows power the score vs. cost chart).

Tasks are Python-only issues from pre-2023 repositories, so contamination is a standing concern (the issues and their fixes may already appear in training data), and top scores are nearing saturation.
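The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function names `is_resolved` and `resolve_rate`, the test identifiers, and the shape of the `results` mapping are all assumptions made for the example. The two test lists correspond to the `FAIL_TO_PASS` and `PASS_TO_PASS` fields in the published dataset.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
) -> bool:
    """Return True if a patch counts as resolving one task.

    `results` maps a test id to True when that test passed after the
    patch was applied (hypothetical format chosen for this sketch).
    A test missing from `results` is treated as a failure.
    """
    # Every test that failed before the fix must now pass.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    # Every test that passed before the fix must still pass.
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


def resolve_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks resolved, the number shown in the Score column."""
    return 100.0 * sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical test ids, for illustration only.
    ok = is_resolved(["test_bug"], ["test_old"], {"test_bug": True, "test_old": True})
    regressed = is_resolved(["test_bug"], ["test_old"], {"test_bug": True, "test_old": False})
    print(ok, regressed)
    # 384 resolved tasks out of 500 corresponds to a score of 76.8%.
    print(resolve_rate([True] * 384 + [False] * 116))
```

Because the benchmark has 500 tasks, each task is worth 0.2 percentage points, so every reported score should be a multiple of 0.2 when it comes from a single full run.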

Score vs. cost (chart not reproduced; it plots score against average cost per task for the mini-SWE-agent rows)

Leaderboard

| # | Model | Harness | Score | Avg. cost per task |
|---|---|---|---|---|
| 1 | Claude Fable 5 | Vendor harness | 95% | n/a |
| 2 | GPT-5.5 | Vendor harness | 88.7% | n/a |
| 3 | Claude Opus 4.8 | Vendor harness | 88.6% | n/a |
| 4 | Claude Opus 4.7 | Vendor harness | 87.6% | n/a |
| 5 | DeepSeek V4 | Pro Max, vendor harness | 80.6% | n/a |
| 6 | Gemini 3.1 Pro Preview | Vendor harness | 80.6% | n/a |
| 7 | MiniMax-M3 | Vendor harness | 80.5% | n/a |
| 8 | Qwen3.7 Max | Vendor harness | 80.4% | n/a |
| 9 | Kimi K2.6 | Vendor harness | 80.2% | n/a |
| 10 | GPT-5.2 | Vendor harness | 80% | n/a |
| 11 | Claude Sonnet 4.6 | Vendor harness | 79.6% | n/a |
| 12 | Claude Opus 4.5 | live-SWE-agent | 79.2% | n/a |
| 13 | DeepSeek V4 Flash | Max, vendor harness | 79% | n/a |
| 14 | MiMo-V2.5-Pro | Vendor harness | 78.9% | n/a |
| 15 | Mistral Medium 3.5 | Vendor harness | 77.6% | n/a |
| 16 | Claude Opus 4.5 | mini-SWE-agent, High | 76.8% | $0.75 |
| 17 | Gemini 3 Flash | mini-SWE-agent, High | 75.8% | $0.36 |
| 18 | MiniMax M2.5 | mini-SWE-agent, High | 75.8% | $0.07 |
| 19 | Claude Opus 4.6 | mini-SWE-agent | 75.6% | $0.55 |
| 20 | GLM 5 | mini-SWE-agent, High | 72.8% | $0.53 |
| 21 | GPT-5.2 Codex | mini-SWE-agent | 72.8% | $0.45 |
| 22 | GPT-5.2 | mini-SWE-agent, High | 72.8% | $0.47 |
| 23 | Claude Sonnet 4.5 | mini-SWE-agent, High | 71.4% | $0.66 |
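One way to compare models within a single regime is to look at which mini-SWE-agent rows are not beaten on both score and cost at once. The sketch below hard-codes the eight mini-SWE-agent rows from the leaderboard above and computes that Pareto frontier. The `Row` type and the function names are assumptions made for this example, not part of any SWE-bench tooling.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # percent of tasks resolved
    cost: float   # average USD per task


# mini-SWE-agent rows copied from the leaderboard above.
MINI_SWE_AGENT_ROWS = [
    Row("Claude Opus 4.5", 76.8, 0.75),
    Row("Gemini 3 Flash", 75.8, 0.36),
    Row("MiniMax M2.5", 75.8, 0.07),
    Row("Claude Opus 4.6", 75.6, 0.55),
    Row("GLM 5", 72.8, 0.53),
    Row("GPT-5.2 Codex", 72.8, 0.45),
    Row("GPT-5.2", 72.8, 0.47),
    Row("Claude Sonnet 4.5", 71.4, 0.66),
]


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Rows that no other row dominates, sorted from cheapest to most expensive."""
    frontier = [r for r in rows if not any(dominates(other, r) for other in rows)]
    return sorted(frontier, key=lambda r: r.cost)


if __name__ == "__main__":
    for row in pareto_frontier(MINI_SWE_AGENT_ROWS):
        print(f"{row.model}: {row.score}% at ${row.cost:.2f} per task")
```

With the figures listed here, only two rows survive: MiniMax M2.5 (75.8% at $0.07) and Claude Opus 4.5 (76.8% at $0.75). Vendor-harness rows are left out on purpose, since they report no cost and belong to a different scoring regime.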

Sources:

Official leaderboard data (leaderboards.json)
SWE-bench official leaderboard
OpenAI: Introducing SWE-bench Verified
SWE-bench_Verified on Hugging Face
SWE-bench paper (ICLR 2024)

Details:

Category: Coding
Created by: SWE-bench
Models tested: 21
Configs tested: 23
Leader: Claude Fable 5
Top score: 95%

Updated June 2026