SWE-bench Pro

Name: SWE-bench Pro leaderboard
Creator: Scale AI

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.

Scale's standardized board lags vendor launches by months: Claude Fable 5, Claude Opus 4.8, and GPT-5.5 have no Scale-run scores yet. Anthropic self-reports Fable 5 at 80.3 using its own harness, so that figure is not comparable with the scores here.

Tasks are built from curated repositories with reproducible Docker environments, fail-to-pass test transitions scraped from commit history, and human-augmented problem statements verified at three checkpoints. The full benchmark has 1,865 problems split across three sets: a public set (731 instances from GPL-licensed repositories, which is the leaderboard shown here), a commercial set drawn from proprietary startup codebases, and a held-out set. A task counts as resolved only if the previously failing tests pass and the existing tests do not regress. Scale runs every model through the same mini-swe-agent harness, so scores isolate model capability from harness quality. Scores carry 95% confidence intervals, shown as whiskers on the leaderboard.
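The resolution rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Scale's evaluation code: the `TaskResult` structure and the function names are hypothetical, and it assumes each task records one pass/fail outcome per test after the model's patch is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the model's patch (hypothetical structure)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that already passed; True = still passes


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test now passes and no existing test regresses."""
    # Assumption: a task with no fail-to-pass tests cannot count as resolved.
    targets_fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    no_regressions = all(result.pass_to_pass.values())
    return targets_fixed and no_regressions


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, the kind of score a leaderboard row reports."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this rule a patch that fixes the target tests but breaks one existing test scores the same as a patch that fixes nothing, which is why partial fixes earn no credit.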

Leaderboard

Model                               Score     95% CI
GPT-5.4 (xhigh)                     59.1%     ±3.56%
Claude Opus 4.6 (thinking)          51.9%     ±3.61%
Gemini 3.1 Pro Preview (thinking)   46.1%     ±3.6%
Claude Opus 4.5                     45.89%    ±3.6%
Claude Sonnet 4.5                   43.6%     ±3.6%
Gemini 3 Pro                        43.3%     ±3.6%
GPT-5 (high)                        41.78%    ±3.49%
GPT-5.2 Codex                       41.04%    ±3.57%
Claude Haiku 4.5                    39.45%    ±3.55%
Qwen3 Coder 480B A35B               38.7%     ±3.55%
MiniMax M2.1                        36.81%    ±3.55%
Gemini 3 Flash                      34.63%    ±3.55%
GPT-5.2                             29.94%    ±2.15%
Qwen3 235B A22B                     21.41%    ±2.25%
gpt-oss-120b                        16.2%     ±2.67%
DeepSeek V3.2                       15.56%    ±2.63%
Gemma 3 27B                         11.38%    ±2.15%
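As a rough guide to reading the 95% intervals, the sketch below computes a normal-approximation (Wald) half-width for a resolve rate and checks whether two intervals overlap. Both the choice of the Wald interval and the task count of 731 are assumptions; the page does not state how Scale computes its whiskers. Under these assumptions the top entry's interval is reproduced closely, but not every listed interval is, so the leaderboard's actual method or per-model task count may differ.

```python
import math


def ci_half_width(score_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation interval.

    Assumption: each task is an independent pass/fail trial (a Wald interval).
    """
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


def intervals_overlap(a: float, a_hw: float, b: float, b_hw: float) -> bool:
    """True if the intervals a ± a_hw and b ± b_hw share any point."""
    return abs(a - b) <= a_hw + b_hw


# Top entry: 59.1% on an assumed 731 tasks.
print(round(ci_half_width(59.1, 731), 2))

# Ranks 3 and 4 (46.1 ± 3.6 vs 45.89 ± 3.6) overlap heavily.
print(intervals_overlap(46.1, 3.6, 45.89, 3.6))

# Ranks 1 and 2 (59.1 ± 3.56 vs 51.9 ± 3.61) do not overlap.
print(intervals_overlap(59.1, 3.56, 51.9, 3.61))
```

Overlapping intervals are only a quick heuristic, not a formal significance test, but they suggest that several mid-table rankings should not be read as firm orderings.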

All results

#    Model                               Score     Provider
1    GPT-5.4 (xhigh)                     59.1%     OpenAI
2    Claude Opus 4.6 (thinking)          51.9%     Anthropic
3    Gemini 3.1 Pro Preview (thinking)   46.1%     Google DeepMind
4    Claude Opus 4.5                     45.89%    Anthropic
5    Claude Sonnet 4.5                   43.6%     Anthropic
6    Gemini 3 Pro                        43.3%     Google DeepMind
7    GPT-5 (high)                        41.78%    OpenAI
8    GPT-5.2 Codex                       41.04%    OpenAI
9    Claude Haiku 4.5                    39.45%    Anthropic
10   Qwen3 Coder 480B A35B               38.7%     Qwen
11   MiniMax M2.1                        36.81%    MiniMax
12   Gemini 3 Flash                      34.63%    Google DeepMind
13   GPT-5.2                             29.94%    OpenAI
14   Qwen3 235B A22B                     21.41%    Qwen
15   gpt-oss-120b                        16.2%     OpenAI
16   DeepSeek V3.2                       15.56%    DeepSeek
17   Gemma 3 27B                         11.38%    Google DeepMind

Sources:

morphllm SWE-bench Pro tracker
Scale SEAL SWE-bench Pro (public set)
ScaleAI/SWE-bench_Pro on Hugging Face
SWE-bench Pro paper (arXiv 2509.16941)
scaleapi/SWE-bench_Pro-os

Details:

Category: Coding
Created by: Scale AI
Models tested: 17
Leader: GPT-5.4
Top score: 59.1%

Updated June 2026