SWE-bench Pro
Coding
Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.
Scale's standardized board lags vendor launches by months: Claude Fable 5, Claude Opus 4.8, and GPT-5.5 have no Scale-run scores yet. Anthropic self-reports 80.3 for Fable 5, but that figure comes from its own harness and is not comparable to the scores here.
Tasks are built from curated repositories with reproducible Docker environments, commit-scraped fail-to-pass test transitions, and human-augmented problem statements verified at three checkpoints. The full benchmark has 1,865 problems across a public set (731 GPL-repo instances, the leaderboard shown here), a commercial set of proprietary startup codebases, and a held-out set. A task counts as resolved only if the failing tests pass and existing tests do not regress, and Scale runs every model through the same mini-swe-agent harness, so scores isolate model capability from harness quality. Scores carry 95% confidence intervals, shown as whiskers on the leaderboard.
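The resolution rule described above can be sketched in a few lines. This is a minimal illustration, not Scale's evaluation code: the `TaskResult` type, its field names, and the helper functions are all hypothetical, and treating a task with no target tests as unresolved is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after applying a model's patch.

    All names here are hypothetical, chosen for this sketch.
    """
    fail_to_pass: dict[str, bool]  # previously failing target tests -> passing now?
    pass_to_pass: dict[str, bool]  # previously passing tests -> still passing?


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test passes and no existing test regresses."""
    if not result.fail_to_pass:
        # Assumption: a task without target tests cannot count as resolved.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, as a value in [0, 1]."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

Under this rule, a patch that fixes the target test but breaks an existing one scores the same as a patch that fixes nothing: both leave the task unresolved.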
Leaderboard
- GPT-5.4 (xhigh): 59.1% ± 3.56%
- Claude Opus 4.6 (thinking): 51.9% ± 3.61%
- Gemini 3.1 Pro Preview (thinking): 46.1% ± 3.6%
- 45.89% ± 3.6%
- 43.6% ± 3.6%
- 43.3% ± 3.6%
- GPT-5 (high): 41.78% ± 3.49%
- 41.04% ± 3.57%
- 39.45% ± 3.55%
- 38.7% ± 3.55%
- 36.81% ± 3.55%
- 34.63% ± 3.55%
- 29.94% ± 2.15%
- 21.41% ± 2.25%
- 16.2% ± 2.67%
- 15.56% ± 2.63%
- 11.38% ± 2.15%
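To give a feel for where margins of this size come from, here is a rough sketch using the normal-approximation (Wald) interval over the 731 public tasks. The method Scale actually uses for its whiskers is not stated here, so the formula is an assumption, and the function names are hypothetical. It lands close to the reported ±3.56% for the top entry but does not reproduce every listed margin.

```python
import math

PUBLIC_SET_SIZE = 731  # number of public tasks stated in the text


def wald_half_width(score: float, n: int = PUBLIC_SET_SIZE, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a resolve rate.

    `score` is a fraction in [0, 1]; z = 1.96 corresponds to roughly 95% coverage.
    Assumption: tasks are treated as independent pass/fail trials.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(score * (1.0 - score) / n)


def intervals_overlap(score_a: float, margin_a: float,
                      score_b: float, margin_b: float) -> bool:
    """True if the two reported intervals (score ± margin) share any point."""
    return abs(score_a - score_b) <= margin_a + margin_b


# Top entry: a 59.1% resolve rate gives a half-width of about 0.0356.
top_margin = wald_half_width(0.591)

# Using the reported margins (in percentage points):
first_vs_second = intervals_overlap(59.1, 3.56, 51.9, 3.61)   # False
second_vs_third = intervals_overlap(51.9, 3.61, 46.1, 3.6)    # True
```

Overlapping intervals are only a rough signal that two adjacent ranks may not be reliably separated; non-overlap is a stricter condition than a formal significance test would require.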
All results
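| # | Model | Score | Provider |
|---|---|---|---|
| 1 | GPT-5.4 (xHigh) | 59.1% | OpenAI |
| 2 | Claude Opus 4.6 (Thinking) | 51.9% | Anthropic |
| 3 | Gemini 3.1 Pro Preview (Thinking) | 46.1% | Google DeepMind |
| 4 | | 45.89% | Anthropic |
| 5 | | 43.6% | Anthropic |
| 6 | | 43.3% | Google DeepMind |
| 7 | GPT-5 (High) | 41.78% | OpenAI |
| 8 | | 41.04% | OpenAI |
| 9 | | 39.45% | Anthropic |
| 10 | | 38.7% | Qwen |
| 11 | | 36.81% | MiniMax |
| 12 | | 34.63% | Google DeepMind |
| 13 | | 29.94% | OpenAI |
| 14 | | 21.41% | Qwen |
| 15 | | 16.2% | OpenAI |
| 16 | | 15.56% | DeepSeek |
| 17 | | 11.38% | Google DeepMind |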
Sources: