SWE-bench Pro

Coding

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.

Scale's standardized board lags vendor launches by months: Claude Fable 5, Claude Opus 4.8 and GPT-5.5 have no Scale-run scores yet (Anthropic self-reports Fable 5 at 80.3 with its own harness, which is not comparable).

Tasks are built from curated repositories with reproducible Docker environments, fail-to-pass test transitions scraped from commit history, and human-augmented problem statements verified at three checkpoints. The full benchmark has 1,865 problems across three sets: a public set (731 instances from GPL repositories, the leaderboard shown here), a commercial set of proprietary startup codebases, and a held-out set. A task counts as resolved only if the previously failing tests pass and the existing tests do not regress. Scale runs every model through the same mini-swe-agent harness, so scores isolate model capability from harness quality. Scores carry 95% confidence intervals, shown as whiskers on the leaderboard.
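The resolution rule and the confidence intervals can be illustrated with a minimal sketch. Everything named here is hypothetical: the `TaskResult` record, its field names, and the choice of a normal-approximation (Wald) interval are assumptions for illustration, not Scale's actual schema or statistical method.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative, not Scale's schema.
    fail_to_pass: dict[str, bool]  # tests failing before the patch -> pass after?
    pass_to_pass: dict[str, bool]  # tests passing before the patch -> still pass?


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test now passes and nothing regressed."""
    # Require at least one target test so an empty record cannot count as a win.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate_ci(results: list[TaskResult], z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, low, high) using an assumed Wald interval, clamped to [0, 1]."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    p = sum(is_resolved(r) for r in results) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)
```

Under this sketch, a patch that fixes the target tests but breaks one previously passing test scores as unresolved, which is the stricter of the two conditions in the rule above. The interval narrows as the number of tasks grows, so a 731-task board gives tighter whiskers than a small subset would.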

Leaderboard
[Chart: resolve rate per model, axis from 0% to 100%]

Updated June 2026