DeepSWE
Coding
Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Score vs. cost
Leaderboard
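| # | Model | Score | Cost |
|---|---|---|---|
| 1 | GPT-5.5 (Extra High) | 70% | $6.80 |
| 2 | GPT-5.5 (High) | 62% | $4.60 |
| 3 |  | 58% | $8.50 |
| 4 | Claude Opus 4.8 (Extra High) | 57% | $7.00 |
| 5 | GPT-5.4 (Extra High) | 56% | $5.50 |
| 6 |  | 54% | $16.50 |
| 7 | Claude Opus 4.8 (High) | 50% | $4.50 |
| 8 | GPT-5.5 (Medium) | 48% | $2.40 |
| 9 | Claude Opus 4.8 (Medium) | 47% | $3.30 |
| 10 | Claude Opus 4.7 (Extra High) | 45% | $11.50 |
| 11 | Claude Opus 4.7 (High) | 40% | $5.00 |
| 12 |  | 32% | $4.50 |
| 13 | Claude Opus 4.7 (Medium) | 32% | $3.30 |
| 14 | Gemini 3.5 Flash (Medium) | 28% | $7.00 |
| 15 |  | 24% | $4.50 |
| 16 | GPT-5.4 Mini (Extra High) | 24% | $1.50 |
| 17 |  | 20.5% | $5.50 |
| 18 |  | 10% | $2.00 |
| 19 |  | 8% | $5.50 |
| 20 |  | 5% | $1.50 |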
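One way to read the score-versus-cost trade-off is to look for configurations that no other configuration beats on both axes at once. The sketch below is an illustration only, not part of the benchmark: it copies the rank, score, and cost values from the leaderboard rows above and uses a hypothetical helper, `pareto_frontier`, to list the ranks that are not dominated (no other row has a score at least as high and a cost at least as low, with one of the two strictly better). Rows are identified by rank because several entries have no model name.

```python
# (rank, score in percent, cost in dollars), copied from the leaderboard rows.
ROWS = [
    (1, 70.0, 6.80), (2, 62.0, 4.60), (3, 58.0, 8.50), (4, 57.0, 7.00),
    (5, 56.0, 5.50), (6, 54.0, 16.50), (7, 50.0, 4.50), (8, 48.0, 2.40),
    (9, 47.0, 3.30), (10, 45.0, 11.50), (11, 40.0, 5.00), (12, 32.0, 4.50),
    (13, 32.0, 3.30), (14, 28.0, 7.00), (15, 24.0, 4.50), (16, 24.0, 1.50),
    (17, 20.5, 5.50), (18, 10.0, 2.00), (19, 8.0, 5.50), (20, 5.0, 1.50),
]


def pareto_frontier(rows):
    """Return the ranks of rows that no other row dominates.

    Hypothetical helper for illustration. A row is dominated when some
    other row scores at least as high and costs at least as little,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for rank, score, cost in rows:
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for _, s, c in rows
        )
        if not dominated:
            frontier.append(rank)
    return frontier


print(pareto_frontier(ROWS))  # prints [1, 2, 7, 8, 16]
```

Under this definition, the rows at ranks 1, 2, 7, 8, and 16 form the frontier; every other row is matched or beaten on score by a configuration that costs no more.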
Details:
Category: Coding
Created by: Datacurve
Models tested: 12
Configs tested: 20
Leader: GPT-5.5
Top score: 70%
Updated June 2026