CursorBench 3.1
Coding
Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.
Score vs. cost
Leaderboard
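| # | Model | Score | Cost | Tokens | Steps |
|---|---|---|---|---|---|
| 1 | | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Claude Fable 5 (Extra High) | 72% | $13.74 | 48,754 | 63 |
| 3 | Claude Fable 5 (High) | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Claude Fable 5 (Medium) | 69.8% | $8.27 | 28,507 | 47 |
| 5 | | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 (Extra High) | 64.3% | $4.37 | 17,905 | 46 |
| 7 | | 64.2% | $5.70 | 18,882 | 36 |
| 8 | | 63.8% | $7.59 | 77,370 | 60 |
| 9 | | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 (High) | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Claude Opus 4.8 (Extra High) | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Claude Opus 4.7 (Extra High) | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Claude Opus 4.7 (High) | 59.4% | $5.01 | 32,227 | 59 |
| 14 | GPT-5.5 (Medium) | 59.2% | $2.22 | 9,065 | 35 |
| 15 | Claude Opus 4.8 (High) | 58.4% | $4.41 | 36,788 | 45 |
| 16 | Claude Opus 4.8 (Medium) | 56.6% | $3.83 | 31,684 | 41 |
| 17 | | 54.3% | $2.93 | 22,726 | 36 |
| 18 | Claude Opus 4.7 (Medium) | 52.7% | $2.93 | 19,193 | 41 |
| 19 | | 52.2% | $0.56 | 14,163 | 40 |
| 20 | | 49.8% | $1.94 | 35,105 | 79 |
| 21 | | 49% | $3.09 | 40,280 | 55 |
| 22 | | 48.8% | $3.06 | 37,352 | 57 |
| 23 | GPT-5.5 (Low) | 48.8% | $1.19 | 4,923 | 24 |
| 24 | | 48.3% | $1.87 | 13,164 | 29 |
| 25 | | 47.6% | $1.27 | 24,783 | 56 |
| 26 | Claude Sonnet 4.6 (Medium) | 46% | $2.64 | 31,360 | 50 |
| 27 | | 41.5% | $1.89 | 21,211 | 50 |
| 28 | | 31.9% | $0.87 | 9,446 | 30 |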
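One way to read the score-versus-cost trade-off in the leaderboard above is to find the configurations that no other configuration beats on both axes at once (a Pareto frontier). The following is a minimal sketch, not part of the benchmark itself: the `ROWS` list and the `pareto_frontier` helper are hypothetical names introduced here, the figures are copied from the Score and Cost columns, and rows are identified by rank only because several model names are missing from the listing. The unit of the Cost column (per task or per run) is not stated in the source, so the code treats it as an opaque dollar figure.

```python
# (rank, score in percent, cost in USD), copied from the leaderboard rows.
ROWS = [
    (1, 72.9, 18.02), (2, 72.0, 13.74), (3, 70.6, 10.81), (4, 69.8, 8.27),
    (5, 64.8, 11.02), (6, 64.3, 4.37), (7, 64.2, 5.70), (8, 63.8, 7.59),
    (9, 63.2, 0.55), (10, 62.6, 3.59), (11, 62.1, 6.14), (12, 61.6, 7.11),
    (13, 59.4, 5.01), (14, 59.2, 2.22), (15, 58.4, 4.41), (16, 56.6, 3.83),
    (17, 54.3, 2.93), (18, 52.7, 2.93), (19, 52.2, 0.56), (20, 49.8, 1.94),
    (21, 49.0, 3.09), (22, 48.8, 3.06), (23, 48.8, 1.19), (24, 48.3, 1.87),
    (25, 47.6, 1.27), (26, 46.0, 2.64), (27, 41.5, 1.89), (28, 31.9, 0.87),
]


def pareto_frontier(rows):
    """Return the rows that are not dominated, ordered from cheapest up.

    A row is dominated if some other row scores at least as high for no
    more cost, with at least one of the two strictly better.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for rank, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        # Keep a row only if it improves on every cheaper row's score.
        if score > best_score:
            frontier.append((rank, score, cost))
            best_score = score
    return frontier


frontier_ranks = [rank for rank, _, _ in pareto_frontier(ROWS)]
print(frontier_ranks)  # [9, 6, 4, 3, 2, 1]
```

Under these assumptions, only six of the 28 configurations sit on the frontier; every other row is matched or beaten on score by a configuration that costs no more.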
Details:
Category: Coding
Created by: Cursor
Models tested: 10
Configs tested: 28
Leader: Claude Fable 5
Top score: 72.9%
Updated June 2026