CursorBench 3.1

Coding

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.

[Chart: Score vs. cost]

Leaderboard

#   Model              Config      Score  Cost    Tokens  Steps
1   Claude Fable 5     Max         72.9%  $18.02  63,842  76
2   Claude Fable 5     Extra High  72%    $13.74  48,754  63
3   Claude Fable 5     High        70.6%  $10.81  37,173  54
4   Claude Fable 5     Medium      69.8%  $8.27   28,507  47
5   Claude Opus 4.7    Max         64.8%  $11.02  62,989  96
6   GPT-5.5            Extra High  64.3%  $4.37   17,905  46
7   Claude Fable 5     Low         64.2%  $5.70   18,882  36
8   Claude Opus 4.8    Max         63.8%  $7.59   77,370  60
9   Composer 2.5       -           63.2%  $0.55   15,152  37
10  GPT-5.5            High        62.6%  $3.59   13,329  40
11  Claude Opus 4.8    Extra High  62.1%  $6.14   55,622  54
12  Claude Opus 4.7    Extra High  61.6%  $7.11   43,942  72
13  Claude Opus 4.7    High        59.4%  $5.01   32,227  59
14  GPT-5.5            Medium      59.2%  $2.22   9,065   35
15  Claude Opus 4.8    High        58.4%  $4.41   36,788  45
16  Claude Opus 4.8    Medium      56.6%  $3.83   31,684  41
17  Claude Opus 4.8    Low         54.3%  $2.93   22,726  36
18  Claude Opus 4.7    Medium      52.7%  $2.93   19,193  41
19  Composer 2         -           52.2%  $0.56   14,163  40
20  Gemini 3.5 Flash   -           49.8%  $1.94   35,105  79
21  Claude Sonnet 4.6  Max         49%    $3.09   40,280  55
22  Claude Sonnet 4.6  High        48.8%  $3.06   37,352  57
23  GPT-5.5            Low         48.8%  $1.19   4,923   24
24  Claude Opus 4.7    Low         48.3%  $1.87   13,164  29
25  Kimi K2.6          -           47.6%  $1.27   24,783  56
26  Claude Sonnet 4.6  Medium      46%    $2.64   31,360  50
27  Claude Sonnet 4.6  Low         41.5%  $1.89   21,211  50
28  Kimi K2.5          -           31.9%  $0.87   9,446   30
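
One way to read score against cost is to look for configurations that no other configuration beats on both axes at once. The following is a minimal Python sketch of that idea, assuming a hand-copied subset of six leaderboard rows; the `Row` type and the `pareto_frontier` helper are hypothetical names introduced for illustration and are not part of CursorBench.

```python
from typing import NamedTuple


class Row(NamedTuple):
    config: str
    score: float  # percent, from the Score column
    cost: float   # dollars, from the Cost column


# Illustrative subset of leaderboard rows, copied by hand (not the full table).
ROWS = [
    Row("Claude Fable 5 Max", 72.9, 18.02),
    Row("Claude Fable 5 Medium", 69.8, 8.27),
    Row("Claude Opus 4.7 Max", 64.8, 11.02),
    Row("GPT-5.5 Extra High", 64.3, 4.37),
    Row("Composer 2.5", 63.2, 0.55),
    Row("Composer 2", 52.2, 0.56),
]


def pareto_frontier(rows):
    """Return the rows for which no other row scores at least as high at lower or equal cost."""
    frontier = []
    best_score = float("-inf")
    # Walk rows from cheapest to most expensive (higher score first on cost ties)
    # and keep only those that improve on the best score seen so far.
    for row in sorted(rows, key=lambda r: (r.cost, -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


for row in pareto_frontier(ROWS):
    print(f"{row.config}: {row.score}% at ${row.cost:.2f}")
```

Within this subset, the sketch keeps Composer 2.5, GPT-5.5 Extra High, Claude Fable 5 Medium, and Claude Fable 5 Max. Running it over all 28 configurations would give a different, longer frontier.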

Details:

Category: Coding
Created by: Cursor
Models tested: 10
Configs tested: 28
Leader: Claude Fable 5
Top score: 72.9%

Updated June 2026