How Claude Fable 5 ranks on benchmarks

Anthropic's new Mythos-class model tops CursorBench 3.1 and posts the strongest agentic-coding scores reported so far. The numbers, with the one caveat that matters.

3 min read · Written by Agents Directory

Anthropic released Claude Fable 5 on June 9. It's the company's first Mythos-class model, priced at $10 per million input tokens and $50 per million output tokens, with a 1M-token context window, and built for long-running autonomous work. Here is where it lands, sourced from Anthropic's announcement and the independent CursorBench leaderboard.
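To make the per-token pricing concrete, here is a minimal sketch that turns the two list prices into a per-request estimate. The function name and the example token counts are illustrative assumptions, and the sketch ignores any caching or batch discounts, which the announcement figures quoted here do not cover.

```python
# List prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: 200k tokens of context in, 8k tokens generated.
example_cost = request_cost_usd(200_000, 8_000)
print(f"Estimated cost: ${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens, long generations dominate the bill faster than long prompts do.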

CursorBench 3.1

CursorBench evaluates models on ambiguous, multi-file tasks taken from real Cursor sessions. It's the closest thing we have to a production agentic-coding benchmark.

  • Fable 5 high (default): 70.6% at $10.81 per task, more than 7 points clear of every other default configuration.
  • Fable 5 Max: 72.9%, the top score on the whole leaderboard.
  • Next-best defaults: Cursor's Composer 2.5 at 63.2% ($0.55 per task, the value outlier), GPT-5.5 high at 62.6%, Claude Opus 4.8 high at 58.4%.
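The trade-off between Fable 5 high and Composer 2.5 can be put in per-point terms. The sketch below uses only the scores and per-task costs quoted in the list; the `Entry` class and the helper names are illustrative assumptions, not part of CursorBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    score_pct: float          # CursorBench 3.1 score, in percent
    cost_per_task_usd: float  # cost per task, in USD


# Figures quoted in the list above.
fable_high = Entry("Fable 5 high", 70.6, 10.81)
composer = Entry("Composer 2.5", 63.2, 0.55)


def score_gap(a: Entry, b: Entry) -> float:
    """Score lead of a over b, in percentage points."""
    return a.score_pct - b.score_pct


def cost_ratio(a: Entry, b: Entry) -> float:
    """How many times more a costs per task than b."""
    return a.cost_per_task_usd / b.cost_per_task_usd


def extra_cost_per_point(a: Entry, b: Entry) -> float:
    """Additional USD per task paid for each extra point of score."""
    return (a.cost_per_task_usd - b.cost_per_task_usd) / score_gap(a, b)


print(f"Lead: {score_gap(fable_high, composer):.1f} points")
print(f"Cost ratio: {cost_ratio(fable_high, composer):.1f}x")
print(f"Extra cost per point: ${extra_cost_per_point(fable_high, composer):.2f} per task")
```

With these figures, Fable 5 high leads by about 7.4 points at roughly 20 times the per-task cost, or about $1.39 per task for each additional point.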

Full interactive leaderboard on our CursorBench page.

Artificial Analysis Intelligence Index

Artificial Analysis publishes a composite 0-100 intelligence score that blends knowledge, reasoning, math, coding, and agentic evaluations. It is the most widely cited all-up benchmark outside vendor tables.

  • Fable 5 (default): 64.9, the top score in our catalog, roughly 7.6 points above Claude Opus 4.7 (57.3) and 7.7 above Gemini 3.1 Pro Preview (57.2).
  • Next in this cut: Qwen3.7 Max (56.6), Gemini 3.5 Flash (55.3), MiniMax-M3 (54.7), Grok 4.3 high (53.2).

Full interactive leaderboard on our Intelligence Index page.

Anthropic's reported numbers

From the announcement, against Claude Mythos Preview, Claude Opus 4.8, GPT-5.5 and Gemini 3.1 Pro:

Columns, in order: Claude Mythos 5 / Fable 5 | Claude Mythos Preview | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro. Each row lists its scores in that order. A row with fewer than five scores has no result for at least one model, and the blank cells are not marked.

  • [Agentic coding] SWE-Bench Pro: 80.3% | 77.8% | 69.2% | 58.6% | 54.2%
  • [Agentic coding] FrontierCode (Diamond), xhigh: 29.3% | 13.4% | 5.7%
  • [Agentic coding] Terminal-Bench 2.1: 88.0%* | 82.7% | 83.4% (Codex CLI) | 70.7% (Gemini CLI)
  • [Knowledge work] GDPval-AA: 1932 | 1890 | 1769 | 1314
  • [Knowledge work vision] GDP.pdf, no tools: 29.8% | 22.5% | 24.9% | 16.7%
  • [Spatial reasoning] Blueprint-Bench 2: 38.6% | 14.5% | 36.2% | 26.5%
  • [Tool use] AutomationBench: 17.4% | 15.5% | 12.9% | 9.6%
  • [Computer use] OSWorld-Verified: 85.0% | 85.4% | 83.4% | 78.7% | 76.2%
  • [Legal] Legal Agent Benchmark: 13.3% | 10.4% | 2.1% | 0.0%
  • [Multidisciplinary reasoning] Humanity's Last Exam, no tools: 59.0%* | 56.8% | 49.8% | 41.4% | 44.4%
  • [Multidisciplinary reasoning] Humanity's Last Exam, with tools: 64.5%* | 64.7% | 57.9% | 52.2% | 51.4%
  • [Biology] BioMysteryBench, hard: 46.1%* | 29.6% | 40.0%
  • [Biology] BioMysteryBench, human solved: 83.9%* | 82.6% | 80.4%
  • [Cybersecurity] ExploitBench (Cap): 78.0%* | 69.0% | 40.0% | 34.0%
  • [Health] HealthBench Professional: 66.0%* | 64.7% | 56.9% | 51.8%

For each benchmark, Anthropic reports the higher of the Claude Mythos 5 and Claude Fable 5 scores; the two land within 1-3 points of each other except on starred (*) benchmarks. See the Mythos caveat below.

The Mythos caveat

For each benchmark, Anthropic reports the higher score of two models: Claude Mythos 5 (the identical model with safeguards lifted, restricted to vetted researchers) and the generally available Fable 5. The two land within 1-3 points of each other except on starred (*) benchmarks, where Fable 5's safeguards redirect cybersecurity and biology queries to Opus 4.8 (under 5% of sessions). On those benchmarks, Fable 5's effective score sits closer to Opus 4.8's.
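When copying the headline column into a spreadsheet or script, it helps to keep the star attached to each number so the caveat is not lost. Below is a minimal sketch, assuming the cells are copied as strings such as "88.0%*"; the `Score` type and `parse_cell` helper are hypothetical, and the sample rows are a subset of the percentage scores from Anthropic's table.

```python
import re
from typing import NamedTuple


class Score(NamedTuple):
    value: float   # score in percent
    starred: bool  # True if the cell carries the (*) caveat


# Matches cells like "80.3%" or "88.0%*".
_CELL = re.compile(r"^(\d+(?:\.\d+)?)%(\*?)$")


def parse_cell(cell: str) -> Score:
    """Split a percentage cell into its numeric value and its star flag."""
    match = _CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"not a percentage cell: {cell!r}")
    return Score(float(match.group(1)), match.group(2) == "*")


# A few headline-column cells from the table, copied verbatim.
headline = {
    "SWE-Bench Pro": "80.3%",
    "Terminal-Bench 2.1": "88.0%*",
    "OSWorld-Verified": "85.0%",
    "ExploitBench (Cap)": "78.0%*",
}

# Benchmarks where the Mythos 5 and Fable 5 scores diverge by more than 1-3 points.
starred = [name for name, cell in headline.items() if parse_cell(cell).starred]
print(starred)
```

Filtering on the star flag before comparing models keeps the benchmarks where the published number may not reflect Fable 5 itself out of a like-for-like ranking.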

Bottom line

At twice Opus 4.8's price, Fable 5 is not the default for everything. But on long-horizon agentic coding it is currently the strongest model available, and the CursorBench cost curve shows the premium buying capability, not just tokens. Pricing, host availability, and sources on the Claude Fable 5 model page.
