FrontierCode Diamond
Coding
The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Claude Fable 5's Diamond score is still pending from Cognition, so it does not yet appear on this board.
Diamond comprises the 50 most difficult FrontierCode tasks, the hardest production-code problems drawn from real open source repositories. The tasks use the same maintainer-merge rubric as FrontierCode, with hard blocking criteria (correctness, regression safety, scope). A trial's score is its gated weighted rubric value: the value counts only if the trial clears every blocker, and is 0 otherwise. The reported score is the average over the tasks. Because Diamond is the toughest agentic-coding measure on the board, scores stay low.
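The gating rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cognition's grader: the `Trial` record, its field names, the one-trial-per-task setup, and the normalization of the weighted rubric value by the total weight are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record of one graded attempt at one task."""
    # Blocking criteria, e.g. {"correctness": True, "regression_safety": True, "scope": True}
    blockers_passed: dict[str, bool]
    # Rubric items as (weight, score) pairs, with each score assumed to lie in [0, 1]
    rubric: list[tuple[float, float]]


def gated_score(trial: Trial) -> float:
    """Weighted rubric value, counted only if every blocker is cleared (else 0)."""
    if not all(trial.blockers_passed.values()):
        return 0.0
    total_weight = sum(weight for weight, _ in trial.rubric)
    if total_weight == 0:
        return 0.0
    # Assumption: the weighted value is normalized by the sum of the weights.
    return sum(weight * score for weight, score in trial.rubric) / total_weight


def benchmark_score(trials: list[Trial]) -> float:
    """Average of the gated scores over the tasks (one trial per task assumed)."""
    if not trials:
        return 0.0
    return sum(gated_score(trial) for trial in trials) / len(trials)
```

Under these assumptions, a trial that clears every blocker with rubric items (weight 2, score 1.0) and (weight 1, score 0.5) scores 2.5/3, while a trial that fails any single blocker scores 0 regardless of its rubric items. That all-or-nothing gate is one reason averaged scores stay low.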
Leaderboard
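# | Model | Score | Provider
- 1 | Claude Opus 4.8 | 13.4% | Anthropic
- 2 | | 6.3% | OpenAI
- 3 | | 5.2% | Anthropic
- 4 | | 4.7% | Google DeepMind
- 5 | | 4.6% | OpenAI
- 6 | | 3.8% | Moonshot AI
- 7 | | 3.5% | Anthropic
- 8 | | 2.4% | MiniMax
- 9 | | 1.1% | MiniMax
- 10 | | 1% | Moonshot AI
- 11 | | 0.7% | Google DeepMind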
Details:
Category: Coding
Created by: Cognition
Models tested: 11
Leader: Claude Opus 4.8
Top score: 13.4%
Updated June 2026