Agents Directory
AI coding benchmarks
The evaluations that matter for agentic coding, and which models top each leaderboard.
Updated June 2026
Intelligence
Artificial Analysis Intelligence Index
The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5 (Claude), 64.9
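As a rough illustration of how a composite index can fold several evaluations into one 0–100 number, here is a minimal sketch. It assumes an equal-weight average of component scores that are already on a 0–100 scale; the page does not state the actual weighting or normalization, so the function name `composite_index`, the component names, and the numbers are all hypothetical.

```python
from statistics import mean


def composite_index(scores: dict[str, float]) -> float:
    """Equal-weight mean of per-evaluation scores (an assumed scheme, not the real one)."""
    if not scores:
        raise ValueError("at least one evaluation score is required")
    for name, value in scores.items():
        # Each component is assumed to be pre-normalized to the 0-100 range.
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score {value} is outside 0-100")
    return round(mean(list(scores.values())), 1)


# Made-up component scores for illustration only, not real results.
example = {"knowledge": 70.0, "reasoning": 60.0, "coding": 50.0}
print(composite_index(example))
```

Under this assumption, a model can lead the composite without leading every component, which is why the per-benchmark leaders listed on this page can differ from the index leader.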
Coding
Cursor
CursorBench 3.1
Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bugfinding, planning, and code review.
Leader: Claude Fable 5 (Claude), 72.9%
Coding
FrontierCode Diamond
The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8 (Claude), 13.4%
Coding
FrontierCode Main
Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5 (Claude), 46.3%