Work in progress: Agents Directory has just launched. Stay tuned; more content is on the way.

Rankings

Independent rankings of models, benchmarks, and agents for AI coding, showing the current leaders. Refreshed regularly.

Updated June 2026

Best models for your agent

  • 1. Claude Fable 5
    Provider: Anthropic
    CursorBench: 72.9%
    Context: 1M tokens
    Input price: $10/M tokens
  • 2. Claude Opus 4.7
    Provider: Anthropic
    CursorBench: 64.8%
    Context: 1M tokens
    Input price: $5/M tokens
  • 3. GPT-5.5
    Provider: OpenAI
    CursorBench: 64.3%
    Context: 1.05M tokens
    Input price: $5/M tokens
  • 4. Claude Opus 4.8
    Provider: Anthropic
    CursorBench: 63.8%
    Context: 1M tokens
    Input price: $5/M tokens
  • 5. Composer 2.5
    Provider: Cursor
    CursorBench: 63.2%
    Context: 200K tokens
    Input price: $0.5/M tokens
  • 6. Composer 2
    Provider: Cursor
    CursorBench: 52.2%
    Context: 200K tokens
    Input price: $0.5/M tokens
  • 7. Gemini 3.5 Flash
    Provider: Google
    CursorBench: 49.8%
    Context: 1.049M tokens
    Input price: $1.5/M tokens
  • 8. Claude Sonnet 4.6
    Provider: Anthropic
    CursorBench: 49%
    Context: 1M tokens
    Input price: $3/M tokens
  • 9. Kimi K2.6
    Provider: Moonshot
    CursorBench: 47.6%
    Context: 262K tokens
    Input price: $0.68/M tokens
  • 10. Kimi K2.5
    Provider: Moonshot
    CursorBench: 31.9%
    Context: 262K tokens
    Input price: $0.35/M tokens

Benchmark leaderboards

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5, 64.9

CursorBench 3.1

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.
Leader: Claude Fable 5, 72.9%

FrontierCode Diamond

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8, 13.4%

FrontierCode Main

Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5, 46.3%

Top AI coding agents

  • Hermes: Skills, Integrations & Self-hosting for Hermes
  • Claude Code: Skills, Plugins & MCP Servers for Claude Code
  • Codex: Skills & MCP Servers for OpenAI Codex
  • OpenClaw: Skills & Automation for OpenClaw