Rankings
Independent model, benchmark, and agent rankings for AI coding, showing what actually leads right now. Refreshed regularly.
Updated June 2026
Best models for your agent
Best free models for Hermes
Want to run Hermes without paying per token? Every model here costs $0 on OpenRouter (rate limits apply) and still handles real agent work.
Best models for Claude Code
Claude Code is tuned for the Claude family, but it's worth knowing where each model lands on capability and cost for your workload.
Best models for Codex
Codex only runs OpenAI models, so the real choice is which GPT tier fits your work. Here's how the three options in the picker compare on capability and cost.
Best models for Hermes
Hermes runs your skills locally and leans on the model for planning and skill use. These are the models that pair best with it right now, ranked by real agentic-coding performance and day-to-day reliability.
Best models for OpenClaw
OpenClaw lives in your messaging apps and acts on your behalf, so it rewards models with reliable skill use, long sessions, and sane costs. These picks balance capability against the always-on token bill.
Best open-source models for Hermes
Open-weight models you can inspect, fine-tune, and self-host. Ideal for privacy-sensitive or air-gapped Hermes setups.
Best models for agents
| # | Model | Provider | Open | CursorBench | Context | Input price |
|---|-------|----------|------|-------------|---------|-------------|
| 1 | Claude Fable 5 | Anthropic | No | 72.9% | 1M | $10/M |
| 2 | Claude Opus 4.7 | Anthropic | No | 64.8% | 1M | $5/M |
| 3 | GPT-5.5 | OpenAI | No | 64.3% | 1.05M | $5/M |
| 4 | Claude Opus 4.8 | Anthropic | No | 63.8% | 1M | $5/M |
| 5 | Composer 2.5 | Cursor | No | 63.2% | 200K | $0.5/M |
| 6 | Composer 2 | Cursor | No | 52.2% | 200K | $0.5/M |
| 7 | Gemini 3.5 Flash | Google | No | 49.8% | 1.049M | $1.5/M |
| 8 | Claude Sonnet 4.6 | Anthropic | No | 49% | 1M | $3/M |
| 9 | Kimi K2.6 | Moonshot | Yes | 47.6% | 262K | $0.68/M |
| 10 | Kimi K2.5 | Moonshot | Yes | 31.9% | 262K | $0.35/M |
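As a rough way to weigh capability against cost, the sketch below re-sorts the ten models listed above by CursorBench points per dollar of input price. The `points_per_dollar` metric and the `Row` type are illustrative assumptions, not how this page ranks models. The metric also ignores output-token prices, context size, and rate limits, so treat it as a starting point only.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    cursorbench: float   # CursorBench score, in percent
    input_price: float   # USD per 1M input tokens


# Scores and prices copied from the ranking above.
ROWS = [
    Row("Claude Fable 5", 72.9, 10.0),
    Row("Claude Opus 4.7", 64.8, 5.0),
    Row("GPT-5.5", 64.3, 5.0),
    Row("Claude Opus 4.8", 63.8, 5.0),
    Row("Composer 2.5", 63.2, 0.5),
    Row("Composer 2", 52.2, 0.5),
    Row("Gemini 3.5 Flash", 49.8, 1.5),
    Row("Claude Sonnet 4.6", 49.0, 3.0),
    Row("Kimi K2.6", 47.6, 0.68),
    Row("Kimi K2.5", 31.9, 0.35),
]


def points_per_dollar(row: Row) -> float:
    """Hypothetical value metric: benchmark points per USD of input price."""
    return row.cursorbench / row.input_price


def rank_by_value(rows: list[Row]) -> list[Row]:
    """Return rows sorted from best to worst value under the metric above."""
    return sorted(rows, key=points_per_dollar, reverse=True)


if __name__ == "__main__":
    for row in rank_by_value(ROWS):
        print(f"{row.model:<18} {points_per_dollar(row):7.1f} points per $")
```

Under this metric the cheapest models rise to the top and the highest-scoring model falls to the bottom, which is why a raw score ranking and a cost-aware ranking can disagree for always-on agents.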
Benchmark leaderboards
Artificial Analysis Intelligence Index
The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5 (64.9)
CursorBench 3.1
Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bugfinding, planning, and code review.
Leader: Claude Fable 5 (72.9%)
FrontierCode Diamond
The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8 (13.4%)
FrontierCode Main
Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5 (46.3%)
Top AI coding agents
Agent
- Hermes: Skills, Integrations & Self-hosting for Hermes
- Claude Code: Skills, Plugins & MCP Servers for Claude Code
- Codex: Skills & MCP Servers for OpenAI Codex
- OpenClaw: Skills & Automation for OpenClaw