Aider Polyglot

Name: Aider Polyglot leaderboard
Creator: Aider

Coding

The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.

The board has not been refreshed since November 2025, so current frontier models (Claude Fable 5, Claude Opus 4.8, GPT-5.5) do not appear yet. It remains the reference for the prior generation.

Each model attempts the 225 hardest Exercism practice exercises spanning C++, Go, Java, JavaScript, Python and Rust, driving aider end to end. The model must emit changes in a structured edit format (diff, whole-file, or architect mode). Solutions are checked by running each exercise's unit tests, and one retry is allowed after the model sees the failures; percent correct is the share of tasks passing after that second attempt. Every run also publishes its total USD cost (shown here divided by 225 as cost per task), which makes the board a clean score-versus-cost frontier. All runs live as YAML in the aider GitHub repo, and community result PRs are accepted.
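As a rough illustration of the scoring arithmetic described above, the sketch below computes percent correct after the allowed retry and the per-task cost. The `TaskResult` type, the function names and the example figures are hypothetical and are not taken from aider's code or its result files.

```python
from dataclasses import dataclass

TOTAL_TASKS = 225  # benchmark size stated in the text


@dataclass
class TaskResult:
    """Outcome of one exercise (hypothetical structure, for illustration only)."""
    passed_first_try: bool
    passed_second_try: bool  # only relevant when the first try failed


def percent_correct(results):
    """Share of tasks whose unit tests pass within two attempts, as a percentage."""
    solved = sum(1 for r in results if r.passed_first_try or r.passed_second_try)
    return 100.0 * solved / len(results)


def cost_per_task(total_cost_usd, n_tasks=TOTAL_TASKS):
    """Total run cost divided by the number of tasks."""
    return total_cost_usd / n_tasks


# Made-up run: 150 tasks pass immediately, 48 pass on the retry, 27 never pass.
results = (
    [TaskResult(True, False)] * 150
    + [TaskResult(False, True)] * 48
    + [TaskResult(False, False)] * 27
)
print(f"{percent_correct(results):.1f}%")   # 198 of 225 tasks solved
print(f"${cost_per_task(29.25):.2f}")       # assumed total cost of $29.25
```

With these assumed inputs, 198 of 225 solved tasks gives 88.0%, and a total of $29.25 gives $0.13 per task.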

Score vs. cost

Leaderboard

#   Model                                             Score   Cost per task (USD)

1   GPT-5 (High)                                      88%     $0.13
2   GPT-5 (Medium)                                    86.7%   $0.08
3   o3 Pro (High)                                     84.9%   $0.65
4   Gemini 2.5 Pro Preview 06-05 (32k thinking)       83.1%   $0.22
5   GPT-5 (Low)                                       81.3%   $0.05
6   o3 (High)                                         81.3%   $0.09
7   Gemini 2.5 Pro Preview 06-05 (Default thinking)   79.1%   $0.20
8   Gemini 2.5 Pro Preview 05-06                      76.9%   $0.17
9   o3 (Default)                                      76.9%   $0.06
10  DeepSeek V3.2 Exp (Reasoner)                      74.2%   $0.01
11  Claude Opus 4 (32k thinking)                      72%     $0.29
12  o4 Mini High                                      72%     $0.09
13  R1 0528                                           71.4%   $0.02
14  Claude Opus 4 (No thinking)                       70.7%   $0.30
15  DeepSeek V3.2 Exp (Chat)                          70.2%   $0.00
16  Claude Sonnet 4 (32k thinking)                    61.3%   $0.12
17  Kimi K2 0711                                      59.1%   $0.01
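
The score-versus-cost frontier mentioned earlier can be read off these rows directly. The sketch below is one simple way to do it, not the method used by the page: it keeps a configuration only if it scores strictly higher than every configuration that costs the same or less. The `ROWS` list is copied by hand from the leaderboard, the function name is hypothetical, and because the published costs are rounded to the cent, the result is approximate.

```python
# (label, percent correct, cost per task in USD), copied from the leaderboard.
ROWS = [
    ("GPT-5 (High)", 88.0, 0.13),
    ("GPT-5 (Medium)", 86.7, 0.08),
    ("o3 Pro (High)", 84.9, 0.65),
    ("Gemini 2.5 Pro Preview 06-05 (32k thinking)", 83.1, 0.22),
    ("GPT-5 (Low)", 81.3, 0.05),
    ("o3 (High)", 81.3, 0.09),
    ("Gemini 2.5 Pro Preview 06-05 (Default thinking)", 79.1, 0.20),
    ("Gemini 2.5 Pro Preview 05-06", 76.9, 0.17),
    ("o3 (Default)", 76.9, 0.06),
    ("DeepSeek V3.2 Exp (Reasoner)", 74.2, 0.01),
    ("Claude Opus 4 (32k thinking)", 72.0, 0.29),
    ("o4 Mini High", 72.0, 0.09),
    ("R1 0528", 71.4, 0.02),
    ("Claude Opus 4 (No thinking)", 70.7, 0.30),
    ("DeepSeek V3.2 Exp (Chat)", 70.2, 0.00),
    ("Claude Sonnet 4 (32k thinking)", 61.3, 0.12),
    ("Kimi K2 0711", 59.1, 0.01),
]


def score_cost_frontier(rows):
    """Return the rows that no cheaper-or-equal row matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; among equal costs, best score first.
    for label, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((label, score, cost))
            best_score = score
    return frontier


for label, score, cost in score_cost_frontier(ROWS):
    print(f"{label:<28} {score:5.1f}%  ${cost:.2f}")
```

On the rounded figures above, five configurations remain: DeepSeek V3.2 Exp (Chat), DeepSeek V3.2 Exp (Reasoner), GPT-5 (Low), GPT-5 (Medium) and GPT-5 (High). Every other row is matched or beaten on score by something that costs the same or less.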

Sources:

Raw leaderboard YAML (polyglot_leaderboard.yml)
Aider LLM Leaderboards
Polyglot benchmark announcement

Details:

Category: Coding
Created by: Aider
Models tested: 11
Configs tested: 17
Leader: GPT-5
Top score: 88%

Updated November 2025