Aider Polyglot
Coding
The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.
The board has not been refreshed since November 2025, so current frontier models (Claude Fable 5, Claude Opus 4.8, GPT-5.5) do not appear yet. It remains the reference for the prior generation.
Each model attempts the 225 hardest Exercism practice exercises spanning C++, Go, Java, JavaScript, Python and Rust, driving aider end to end. The model must emit its changes in a structured edit format (diff, whole-file, or architect mode), and each solution is checked by running the exercise's unit tests. One retry is allowed after the model sees the test failures, so percent correct is the share of tasks passing after that second attempt. Every run also publishes its total USD cost (shown here divided by 225 as cost per task), which makes the board a clean score-versus-cost frontier. All runs live as YAML in the aider GitHub repo, and community result PRs are accepted.
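The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not aider's own code: the `TaskResult` record, its field names, and the example numbers are all hypothetical, and only the 225-task count and the "pass after the second attempt" rule come from the description.

```python
from dataclasses import dataclass

TASK_COUNT = 225  # number of exercises in the benchmark, per the description


@dataclass
class TaskResult:
    # Hypothetical per-exercise record; field names are illustrative only.
    passed_first_try: bool
    passed_after_retry: bool


def percent_correct(results):
    """Share of tasks whose unit tests pass by the end of the second attempt."""
    solved = sum(1 for r in results if r.passed_first_try or r.passed_after_retry)
    return 100.0 * solved / len(results)


def cost_per_task(total_cost_usd, task_count=TASK_COUNT):
    """Total run cost divided evenly across all tasks."""
    return total_cost_usd / task_count


if __name__ == "__main__":
    # Made-up run: 150 first-try passes, 30 passes after the retry, 45 failures.
    results = (
        [TaskResult(True, False)] * 150
        + [TaskResult(False, True)] * 30
        + [TaskResult(False, False)] * 45
    )
    print(f"{percent_correct(results):.1f}% correct")
    print(f"${cost_per_task(22.50):.2f} per task")
```

A task counts once whether it passes on the first or the second attempt, so a model that needs the retry scores the same as one that does not; the cost column is where the extra attempt shows up.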
Score vs. cost
Leaderboard
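# | Model (setting) | Score | Cost per task
- 1 | GPT-5 (High) | 88% | $0.13
- 2 | GPT-5 (Medium) | 86.7% | $0.08
- 3 | o3 Pro (High) | 84.9% | $0.65
- 4 | Gemini 2.5 Pro Preview 06-05 (32k thinking) | 83.1% | $0.22
- 5 | GPT-5 (Low) | 81.3% | $0.05
- 6 | o3 (High) | 81.3% | $0.09
- 7 | Gemini 2.5 Pro Preview 06-05 (Default thinking) | 79.1% | $0.20
- 8 | [model name missing] | 76.9% | $0.17
- 9 | o3 (Default) | 76.9% | $0.06
- 10 | DeepSeek V3.2 Exp (Reasoner) | 74.2% | $0.01
- 11 | Claude Opus 4 (32k thinking) | 72% | $0.29
- 12 | [model name missing] | 72% | $0.09
- 13 | [model name missing] | 71.4% | $0.02
- 14 | Claude Opus 4 (No thinking) | 70.7% | $0.30
- 15 | [model name missing] | 70.2% | $0.00
- 16 | Claude Sonnet 4 (32k thinking) | 61.3% | $0.12
- 17 | [model name missing] | 59.1% | $0.01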
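One way to read the score-versus-cost frontier from the rows above is to keep only the entries that no cheaper-or-equal entry beats on score. The sketch below is an illustration under assumptions: rows are identified by rank because some model names are missing from the listing, the function name is hypothetical, and the costs are the rounded per-task figures shown, so ties and the $0.00 entry are approximate.

```python
# (rank, score in percent, cost per task in USD), copied from the leaderboard rows.
ROWS = [
    (1, 88.0, 0.13), (2, 86.7, 0.08), (3, 84.9, 0.65), (4, 83.1, 0.22),
    (5, 81.3, 0.05), (6, 81.3, 0.09), (7, 79.1, 0.20), (8, 76.9, 0.17),
    (9, 76.9, 0.06), (10, 74.2, 0.01), (11, 72.0, 0.29), (12, 72.0, 0.09),
    (13, 71.4, 0.02), (14, 70.7, 0.30), (15, 70.2, 0.00), (16, 61.3, 0.12),
    (17, 59.1, 0.01),
]


def pareto_frontier(rows):
    """Keep rows that strictly improve the best score seen at lower-or-equal cost."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    for rank, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((rank, score, cost))
            best_score = score
    return frontier


if __name__ == "__main__":
    for rank, score, cost in pareto_frontier(ROWS):
        print(f"#{rank}: {score}% at ${cost:.2f} per task")
```

With the figures as listed, the frontier consists of ranks 15, 10, 5, 2 and 1, in order of increasing cost; every other row is matched or beaten on score by something that costs no more.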
Sources: