Terminal-Bench 2.0

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.

A newer Terminal-Bench 2.1 board is live but still sparse (11 entries, including Claude Opus 4.8 with Claude Code at 78.9%). Scores are not comparable across versions, so this board stays on 2.0 until 2.1 fills out.

Terminal-Bench 2.0 contains 89 verified tasks, each run in a containerized terminal environment via the Harbor framework. An agent receives a task instruction and works inside the sandbox, and task-specific checks then score the run as pass or fail. The headline number is the percentage of tasks resolved, published with a confidence interval.

Every entry is an agent harness plus model pair, not a bare model. The same model scores differently depending on the scaffold wrapping it (Codex CLI, Claude Code, Droid, the reference Terminus 2 agent and others). On this board the gap between a model's best and worst listed harness ranges from about 1 point to almost 19: Gemini 3.1 Pro Preview scores 80.2% with TongAgents and 61.4% with Gemini CLI. The leaderboard chart shows each model's strongest harness, and the full table lists all 20 configurations with the harness named on every row. Submissions are verified by the Terminal-Bench team.
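To make the headline number concrete, here is a minimal sketch of turning pass/fail outcomes into a percentage with a margin. This page does not say how the leaderboard derives its confidence intervals, so the normal-approximation binomial interval below is an assumption. The function name `resolved_rate` and the sample outcomes are invented for illustration.

```python
import math

def resolved_rate(outcomes, z=1.96):
    """Return (percent resolved, margin in percentage points).

    `outcomes` is a list of booleans, one per scored run (True = pass).
    The margin is a normal-approximation binomial interval; this is an
    assumed method, not necessarily the one the leaderboard uses.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    p = sum(outcomes) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * margin

# Made-up example: one run over 89 tasks with 60 passes.
outcomes = [True] * 60 + [False] * 29
score, margin = resolved_rate(outcomes)
print(f"{score:.1f}% ± {margin:.1f}%")
```

Under this assumption a single pass over 89 tasks gives a margin of several points, so the narrower intervals on the board suggest more than one trial per task or a different interval method.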

Leaderboard

Best harness per model, score with confidence interval:

Model                    Harness        Score
GPT-5.5                  NexAU-AHE      84.7% ± 2.1%
Gemini 3.1 Pro Preview   TongAgents     80.2% ± 2.6%
Claude Opus 4.7          WOZCODE        80.2% ± 2.1%
GPT-5.3-Codex            SageAgent      78.4% ± 2.2%
Claude Opus 4.6          Meta-Harness   76.4% ± 2.4%
Gemini 3 Pro             Ante           69.4% ± 2.1%
GPT-5.2 Codex            Deep Agents    66.5% ± 3.1%
GPT-5.2                  Droid          64.9% ± 2.8%
Gemini 3 Flash           Junie CLI      64.3% ± 2.8%
Claude Opus 4.5          Droid          63.1% ± 2.7%
GPT-5.1-Codex-Mini       hookele        61.6% ± 1.9%
GPT-5.1-Codex-Max        Codex CLI      60.4% ± 2.7%

All results

#    Model                    Harness         Score   Provider
1    GPT-5.5                  NexAU-AHE       84.7%   OpenAI
2    GPT-5.5                  Capy            83.1%   OpenAI
3    GPT-5.5                  Codex CLI       82.2%   OpenAI
4    Gemini 3.1 Pro Preview   TongAgents      80.2%   Google DeepMind
5    Claude Opus 4.7          WOZCODE         80.2%   Anthropic
6    GPT-5.3-Codex            SageAgent       78.4%   OpenAI
7    GPT-5.3-Codex            Droid           77.3%   OpenAI
8    Claude Opus 4.6          Meta-Harness    76.4%   Anthropic
9    Claude Opus 4.6          Capy            75.3%   Anthropic
10   Gemini 3.1 Pro Preview   Terminus-KIRA   74.8%   Google DeepMind
11   Gemini 3 Pro             Ante            69.4%   Google DeepMind
12   GPT-5.2 Codex            Deep Agents     66.5%   OpenAI
13   Gemini 3 Pro             SageAgent       65.2%   Google DeepMind
14   GPT-5.2                  Droid           64.9%   OpenAI
15   Gemini 3 Flash           Junie CLI       64.3%   Google DeepMind
16   Claude Opus 4.5          Droid           63.1%   Anthropic
17   GPT-5.2                  Codex CLI       62.9%   OpenAI
18   GPT-5.1-Codex-Mini       hookele         61.6%   OpenAI
19   Gemini 3.1 Pro Preview   Gemini CLI      61.4%   Google DeepMind
20   GPT-5.1-Codex-Max        Codex CLI       60.4%   OpenAI
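Because every row is a harness plus model pair, two quick summaries help when reading the table: each model's strongest harness, and how far a model's score moves across harnesses. The sketch below computes both from the rows above. The helper names `best_per_model` and `harness_spread` are illustrative, not part of any Terminal-Bench tooling.

```python
from collections import defaultdict

# (model, harness, score) rows copied from the "All results" table.
ROWS = [
    ("GPT-5.5", "NexAU-AHE", 84.7),
    ("GPT-5.5", "Capy", 83.1),
    ("GPT-5.5", "Codex CLI", 82.2),
    ("Gemini 3.1 Pro Preview", "TongAgents", 80.2),
    ("Claude Opus 4.7", "WOZCODE", 80.2),
    ("GPT-5.3-Codex", "SageAgent", 78.4),
    ("GPT-5.3-Codex", "Droid", 77.3),
    ("Claude Opus 4.6", "Meta-Harness", 76.4),
    ("Claude Opus 4.6", "Capy", 75.3),
    ("Gemini 3.1 Pro Preview", "Terminus-KIRA", 74.8),
    ("Gemini 3 Pro", "Ante", 69.4),
    ("GPT-5.2 Codex", "Deep Agents", 66.5),
    ("Gemini 3 Pro", "SageAgent", 65.2),
    ("GPT-5.2", "Droid", 64.9),
    ("Gemini 3 Flash", "Junie CLI", 64.3),
    ("Claude Opus 4.5", "Droid", 63.1),
    ("GPT-5.2", "Codex CLI", 62.9),
    ("GPT-5.1-Codex-Mini", "hookele", 61.6),
    ("Gemini 3.1 Pro Preview", "Gemini CLI", 61.4),
    ("GPT-5.1-Codex-Max", "Codex CLI", 60.4),
]

def best_per_model(rows):
    """Map each model to its (harness, score) pair with the highest score."""
    best = {}
    for model, harness, score in rows:
        if model not in best or score > best[model][1]:
            best[model] = (harness, score)
    return best

def harness_spread(rows):
    """Map each model with two or more harnesses to max minus min score."""
    by_model = defaultdict(list)
    for model, _harness, score in rows:
        by_model[model].append(score)
    return {
        model: round(max(scores) - min(scores), 1)
        for model, scores in by_model.items()
        if len(scores) > 1
    }

best = best_per_model(ROWS)
spread = harness_spread(ROWS)
for model, points in sorted(spread.items(), key=lambda item: -item[1]):
    print(f"{model}: {points} points between best and worst listed harness")
```

The spread only covers harnesses that appear in this table, so it is a lower bound on how much a scaffold can move a given model.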

Sources:

Leaderboard data on Hugging Face
Terminal-Bench 2.0 leaderboard
Terminal-Bench 2.1 leaderboard
laude-institute/terminal-bench

Details:

Category: Coding
Created by: Laude Institute
Models tested: 12
Configs tested: 20
Leader: GPT-5.5
Top score: 84.7%

Updated June 2026