Terminal-Bench 2.0
Coding
Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.
Terminal-Bench 2.0 contains 89 verified tasks, each run in a containerized terminal environment via the Harbor framework. An agent receives a task instruction and works inside the sandbox; task-specific checks then score the run as pass or fail. The headline number is therefore the percentage of tasks resolved, published with a confidence interval. Every entry is an agent harness plus model pair, not a bare model: the same model scores differently depending on the scaffold wrapping it (Codex CLI, Claude Code, Droid, the reference Terminus 2 agent and others). Harness choice is worth anywhere from a few points to many: in the rows below, GPT-5.5 spans 82.2% to 84.7% across three harnesses, while Gemini 3.1 Pro Preview ranges from 61.4% (Gemini CLI) to 80.2% (TongAgents). We list each model's strongest harness rows, with the harness named on each row; submissions are verified by the Terminal-Bench team.
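As a rough illustration of the scoring described above, the sketch below turns a list of pass/fail outcomes into a percentage resolved. The page does not say how the published confidence intervals are computed, so the 95% normal-approximation interval here is an assumption made purely for illustration, and the function name `resolved_rate` and the sample outcomes are hypothetical.

```python
import math


def resolved_rate(outcomes):
    """Return (percent resolved, half-width of an interval in percentage points).

    `outcomes` is a sequence of booleans, one per scored run.
    ASSUMPTION: the interval is a 95% normal-approximation (Wald) interval
    over independent runs; the leaderboard's real method is not stated.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one scored run")
    p = sum(outcomes) / n  # fraction of runs that passed
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * half_width


# Hypothetical data: 75 of 89 tasks pass in a single attempt each.
outcomes = [True] * 75 + [False] * 14
rate, half = resolved_rate(outcomes)
print(f"{rate:.1f}% ± {half:.1f}%")
```

Under this assumption, one attempt per task over 89 tasks gives an interval several points wide on each side, wider than the ±2 to ±3 points published, so the published intervals presumably rest on a different procedure that this sketch does not reproduce.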
- GPT-5.5 | NexAU-AHE | 84.7% ± 2.1%
- Gemini 3.1 Pro Preview | TongAgents | 80.2% ± 2.6%
- Claude Opus 4.7 | WOZCODE | 80.2% ± 2.1%
- GPT-5.3-Codex | SageAgent | 78.4% ± 2.2%
- Claude Opus 4.6 | Meta-Harness | 76.4% ± 2.4%
- Gemini 3 Pro | Ante | 69.4% ± 2.1%
- GPT-5.2 Codex | Deep Agents | 66.5% ± 3.1%
- GPT-5.2 | Droid | 64.9% ± 2.8%
- Gemini 3 Flash | Junie CLI | 64.3% ± 2.8%
- Claude Opus 4.5 | Droid | 63.1% ± 2.7%
- GPT-5.1-Codex-Mini | hookele | 61.6% ± 1.9%
- GPT-5.1-Codex-Max | Codex CLI | 60.4% ± 2.7%
- 1 | GPT-5.5 | NexAU-AHE | 84.7% | OpenAI
- 2 | GPT-5.5 | Capy | 83.1% | OpenAI
- 3 | GPT-5.5 | Codex CLI | 82.2% | OpenAI
- 4 | Gemini 3.1 Pro Preview | TongAgents | 80.2% | Google DeepMind
- 5 | Claude Opus 4.7 | WOZCODE | 80.2% | Anthropic
- 6 | GPT-5.3-Codex | SageAgent | 78.4% | OpenAI
- 7 | GPT-5.3-Codex | Droid | 77.3% | OpenAI
- 8 | Claude Opus 4.6 | Meta-Harness | 76.4% | Anthropic
- 9 | Claude Opus 4.6 | Capy | 75.3% | Anthropic
- 10 | Gemini 3.1 Pro Preview | Terminus-KIRA | 74.8% | Google DeepMind
- 11 | Gemini 3 Pro | Ante | 69.4% | Google DeepMind
- 12 | GPT-5.2 Codex | Deep Agents | 66.5% | OpenAI
- 13 | Gemini 3 Pro | SageAgent | 65.2% | Google DeepMind
- 14 | GPT-5.2 | Droid | 64.9% | OpenAI
- 15 | Gemini 3 Flash | Junie CLI | 64.3% | Google DeepMind
- 16 | Claude Opus 4.5 | Droid | 63.1% | Anthropic
- 17 | GPT-5.2 | Codex CLI | 62.9% | OpenAI
- 18 | GPT-5.1-Codex-Mini | hookele | 61.6% | OpenAI
- 19 | Gemini 3.1 Pro Preview | Gemini CLI | 61.4% | Google DeepMind
- 20 | GPT-5.1-Codex-Max | Codex CLI | 60.4% | OpenAI
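The per-model list and the 20-row ranking above are related by a simple reduction: keep the highest-scoring harness for each model. A minimal sketch of that reduction follows, using the scores from the ranking; the tuple layout and the function name `strongest_rows` are illustrative choices, not anything the leaderboard publishes.

```python
# (model, harness, score in percent), copied from the 20-row ranking.
ROWS = [
    ("GPT-5.5", "NexAU-AHE", 84.7),
    ("GPT-5.5", "Capy", 83.1),
    ("GPT-5.5", "Codex CLI", 82.2),
    ("Gemini 3.1 Pro Preview", "TongAgents", 80.2),
    ("Claude Opus 4.7", "WOZCODE", 80.2),
    ("GPT-5.3-Codex", "SageAgent", 78.4),
    ("GPT-5.3-Codex", "Droid", 77.3),
    ("Claude Opus 4.6", "Meta-Harness", 76.4),
    ("Claude Opus 4.6", "Capy", 75.3),
    ("Gemini 3.1 Pro Preview", "Terminus-KIRA", 74.8),
    ("Gemini 3 Pro", "Ante", 69.4),
    ("GPT-5.2 Codex", "Deep Agents", 66.5),
    ("Gemini 3 Pro", "SageAgent", 65.2),
    ("GPT-5.2", "Droid", 64.9),
    ("Gemini 3 Flash", "Junie CLI", 64.3),
    ("Claude Opus 4.5", "Droid", 63.1),
    ("GPT-5.2", "Codex CLI", 62.9),
    ("GPT-5.1-Codex-Mini", "hookele", 61.6),
    ("Gemini 3.1 Pro Preview", "Gemini CLI", 61.4),
    ("GPT-5.1-Codex-Max", "Codex CLI", 60.4),
]


def strongest_rows(rows):
    """Map each model to its (harness, score) pair with the highest score."""
    best = {}
    for model, harness, score in rows:
        if model not in best or score > best[model][1]:
            best[model] = (harness, score)
    return best


best = strongest_rows(ROWS)
# Print models from highest to lowest best score.
for model, (harness, score) in sorted(best.items(), key=lambda kv: -kv[1][1]):
    print(f"{model} | {harness} | {score}%")
```

Run on these rows, the reduction leaves 12 models from 20 configurations, matching the "Models tested" and "Configs tested" counts on the page.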
Category: Coding
Created by: Laude Institute
Models tested: 12
Configs tested: 20
Leader: GPT-5.5
Top score: 84.7%
Updated June 2026