
Terminal-Bench 2.0

Coding

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.

A newer Terminal-Bench 2.1 board is live but still sparse (11 entries, including Claude Opus 4.8 with Claude Code at 78.9). Scores are not comparable across versions, so this board stays on 2.0 until 2.1 fills out.

Terminal-Bench 2.0 contains 89 verified tasks, each run in a containerized terminal environment via the Harbor framework. An agent receives a task instruction and works inside the sandbox; task-specific checks then score the run as pass or fail. The headline number is therefore the percentage of tasks resolved, published with a confidence interval. Every entry is an agent harness paired with a model, not a bare model: the same model scores differently depending on the scaffold wrapping it (Codex CLI, Claude Code, Droid, the reference Terminus 2 agent and others), and harness quality alone is worth several points. For each model we list its strongest harness rows, with the harness named on each row. Submissions are verified by the Terminal-Bench team.
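As a rough illustration of how a headline number of this kind can be derived, the sketch below turns a list of per-task pass/fail outcomes into a resolved percentage with an interval around it. The Wilson score method, the 95% level, the function names and the 60-of-89 example run are all assumptions made for illustration; the page does not say how Terminal-Bench computes its published interval.

```python
import math


def resolved_rate(results):
    """Fraction of tasks resolved, given one pass/fail boolean per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate.

    Assumption: the benchmark's actual interval method is not stated in the
    source; Wilson with z=1.96 (about 95%) is used here only as an example.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical run: 60 of the 89 tasks pass their checks.
results = [True] * 60 + [False] * 29
rate = resolved_rate(results)
low, high = wilson_interval(sum(results), len(results))
print(f"{100 * rate:.1f}% resolved (interval {100 * low:.1f}% to {100 * high:.1f}%)")
```

In this sketch each task contributes a single pass/fail outcome. A leaderboard that averages several attempts per task would need a different interval calculation.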

Leaderboard

Updated June 2026