OSWorld-Verified

Agentic

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM using only screenshots, mouse and keyboard, and are scored by execution-based checks. Higher is better.

OSWorld defines 369 real desktop tasks spanning Chrome, GIMP, LibreOffice, VLC, VS Code, Thunderbird, OS settings and multi-app workflows. Agents observe raw screenshots and act with mouse and keyboard. Scoring is execution-based: per-task scripts verify the final system state rather than judging the agent's text output. The Verified revision fixed about 300 task and infrastructure issues, and the team reruns submissions under unified settings at fixed step budgets (15, 50 and 100 steps), since the same agent can gain 5 to 10 points with a larger budget. Human performance is estimated at around 72%, a bar the best 2026 entries exceed. Official rerun rows and vendor self-reported rows are both listed, with the scaffold or harness named on each row.
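To make the idea of execution-based scoring concrete, here is a minimal sketch in Python. It is not the OSWorld evaluation code: the `Task` class, the dictionary used as a stand-in for the VM's final state, and the two example tasks are all hypothetical, and it assumes a simple pass/fail outcome per task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the VM's final state: file path -> file contents.
State = Dict[str, str]


@dataclass
class Task:
    task_id: str
    # The checker inspects the final state; it never reads the agent's text.
    checker: Callable[[State], bool]


def score(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Return the percentage of tasks whose checker passes on the final state."""
    passed = sum(1 for t in tasks if t.checker(final_states.get(t.task_id, {})))
    return 100.0 * passed / len(tasks)


# Two illustrative tasks (assumptions, not real OSWorld tasks).
tasks = [
    Task("rename-file", lambda s: "report_final.txt" in s and "report.txt" not in s),
    Task("set-title", lambda s: s.get("doc.txt", "").startswith("Title:")),
]

# Final states left behind by a hypothetical agent run.
final_states = {
    "rename-file": {"report_final.txt": "q3 numbers"},
    "set-title": {"doc.txt": "untitled draft"},
}

print(score(tasks, final_states))  # 50.0
```

The point of the design is that an agent claiming success in its output earns nothing unless the system state actually matches what the checker expects.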

Leaderboard

| # | Model | Scaffold or harness | Score | Provider |
|---|-------|---------------------|-------|----------|
| 1 | Claude Fable 5 | Vendor harness | 85% | Anthropic |
| 2 | Claude Opus 4.7 | Pointer Agent, 100 steps | 83.64% | Anthropic |
| 3 | Claude Opus 4.8 | Vendor harness | 83.4% | Anthropic |
| 4 | Claude Opus 4.7 | Vendor harness, updated methodology | 82.3% | Anthropic |
| 5 | Claude Sonnet 4.6 | Pointer Agent, 100 steps | 81.45% | Anthropic |
| 6 | GPT-5.5 | Vendor harness | 78.7% | OpenAI |
| 7 | Gemini 3.5 Flash | Vendor harness | 78.4% | Google DeepMind |
| 8 | Gemini 3.1 Pro Preview | OpenAPA, 100 steps | 78.34% | Google DeepMind |
| 9 | MiniMax-M3 | 100 steps | 75.19% | MiniMax |
| 10 | GPT-5.4 | Vendor harness | 75% | OpenAI |
| 11 | Kimi K2.6 | 100 steps | 73.06% | Moonshot AI |
| 12 | Claude Sonnet 4.6 | Native, 100 steps | 72.11% | Anthropic |
| 13 | Claude Sonnet 4.5 | Native, 100 steps | 62.88% | Anthropic |
| 14 | UI-TARS 7B | 100 steps | 29.6% | ByteDance |
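As a rough reading aid, the sketch below counts how many listed configurations exceed the human estimate of about 72% and measures how far one model's score moves between two scaffolds. The scores are copied from the leaderboard rows above. The variable names and the choice of 72.0 as a hard threshold are assumptions made for illustration, and vendor self-reported rows are not strictly comparable with official reruns.

```python
HUMAN_BASELINE = 72.0  # approximate human estimate; treated as a hard cutoff here

# (model, scaffold or harness, score in %) as listed on the leaderboard.
rows = [
    ("Claude Fable 5", "Vendor harness", 85.0),
    ("Claude Opus 4.7", "Pointer Agent, 100 steps", 83.64),
    ("Claude Opus 4.8", "Vendor harness", 83.4),
    ("Claude Opus 4.7", "Vendor harness, updated methodology", 82.3),
    ("Claude Sonnet 4.6", "Pointer Agent, 100 steps", 81.45),
    ("GPT-5.5", "Vendor harness", 78.7),
    ("Gemini 3.5 Flash", "Vendor harness", 78.4),
    ("Gemini 3.1 Pro Preview", "OpenAPA, 100 steps", 78.34),
    ("MiniMax-M3", "100 steps", 75.19),
    ("GPT-5.4", "Vendor harness", 75.0),
    ("Kimi K2.6", "100 steps", 73.06),
    ("Claude Sonnet 4.6", "Native, 100 steps", 72.11),
    ("Claude Sonnet 4.5", "Native, 100 steps", 62.88),
    ("UI-TARS 7B", "100 steps", 29.6),
]

# Configurations scoring above the human estimate.
above = [r for r in rows if r[2] > HUMAN_BASELINE]
print(f"{len(above)} of {len(rows)} configurations exceed the human estimate")
# prints "12 of 14 configurations exceed the human estimate"

# Same model, same step budget, different scaffold.
scores = {(model, setup): s for model, setup, s in rows}
gap = round(
    scores[("Claude Sonnet 4.6", "Pointer Agent, 100 steps")]
    - scores[("Claude Sonnet 4.6", "Native, 100 steps")],
    2,
)
print(f"Scaffold gap for Claude Sonnet 4.6: {gap} points")  # 9.34 points
```

The gap of 9.34 points for a single model shows why each row names its scaffold or harness: the setup can matter as much as the model.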

Sources:

OSWorld-Verified results file (xlsx)
OSWorld official leaderboard
Introducing OSWorld-Verified
OSWorld paper (arXiv 2404.07972)
xlang-ai/OSWorld

Details:

Category: Agentic
Created by: XLANG Lab
Models tested: 12
Configs tested: 14
Leader: Claude Fable 5
Top score: 85%

Updated June 2026