OSWorld-Verified
AgenticThe standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.
OSWorld defines 369 real desktop tasks spanning Chrome, GIMP, LibreOffice, VLC, VS Code, Thunderbird, OS settings and multi-app workflows. Agents observe raw screenshots and act with mouse and keyboard, and scoring is execution based: per-task scripts verify the final system state rather than judging text. The Verified revision fixed about 300 task and infrastructure issues, and the team reruns submissions under unified settings at fixed step budgets (15, 50 and 100 steps), since the same agent can gain 5 to 10 points with a larger budget. Human performance is estimated around 72%, a bar the best 2026 entries exceed. Official rerun rows and vendor self-reported rows are both listed, with the scaffold or harness named on each row.
- 1Claude Fable 5Vendor harness85%Anthropic
- 2Claude Opus 4.7Pointer Agent, 100 steps83.64%Anthropic
- 3Claude Opus 4.8Vendor harness83.4%Anthropic
- 4Claude Opus 4.7Vendor harness, updated methodology82.3%Anthropic
- 5Claude Sonnet 4.6Pointer Agent, 100 steps81.45%Anthropic
- 6GPT-5.5Vendor harness78.7%OpenAI
- 7Gemini 3.5 FlashVendor harness78.4%Google DeepMind
- 8Gemini 3.1 Pro PreviewOpenAPA, 100 steps78.34%Google DeepMind
- 9MiniMax-M3100 steps75.19%MiniMax
- 10GPT-5.4Vendor harness75%OpenAI
- 11Kimi K2.6100 steps73.06%Moonshot AI
- 12Claude Sonnet 4.6Native, 100 steps72.11%Anthropic
- 13Claude Sonnet 4.5Native, 100 steps62.88%Anthropic
- 1429.6%ByteDance
Category
Agentic
Created by
XLANG LabModels tested
12Configs tested
14Leader
Claude Fable 5Top score
85%
Updated June 2026