OSWorld-Verified

Agentic

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM using screenshots, mouse and keyboard, and are scored by execution-based checks. Higher is better.

OSWorld defines 369 real desktop tasks spanning Chrome, GIMP, LibreOffice, VLC, VS Code, Thunderbird, OS settings and multi-app workflows. Agents observe raw screenshots and act with mouse and keyboard. Scoring is execution-based: per-task scripts verify the final system state rather than judging text output. The Verified revision fixed about 300 task and infrastructure issues, and the team reruns submissions under unified settings at fixed step budgets (15, 50 and 100 steps), since the same agent can gain 5 to 10 points with a larger budget. Human performance is estimated at around 72%, a bar the best 2026 entries exceed. Official rerun rows and vendor self-reported rows are both listed, with the scaffold or harness named on each row.
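
To make the scoring idea concrete, here is a minimal sketch of execution-based checking and per-budget aggregation. It is an illustration only, not OSWorld's actual harness: the `Run` type, the `check_wallpaper_set` checker, the task ids, and the state dictionary are all hypothetical names assumed for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Hypothetical types for illustration; not OSWorld's real API.
FinalState = Dict[str, str]
Checker = Callable[[FinalState], bool]


@dataclass(frozen=True)
class Run:
    task_id: str
    step_budget: int          # e.g. 15, 50 or 100
    final_state: FinalState   # snapshot of the system after the agent stops


def check_wallpaper_set(state: FinalState) -> bool:
    """Pass only if the resulting state is correct, whatever the agent said."""
    return state.get("wallpaper") == "/home/user/Pictures/target.png"


def success_rate_by_budget(
    runs: Iterable[Run], checkers: Dict[str, Checker]
) -> Dict[int, float]:
    """Return the percentage of passing runs for each step budget."""
    passed: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for run in runs:
        total[run.step_budget] += 1
        if checkers[run.task_id](run.final_state):
            passed[run.step_budget] += 1
    return {b: 100.0 * passed[b] / total[b] for b in sorted(total)}


# Made-up demo data: the same two tasks attempted at two budgets.
checkers = {"set_wallpaper_a": check_wallpaper_set,
            "set_wallpaper_b": check_wallpaper_set}
done = {"wallpaper": "/home/user/Pictures/target.png"}
not_done = {"wallpaper": "/usr/share/backgrounds/default.png"}
demo_runs = [
    Run("set_wallpaper_a", 15, done),
    Run("set_wallpaper_b", 15, not_done),
    Run("set_wallpaper_a", 50, done),
    Run("set_wallpaper_b", 50, done),
]
rates = success_rate_by_budget(demo_runs, checkers)
```

Grouping by step budget keeps scores comparable: a result at 100 steps is only set against other 100-step results, which matters when a larger budget alone can shift a score by several points.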

Leaderboard

Updated June 2026