OpenAI

BrowseComp

Web

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.

BrowseComp questions are built with an inverted design: trainers start from a known fact and write a question whose answer does not appear in first-page search results and that another person cannot solve within ten minutes. Answers are short and are graded by a model that checks semantic equivalence against the reference. The benchmark is essentially unsolvable without tools and remains hard with them: at launch GPT-4o scored 1.9% with browsing, and human trainers solved only 29.2% of questions within a two-hour limit. All rows here therefore report single-agent configurations with browsing and tools enabled. Multi-agent harnesses score higher and are excluded to keep rows comparable. There is no single official leaderboard, so scores come from vendor model cards compiled by aggregators.
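As a rough illustration of how a score on this page could be derived from graded answers, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Item` record, the `accuracy` helper, and especially `normalized_match`, which is a simple string comparison standing in for the model-based grader described above so that the example runs offline. It is not the benchmark's actual grading code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Item:
    """One graded question (hypothetical record layout)."""
    question: str
    reference: str  # short reference answer
    response: str   # the agent's final answer


def normalized_match(response: str, reference: str) -> bool:
    """Placeholder judge: case- and whitespace-insensitive equality.

    Assumption: this stands in for a model that judges semantic
    equivalence. A real grader would also accept paraphrases.
    """
    def norm(text: str) -> str:
        return " ".join(text.lower().split())

    return norm(response) == norm(reference)


def accuracy(
    items: Iterable[Item],
    judge: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Return the percentage of items the judge marks correct."""
    items = list(items)
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if judge(item.response, item.reference))
    return 100.0 * correct / len(items)


if __name__ == "__main__":
    # Toy data, invented for this sketch; not benchmark questions.
    toy = [
        Item("q1", "Paris", " paris "),
        Item("q2", "42", "41"),
    ]
    print(f"{accuracy(toy):.1f}%")
```

Passing the judge in as a parameter keeps the scoring loop independent of how equivalence is decided, so a model-backed judge could be swapped in without changing `accuracy`.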

Leaderboard

Updated June 2026