BrowseComp
Web
OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
BrowseComp questions are built with an inverted design: trainers start from a known fact and write a question whose answer does not appear in first-page search results and that another person cannot solve within ten minutes. Answers are short and are graded by a model that checks semantic equivalence against the reference answer. The benchmark is essentially unsolvable without tools and hard even with them: at launch, GPT-4o scored only 1.9% with browsing, and human trainers solved only 29.2% of questions within a two-hour limit. All rows here therefore report results with browsing and tools enabled, in single-agent configurations only. Multi-agent harnesses score higher and are excluded to keep rows comparable. There is no single official leaderboard, so scores come from vendor model cards compiled by aggregators.
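The grading step described above reduces to a percentage of questions judged correct. The following is a minimal sketch of that arithmetic, not the benchmark's actual grader: the `Judge` signature, `normalized_match`, `accuracy`, and the sample triples are all hypothetical names and data introduced for illustration, and the string-matching judge is only an offline stand-in for the model-based equivalence check.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (question, reference, prediction) -> bool.
Judge = Callable[[str, str, str], bool]


def normalized_match(question: str, reference: str, prediction: str) -> bool:
    """Stand-in judge (an assumption): case- and whitespace-insensitive match.

    The benchmark uses a model to decide semantic equivalence; this
    placeholder only keeps the sketch runnable offline.
    """
    def norm(text: str) -> str:
        return " ".join(text.lower().split())

    return norm(reference) == norm(prediction)


def accuracy(items: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Return the percentage of triples the judge marks as correct."""
    triples = list(items)
    if not triples:
        raise ValueError("no items to grade")
    correct = sum(1 for q, ref, pred in triples if judge(q, ref, pred))
    return 100.0 * correct / len(triples)


if __name__ == "__main__":
    # Made-up triples for illustration only; not benchmark questions.
    sample = [
        ("q1", "Ada Lovelace", "ada  lovelace"),
        ("q2", "1969", "1969"),
        ("q3", "Lake Baikal", "Lake Superior"),
    ]
    print(f"{accuracy(sample, normalized_match):.1f}%")  # prints 66.7%
```

Passing the judge in as a callable keeps the accuracy calculation independent of how equivalence is decided, so a model-backed judge could replace the string matcher without changing the scoring code.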
- 1. GPT-5.5 Pro | Browsing, parallel compute | 90.1% | OpenAI
- 2. GPT-5.4 Pro | Browsing | 89.3% | OpenAI
- 3. Claude Fable 5 | Single agent, web search | 86.9% | Anthropic
- 4. Gemini 3.1 Pro Preview | Search + Python + Browse | 85.9% | Google DeepMind
- 5. GPT-5.5 | Browsing | 84.4% | OpenAI
- 6. Claude Opus 4.8 | Single agent, web search | 84.3% | Anthropic
- 7. Claude Opus 4.6 | Max thinking, tools | 84% | Anthropic
- 8. MiniMax-M3 | Browsing | 83.5% | MiniMax
- 9. DeepSeek V4 | Pro, max thinking, browsing | 83.4% | DeepSeek
- 10. Kimi K2.6 | Single agent, tools | 83.2% | Moonshot AI
- 11. GPT-5.4 | Browsing | 82.7% | OpenAI
- 12. Claude Opus 4.7 | Adaptive thinking, web search | 79.3% | Anthropic
- 13. GLM 5.1 | Browsing | 79.3% | Z.AI
- 14. GPT-5.2 Pro | Browsing | 77.9% | OpenAI
- 15. GLM 5 | Browsing | 75.9% | Z.AI
- 16. Claude Sonnet 4.6 | Max thinking, tools | 74.7% | Anthropic
- 17. GPT-5.2 | xHigh, tools | 65.8% | OpenAI
- 18. Kimi K2 Thinking | Tools | 60.2% | Moonshot AI
- 19. Gemini 3 Pro | Search + Python + Browse | 59.2% | Google DeepMind
Category: Web
Created by: OpenAI
Models tested: 19
Leader: GPT-5.5 Pro
Top score: 90.1%
Updated June 2026