GAIA2
Agentic
Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.
Agents operate in the mobile universe of Meta's Agents Research Environments (ARE), which has 11 working apps (including email, calendar, contacts, messaging, shopping, cab and file system). Time flows asynchronously there, and events fire independently of the agent. The benchmark has 800 human-authored scenarios split across five capabilities: Execution, Search, Adaptability, Time and Ambiguity. Unlike the original GAIA, which scored exact-match answers, GAIA2 uses an automated verifier that checks the agent's write actions against an oracle event graph (0.98 agreement with human labels), so success means producing the right state changes in the right order and on time. Every model runs the same ReAct baseline scaffold, with three runs per scenario, which keeps scores attributable to the model rather than to a custom harness. 2026 leaderboard rows average the five core splits; paper-baseline rows average seven splits (adding Noise and Agent2Agent) and read a few points lower.
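To make "the right state changes in the right order and on time" concrete, here is a toy sketch of that kind of check. It is not ARE's verifier: the `WriteAction` and `OracleAction` types, the `deadline` and `parents` fields, and the greedy matching are all assumptions made for illustration, and argument matching is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteAction:
    """One state-changing call made by the agent (hypothetical shape)."""
    app: str
    name: str
    time: float  # seconds since scenario start


@dataclass(frozen=True)
class OracleAction:
    """One node of a toy oracle graph (hypothetical shape)."""
    app: str
    name: str
    deadline: float                # latest acceptable time
    parents: tuple[int, ...] = ()  # indices of oracle actions that must come first


def verify(agent_log: list[WriteAction], oracle: list[OracleAction]) -> bool:
    """Return True if every oracle action is matched by a distinct agent
    action that happens after all of its parents and before its deadline.

    Assumes `oracle` is listed in topological order (parents first).
    """
    matched_time: dict[int, float] = {}
    used: set[int] = set()
    for i, want in enumerate(oracle):
        # An action may not precede any of the actions it depends on.
        earliest = max((matched_time[p] for p in want.parents), default=0.0)
        for j, got in enumerate(agent_log):
            if j in used:
                continue
            same_call = (got.app, got.name) == (want.app, want.name)
            if same_call and earliest <= got.time <= want.deadline:
                matched_time[i] = got.time
                used.add(j)
                break
        else:
            return False  # no agent action satisfies this oracle node
    return True


# Made-up two-step scenario: create an event, then email about it.
oracle = [
    OracleAction("calendar", "create_event", deadline=60.0),
    OracleAction("email", "send", deadline=120.0, parents=(0,)),
]
on_time = [WriteAction("calendar", "create_event", 10.0), WriteAction("email", "send", 30.0)]
wrong_order = [WriteAction("email", "send", 5.0), WriteAction("calendar", "create_event", 10.0)]
too_late = [WriteAction("calendar", "create_event", 90.0), WriteAction("email", "send", 100.0)]
```

In this sketch only `on_time` passes: `wrong_order` sends the email before its parent action, and `too_late` misses the first deadline.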
- 1. Claude Opus 4.6 (High, ReAct baseline): 57%, Anthropic
- 2. GPT-5.5 (xHigh, ReAct baseline): 56.4%, OpenAI
- 3. GPT-5.4 (High, ReAct baseline): 55.6%, OpenAI
- 4. Gemini 3.1 Pro Preview (High, ReAct baseline): 52%, Google DeepMind
- 5. Claude Sonnet 4.6 (High, ReAct baseline): 51.9%, Anthropic
- 6. GLM 5.1 (Thinking, ReAct baseline): 50.5%, Z.AI
- 7. GPT-5 (High, paper baseline): 42.1%, OpenAI
- 8. Claude Sonnet 4 (Thinking, paper baseline): 37.8%, Anthropic
- 9. Claude Sonnet 4 (Paper baseline): 34.8%, Anthropic
- 10. GPT-5 (Low, paper baseline): 34.6%, OpenAI
- 11. Gemini 2.5 Pro (Paper baseline): 25.8%, Google DeepMind
- 12. gpt-oss-120b (High, paper baseline): 13.7%, OpenAI
- 13. Llama 4 Maverick (Paper baseline): 7.4%, Meta
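The rows above mix two averaging conventions: five core splits for 2026 rows and seven splits for paper-baseline rows. The sketch below shows how such a headline number could be assembled. It assumes each scenario contributes the mean of its three pass/fail runs and that splits are weighted equally; the function names and the per-split scores in the example are made up for illustration, not taken from the leaderboard.

```python
from statistics import mean

CORE_SPLITS = ("Execution", "Search", "Adaptability", "Time", "Ambiguity")
EXTRA_SPLITS = ("Noise", "Agent2Agent")  # included only in paper-baseline rows


def split_score(runs_per_scenario: list[list[int]]) -> float:
    """Percent pass rate for one split.

    Each inner list holds the 0/1 outcomes of the runs for one scenario
    (three runs per scenario on this leaderboard).
    """
    return 100 * mean(mean(runs) for runs in runs_per_scenario)


def overall(split_scores: dict[str, float], splits: tuple[str, ...]) -> float:
    """Unweighted mean over the chosen splits (assumed equal weighting)."""
    return mean(split_scores[s] for s in splits)


# Illustrative, made-up per-split scores for one hypothetical model.
scores = {
    "Execution": 60, "Search": 70, "Adaptability": 40, "Time": 30,
    "Ambiguity": 50, "Noise": 20, "Agent2Agent": 10,
}
five_split = overall(scores, CORE_SPLITS)
seven_split = overall(scores, CORE_SPLITS + EXTRA_SPLITS)
```

When the two extra splits score below the core average, as in this made-up example, the seven-split figure comes out lower than the five-split one. That is why the two kinds of row are not directly comparable.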
Category: Agentic
Created by: Meta
Models tested: 11
Configs tested: 13
Leader: Claude Opus 4.6
Top score: 57%
Updated May 2026