GAIA2

Name: GAIA2 leaderboard
Creator: Meta

Agentic

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

Agents operate in the mobile universe of Meta's Agents Research Environments (ARE), which has 11 working apps (including email, calendar, contacts, messaging, shopping, cab and file system) where time flows asynchronously and events fire independently of the agent. The benchmark has 800 human-authored scenarios split across five capabilities: Execution, Search, Adaptability, Time and Ambiguity. Unlike the original GAIA, which scored exact-match answers, GAIA2 uses an automated verifier that checks the agent's write actions against an oracle event graph; the verifier reaches 0.98 agreement with human labels. Success therefore means producing the right state changes in the right order and on time. Every model runs the same ReAct baseline scaffold, with three runs per scenario, which keeps scores attributable to the model rather than to a custom harness. 2026 leaderboard rows average the five core splits; paper-baseline rows average seven splits (adding Noise and Agent2Agent) and read a few points lower.
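The difference between five-split and seven-split averages can be illustrated with a minimal sketch. It assumes that a split score is the mean success rate over scenarios, with each scenario averaged over its runs, and that the headline number is an unweighted mean of split scores; the page does not spell out the exact aggregation, and the function names and all numbers below are hypothetical, not real leaderboard data.

```python
from statistics import mean

# Split names taken from the description above.
CORE_SPLITS = ["Execution", "Search", "Adaptability", "Time", "Ambiguity"]
EXTRA_SPLITS = ["Noise", "Agent2Agent"]


def split_score(run_outcomes):
    """Percent success for one split (assumed aggregation, not confirmed).

    run_outcomes holds one list per scenario, each containing one
    boolean per run (True if the verifier accepted the run).
    """
    per_scenario = [mean(1.0 if ok else 0.0 for ok in runs) for runs in run_outcomes]
    return 100.0 * mean(per_scenario)


def overall(split_scores, splits):
    """Unweighted mean of the chosen split scores (assumed macro-average)."""
    return mean(split_scores[name] for name in splits)


# Hypothetical per-split scores for a single model configuration.
scores = {
    "Execution": 60.0,
    "Search": 70.0,
    "Adaptability": 40.0,
    "Time": 20.0,
    "Ambiguity": 30.0,
    "Noise": 35.0,
    "Agent2Agent": 25.0,
}

five_split = overall(scores, CORE_SPLITS)                 # leaderboard-style average
seven_split = overall(scores, CORE_SPLITS + EXTRA_SPLITS)  # paper-style average

# Two scenarios, three runs each: one always passes, one never does.
toy_split = split_score([[True, True, True], [False, False, False]])

print(five_split, seven_split, toy_split)
```

With these made-up numbers the seven-split average comes out lower than the five-split one because the two extra splits score below the core mean, which is why rows computed the two ways should not be compared directly.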

Leaderboard

#   Model                   Configuration             Score   Provider
1   Claude Opus 4.6         High, ReAct baseline      57%     Anthropic
2   GPT-5.5                 xHigh, ReAct baseline     56.4%   OpenAI
3   GPT-5.4                 High, ReAct baseline      55.6%   OpenAI
4   Gemini 3.1 Pro Preview  High, ReAct baseline      52%     Google DeepMind
5   Claude Sonnet 4.6       High, ReAct baseline      51.9%   Anthropic
6   GLM 5.1                 Thinking, ReAct baseline  50.5%   Z.AI
7   GPT-5                   High, paper baseline      42.1%   OpenAI
8   Claude Sonnet 4         Thinking, paper baseline  37.8%   Anthropic
9   Claude Sonnet 4         Paper baseline            34.8%   Anthropic
10  GPT-5                   Low, paper baseline       34.6%   OpenAI
11  Gemini 2.5 Pro          Paper baseline            25.8%   Google DeepMind
12  gpt-oss-120b            High, paper baseline      13.7%   OpenAI
13  Llama 4 Maverick        Paper baseline            7.4%    Meta

Sources:

Gaia2 leaderboard (Hugging Face)
gaia2 dataset
Gaia2 paper (ICLR 2026)
facebookresearch/meta-agents-research-environments

Details:

Category: Agentic
Created by: Meta
Models tested: 11
Configs tested: 13
Leader: Claude Opus 4.6
Top score: 57%

Updated May 2026