Vending-Bench 2

Name: Vending-Bench 2 leaderboard
Creator: Andon Labs

Agentic

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.

Andon Labs has not yet run Claude Fable 5, Claude Opus 4.8 or GPT-5.5 on Vending-Bench 2; the newest Anthropic entry is Claude Opus 4.6.

Each model starts with 500 dollars and operates a vending machine business for 365 simulated days while paying a 2 dollar daily location fee. The agent emails and negotiates with suppliers (some adversarial), places and tracks orders, sets prices, manages inventory and handles complaints, delays and refunds. Runs are extremely long-horizon (the original paper reports over 20 million tokens per run), so the benchmark primarily tests whether a model stays coherent and keeps using tools effectively without drifting off task. The score is the final bank balance in dollars, averaged across 5 runs per model. Variance is high: every model has runs that derail through forgotten orders, misread schedules or unproductive loops. Andon Labs estimates a good human strategy would earn roughly 63,000 dollars per year, so even the top score on the leaderboard, about 8,000 dollars, is roughly an eighth of that estimate.
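The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Andon Labs' harness: the function names are hypothetical, the per-run balances in the usage example are invented placeholders, and comparing a final balance directly against the human earnings estimate is an assumption that follows the comparison made in the text.

```python
from statistics import mean

# Figures taken from the benchmark description.
STARTING_BALANCE = 500.0   # dollars
DAILY_FEE = 2.0            # dollars per simulated day
DAYS = 365                 # length of one run in simulated days
HUMAN_ESTIMATE = 63_000.0  # rough estimate for a good human strategy

def total_location_fees(days=DAYS, daily_fee=DAILY_FEE):
    """Total location fees paid over a full run (hypothetical helper)."""
    return days * daily_fee

def benchmark_score(final_balances):
    """Score = mean final bank balance across a model's runs."""
    return mean(final_balances)

def fraction_of_human(score, human=HUMAN_ESTIMATE):
    """Score as a fraction of the human estimate (assumes the two are comparable)."""
    return score / human

if __name__ == "__main__":
    # Fees alone exceed the starting balance, so the agent must turn a profit.
    print(total_location_fees())             # 730.0
    print(round(fraction_of_human(8017.59), 3))  # 0.127

    # Invented per-run balances, only to show how high variance feeds the mean.
    example_runs = [12000.0, 9500.0, 400.0, 11000.0, 7100.0]
    print(benchmark_score(example_runs))
```

Because the score is a plain mean over only 5 runs, a single derailed run (like the 400 dollar placeholder above) pulls the reported number down noticeably, which is one reason small gaps between adjacent leaderboard entries should be read with caution.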

Leaderboard

#   Model                                   Score (USD)  Provider
1   Claude Opus 4.6                         8017.59      Anthropic
2   Claude Sonnet 4.6                       7204.14      Anthropic
3   GLM 5.1                                 5634.41      Z.AI
4   Gemini 3 Pro                            5478.16      Google DeepMind
5   Claude Opus 4.5                         4967.06      Anthropic
6   GLM 5                                   4432.12      Z.AI
7   Claude Sonnet 4.5                       3838.74      Anthropic
8   Gemini 3.1 Pro Preview (custom tools)   3774.25      Google DeepMind
9   Gemini 3 Flash                          3634.72      Google DeepMind
10  GPT-5.2                                 3591.33      OpenAI
11  GLM 4.7                                 2376.82      Z.AI

Sources:

Vending-Bench 2 (Andon Labs)
Vending-Bench v1 (legacy)
Vending-Bench paper (arXiv 2502.15840)

Details:

Category: Agentic
Created by: Andon Labs
Models tested: 11
Leader: Claude Opus 4.6
Top score: 8017.59 USD

Updated June 2026