Vending-Bench 2
Agentic
Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.
Each model starts with 500 dollars and operates a vending machine business for 365 simulated days while paying a 2-dollar daily location fee. The agent emails and negotiates with suppliers (some of them adversarial), places and tracks orders, sets prices, manages inventory, and handles complaints, delays and refunds. Runs are extremely long-horizon (the original paper reports over 20 million tokens per run), so the benchmark primarily tests whether a model stays coherent and keeps using tools effectively without drifting off task. The score is the final bank balance in dollars, averaged across 5 runs per model. Variance is high: every model has runs that derail through forgotten orders, misread schedules or unproductive loops. Andon Labs estimates that a good human strategy would earn roughly 63,000 dollars per year, so even the top score of 8,017.59 dollars is only about 13 percent of that estimate.
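The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Andon Labs' scoring code: the function names and the five run balances are hypothetical, and only the constants (500 dollar start, 2 dollar daily fee, 365 days, 63,000 dollar human estimate) come from the description.

```python
from statistics import mean

# Constants taken from the benchmark description.
STARTING_BALANCE = 500.0    # USD at the start of each run
DAILY_FEE = 2.0             # USD location fee per simulated day
DAYS = 365                  # simulated days per run
HUMAN_ESTIMATE = 63_000.0   # Andon Labs' rough estimate for a good human strategy


def total_location_fees(days: int = DAYS, daily_fee: float = DAILY_FEE) -> float:
    """Location fees paid over a full run (hypothetical helper)."""
    return days * daily_fee


def benchmark_score(final_balances: list[float]) -> float:
    """Mean final bank balance across runs (hypothetical helper)."""
    return mean(final_balances)


def fraction_of_human(score: float) -> float:
    """Score expressed as a share of the human estimate (hypothetical helper)."""
    return score / HUMAN_ESTIMATE


# Made-up final balances for five runs, chosen to show how one derailed
# run drags the average down. These are NOT real benchmark results.
hypothetical_runs = [9200.0, 450.0, 7800.0, 6100.0, 3950.0]

if __name__ == "__main__":
    print(f"Fees over a run: {total_location_fees():.2f} USD")
    print(f"Hypothetical score: {benchmark_score(hypothetical_runs):.2f} USD")
    print(f"Top score vs. human estimate: {fraction_of_human(8017.59):.1%}")
```

The fees alone add up to 730 dollars over a run, which exceeds the 500 dollar starting balance, so an agent cannot cover them from its starting capital without making sales. Because the score is a plain mean over 5 runs, a single derailed run has a visible effect on the reported number.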
- 1. Claude Opus 4.6: 8017.59 USD (Anthropic)
- 2. 7204.14 USD (Anthropic)
- 3. 5634.41 USD (Z.AI)
- 4. 5478.16 USD (Google DeepMind)
- 5. 4967.06 USD (Anthropic)
- 6. 4432.12 USD (Z.AI)
- 7. 3838.74 USD (Anthropic)
- 8. Gemini 3.1 Pro Preview (Custom tools): 3774.25 USD (Google DeepMind)
- 9. 3634.72 USD (Google DeepMind)
- 10. 3591.33 USD (OpenAI)
- 11. 2376.82 USD (Z.AI)
Category: Agentic
Created by: Andon Labs
Models tested: 11
Leader: Claude Opus 4.6
Top score: 8017.59 USD
Updated June 2026