METR 50% Time Horizon
Agentic
The length of a task, measured in minutes of human expert time, that a model can complete autonomously with 50% reliability. It is the field's most-cited capability trend. Higher is better.
METR has not yet published horizons for Claude Fable 5, Claude Opus 4.8 or GPT-5.5; the newest measured frontier model is Claude Opus 4.6.
METR runs frontier models as autonomous agents on 228 diverse software and reasoning tasks (the Time Horizon 1.1 suite, built on HCAST, RE-Bench and SWAA). Each task has a human baseline: the time skilled professionals take to complete it. For every model, METR fits a logistic curve relating human task length to the model's success probability; the 50% time horizon is the task length at which that curve crosses 50% success. Horizons have grown exponentially, doubling roughly every 129 days since 2023 under the v1.1 analysis. METR cautions that measurements above 16 hours are unreliable with the current task suite, and that confidence intervals on the newest models are wide.
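The curve fit and the doubling trend described above can be sketched in a few lines of Python. This is a minimal illustration, not METR's code: the task outcomes are made-up data, the fit is a plain unweighted logistic regression on log2 of the human task length, and the names `fit_logistic`, `horizon_minutes` and `extrapolate_horizon` are hypothetical helpers introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=50):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    xs: log2 of human task length in minutes; ys: 1 for success, 0 for failure.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept a
            g1 += (y - p) * x      # gradient w.r.t. slope b
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Task length (minutes) at which the fitted curve crosses probability p."""
    x = (math.log(p / (1.0 - p)) - a) / b
    return 2.0 ** x

def extrapolate_horizon(horizon, days, doubling_days=129.0):
    """Project a horizon forward assuming a constant doubling time (an assumption)."""
    return horizon * 2.0 ** (days / doubling_days)

# Hypothetical outcomes for one model: one attempt per task, tasks of 1..1024 minutes.
task_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
succeeded    = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

xs = [math.log2(m) for m in task_minutes]
a, b = fit_logistic(xs, succeeded)
horizon = horizon_minutes(a, b)
print(f"50% time horizon: {horizon:.1f} min")
print(f"projected after one year: {extrapolate_horizon(horizon, 365):.1f} min")
```

Fitting on the logarithm of task length keeps tasks of one minute and of many hours on a comparable scale. The slope `b` comes out negative because success becomes less likely as tasks get longer. The one-year projection is only as good as the assumption that the 129-day doubling time continues to hold.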
Leaderboard
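| # | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 718.81 min | Anthropic |
| 2 | | 384.15 min | Google DeepMind |
| 3 | | 352.25 min | OpenAI |
| 4 | | 349.53 min | OpenAI |
| 5 | | 341.74 min | OpenAI |
| 6 | | 292.99 min | Anthropic |
| 7 | | 224.33 min | Google DeepMind |
| 8 | | 223.71 min | OpenAI |
| 9 | | 203.01 min | OpenAI |
| 10 | | 119.73 min | OpenAI |
| 11 | | 100.47 min | Anthropic |
| 12 | | 100.37 min | Anthropic |
| 13 | | 38.83 min | OpenAI |
| 14 | | 3.99 min | OpenAI |
| 15 | | 3.73 min | OpenAI |
| 16 | | 0.6 min | OpenAI |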
Details:
Category: Agentic
Created by: METR
Models tested: 16
Leader: Claude Opus 4.6
Top score: 718.81 min
Updated: May 2026