METR 50% Time Horizon

Name: METR 50% Time Horizon leaderboard
Creator: METR

Agentic

The longest task, measured in minutes of human expert working time, that a model can complete autonomously with 50% reliability. It is the field's most-cited capability trend. Higher is better.

METR has not yet published horizons for Claude Fable 5, Claude Opus 4.8 or GPT-5.5; the newest measured frontier model is Claude Opus 4.6.

METR runs frontier models as autonomous agents on 228 diverse software and reasoning tasks (the Time Horizon 1.1 suite, built on HCAST, RE-Bench and SWAA). Each task has a human baseline: the time skilled professionals take to complete it. For every model, METR fits a logistic curve relating human task length to the model's success probability; the 50% time horizon is the task length at which that curve crosses 50% success. Under the v1.1 analysis, horizons have grown exponentially since 2023, doubling roughly every 129 days. METR cautions that measurements above 16 hours are unreliable with the current task suite, and that confidence intervals on the newest models are wide.
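The curve fit and the doubling-time arithmetic described above can be sketched in a few lines. This is an illustrative toy, not METR's pipeline (that lives in METR/eval-analysis-public): it assumes the logistic is fitted on log2 of the human task length, uses plain unweighted maximum likelihood, and runs on made-up outcomes. The function names `fit_logistic`, `horizon_minutes` and `growth_factor` are hypothetical.

```python
import math


def fit_logistic(log_lengths, outcomes, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    x is assumed to be log2(human task length in minutes).
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the log-likelihood gradient and the negated Hessian.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(log_lengths, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Solve the 2x2 Newton system H * delta = g in closed form.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def horizon_minutes(a, b):
    """Task length at which the fitted curve crosses 50% success."""
    # sigmoid(a + b * log2(t)) = 0.5  <=>  a + b * log2(t) = 0
    return 2.0 ** (-a / b)


def growth_factor(days, doubling_days=129.0):
    """Multiplicative horizon growth over `days` under steady doubling."""
    return 2.0 ** (days / doubling_days)


# Made-up outcomes (1 = success) for tasks of 1, 2, 4, ..., 128 minutes.
lengths_min = [1, 2, 4, 8, 16, 32, 64, 128]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

a, b = fit_logistic([math.log2(t) for t in lengths_min], outcomes)
print(f"50% horizon: {horizon_minutes(a, b):.1f} min")
print(f"growth over one year: {growth_factor(365):.1f}x")
```

The toy data is symmetric around log2(t) = 3.5, so the fitted horizon lands at 2^3.5, about 11.3 minutes. A real analysis would also need confidence intervals, which this sketch omits.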

Leaderboard

#    Model                     Score (min)   Provider
1    Claude Opus 4.6           718.81        Anthropic
2    Gemini 3.1 Pro Preview    384.15        Google DeepMind
3    GPT-5.2                   352.25        OpenAI
4    GPT-5.3-Codex             349.53        OpenAI
5    GPT-5.4                   341.74        OpenAI
6    Claude Opus 4.5           292.99        Anthropic
7    Gemini 3 Pro              224.33        Google DeepMind
8    GPT-5.1-Codex-Max         223.71        OpenAI
9    GPT-5                     203.01        OpenAI
10   o3                        119.73        OpenAI
11   Claude Opus 4.1           100.47        Anthropic
12   Claude Opus 4             100.37        Anthropic
13   o1                        38.83         OpenAI
14   GPT-4                     3.99          OpenAI
15   GPT-4 Turbo               3.73          OpenAI
16   GPT-3.5 Turbo Instruct    0.6           OpenAI

Sources:

Per-model results YAML (benchmark_results_1_1.yaml)
METR time horizons data page
Time Horizon 1.1 (METR blog)
Measuring AI Ability to Complete Long Tasks (arXiv 2503.14499)
METR/eval-analysis-public

Details:

Category: Agentic
Created by: METR
Models tested: 16
Leader: Claude Opus 4.6
Top score: 718.81 min

Updated May 2026