Claude Sonnet 4.6 vs GPT-5.4
The workhorse tier, where most real coding spend actually goes: Claude Sonnet 4.6 against GPT-5.4. Neither is the headline flagship; both are what teams quietly run all day.
GPT-5.4 is the better value of the two workhorses: $2.50 per million input tokens versus Sonnet 4.6's $3, with a stronger showing on our boards (56.8 versus 51.7 on the Intelligence Index at best settings, and a standout 56% on DeepSWE).
Sonnet 4.6 remains the dependable pick inside the Anthropic ecosystem: if your agent, skills, and workflows are Claude-based, it is the model you run when Opus or Fable is overkill, and its quality on everyday edits is well proven.
If you are choosing fresh with no ecosystem pull, GPT-5.4 wins this tier. If you are already on Claude, Sonnet 4.6 is close enough that switching is not worth the cost.
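To put the input-price gap in concrete terms, here is a minimal Python sketch of the arithmetic. It uses only the two input prices quoted above. The monthly token volume is a made-up figure for illustration, and output-token pricing is left out because it is not quoted here.

```python
# Input prices in dollars per million tokens, as quoted in this comparison.
INPUT_PRICE_PER_MILLION = {
    "GPT-5.4": 2.50,
    "Claude Sonnet 4.6": 3.00,
}


def input_cost(model: str, input_tokens: int) -> float:
    """Return the input-token cost in dollars for one model."""
    return INPUT_PRICE_PER_MILLION[model] * input_tokens / 1_000_000


# Hypothetical workload (an assumption): 2 billion input tokens per month.
monthly_input_tokens = 2_000_000_000

gpt_cost = input_cost("GPT-5.4", monthly_input_tokens)
sonnet_cost = input_cost("Claude Sonnet 4.6", monthly_input_tokens)

print(f"GPT-5.4:           ${gpt_cost:,.2f}")
print(f"Claude Sonnet 4.6: ${sonnet_cost:,.2f}")
print(f"Difference:        ${sonnet_cost - gpt_cost:,.2f}")
```

On input tokens alone, Sonnet 4.6 costs 20% more than GPT-5.4 at any volume. A real bill also depends on output tokens and caching, which this sketch does not model.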
Prices and context are synced from live provider listings. Deep dives: Claude Sonnet 4.6 and GPT-5.4.
Best published configuration per model. Every config and source is on the benchmark leaderboards.
Every published configuration for Claude Sonnet 4.6 and GPT-5.4 on the benchmarks they share, charted side by side. Only these two models are plotted.
DeepSWE
Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Tau2-Bench Telecom
Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools, and strictly following policy. Pass@1, higher is better.
OSWorld-Verified
The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM using screenshots, mouse, and keyboard, scored by execution-based checks. Higher is better.
GAIA2
Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity, and dynamic events. Higher is better.
BrowseComp
OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
Artificial Analysis Intelligence Index
The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Is GPT-5.4 better than Claude Sonnet 4.6?
By our boards, modestly yes: GPT-5.4 scores higher on the Intelligence Index (56.8 versus 51.7 at best settings) and costs less ($2.50 versus $3 per million input tokens). Sonnet 4.6 stays competitive on everyday coding quality and is the natural pick if you are already in the Claude ecosystem.
Should I use a workhorse model or a flagship for coding agents?
Workhorses for the everyday 90%: models like GPT-5.4 and Sonnet 4.6 handle routine edits, tests, and refactors at a third of flagship prices. Route the genuinely hard tasks (architecture, gnarly debugging, long autonomous runs) to a flagship. Most agent setups support per-task model switching, so use it.
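As a rough illustration of per-task switching, here is a minimal Python sketch of a router. Everything in it is an assumption for the example: the model labels are placeholders and not real API identifiers, and the task kinds are simply the hard categories named above turned into strings. Real agent setups expose model switching in their own ways.

```python
from dataclasses import dataclass

# Hypothetical labels, not real API model identifiers.
WORKHORSE = "workhorse-model"  # e.g. GPT-5.4 or Claude Sonnet 4.6
FLAGSHIP = "flagship-model"    # whichever flagship your setup offers

# Task kinds treated as genuinely hard; the string names are assumptions.
HARD_TASK_KINDS = {"architecture", "gnarly_debugging", "long_autonomous_run"}


@dataclass
class Task:
    kind: str         # e.g. "edit", "test", "refactor", "architecture"
    description: str


def pick_model(task: Task) -> str:
    """Route hard task kinds to the flagship, everything else to the workhorse."""
    return FLAGSHIP if task.kind in HARD_TASK_KINDS else WORKHORSE


tasks = [
    Task("edit", "Rename a config field"),
    Task("test", "Add unit tests for the parser"),
    Task("architecture", "Design the new plugin system"),
]

for task in tasks:
    print(f"{task.kind:>14} -> {pick_model(task)}")
```

Defaulting to the workhorse and escalating only on an explicit list keeps the expensive model opt-in, which matches the "everyday 90%" split described above.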
Type: Model comparison
Claude Sonnet 4.6: Model page
GPT-5.4: Model page
Updated: June 2026