Claude Sonnet 4.6 vs GPT-5.4

The workhorse tier, where most real coding spend actually goes: Claude Sonnet 4.6 against GPT-5.4. Neither is the headline flagship; both are what teams quietly run all day.

The verdict

GPT-5.4 is the better value of the two workhorses. Input costs $2.50 per million tokens versus Sonnet 4.6's $3 (output is $15 per million for both), and it scores higher on our boards: 56.8 versus 51.7 on the Intelligence Index at best settings, and 56% versus 32% on DeepSWE.

Sonnet 4.6 remains the dependable pick inside the Anthropic ecosystem: if your agent, skills, and workflows are Claude-based, it is the model you run when Opus or Fable is overkill. Its quality on everyday edits is well-proven, and it leads our OSWorld-Verified board (81.45% versus 75%, measured under different harnesses).

If you are choosing fresh with no ecosystem pull, GPT-5.4 wins this tier. If you are already on Claude, Sonnet 4.6 is close enough that switching costs are not worth it.

The facts, side by side
| | Claude Sonnet 4.6 | GPT-5.4 |
| --- | --- | --- |
| Provider | Anthropic | OpenAI |
| Input price | $3.00 / 1M tokens | $2.50 / 1M tokens |
| Output price | $15.00 / 1M tokens | $15.00 / 1M tokens |
| Context | 1M tokens | 1.1M tokens |
| Open weights | No | No |
| Free tier | No | No |
| Released | Feb 2026 | Mar 2026 |

Prices and context are synced from live provider listings. Deep dives: Claude Sonnet 4.6 and GPT-5.4.
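
To see what the input-price gap means per task, here is a minimal sketch that applies the list prices from the table to a workload. The 200k-input / 20k-output token mix is a hypothetical agent task chosen for illustration, not a measured figure, and the calculation ignores caching and batch discounts.

```python
# List prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list price (no caching or batch discounts)."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical workload: 200k input tokens and 20k output tokens per task.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f} per task")
```

Under these assumptions the task costs $0.90 on Sonnet 4.6 and $0.80 on GPT-5.4. Because output pricing is identical, the saving shrinks as the share of output tokens grows.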

Benchmark scores
| Benchmark | Claude Sonnet 4.6 | GPT-5.4 |
| --- | --- | --- |
| OSWorld-Verified | 81.45% (Pointer Agent, 100 steps) | 75% (Vendor harness) |
| BrowseComp | 74.7% (Max thinking, tools) | 82.7% (Browsing) |
| GAIA2 | 51.9% (High, ReAct baseline) | 55.6% (High, ReAct baseline) |
| Artificial Analysis Intelligence Index | 51.7 (Adaptive Reasoning, Max Effort) | 56.8 (xhigh) |
| DeepSWE | 32% (High) | 56% (Extra High) |

Scores listed for one model only:
Vending-Bench 2: 7204.14 USD
SWE-bench Verified: 79.6% (Vendor harness)
SWE-bench Pro: 59.1% (xHigh)

Best published configuration per model. Every config and source is on the benchmark leaderboards.

Benchmarks, head to head

Every published configuration for Claude Sonnet 4.6 and GPT-5.4 on the benchmarks they share, charted side by side. Only these two models are plotted.

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.

Frequently asked questions
Is GPT-5.4 better than Claude Sonnet 4.6?

By our boards, modestly yes: GPT-5.4 scores higher on the Intelligence Index (56.8 versus 51.7 at best settings) and costs less on input ($2.50 versus $3 per million tokens; output is $15 per million for both). Sonnet 4.6 stays competitive on everyday coding quality and is the natural pick if you are already in the Claude ecosystem.

Should I use a workhorse model or a flagship for coding agents?

Workhorses for the everyday 90%: models like GPT-5.4 and Sonnet 4.6 handle routine edits, tests, and refactors at a third of flagship prices. Route the genuinely hard tasks (architecture, gnarly debugging, long autonomous runs) to a flagship. Most agent setups support per-task model switching, so use it.
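
Per-task switching can be as simple as a lookup on task type. The sketch below is illustrative only: the task categories, the `Task` type, and the model identifier strings are hypothetical placeholders, not any agent framework's real API or a provider's actual model IDs.

```python
from dataclasses import dataclass

# Hypothetical task categories that get escalated; they mirror the
# "genuinely hard" examples above (architecture, gnarly debugging, long runs).
HARD_KINDS = {"architecture", "debugging", "long_autonomous_run"}


@dataclass
class Task:
    kind: str          # e.g. "edit", "test", "refactor", "architecture"
    description: str


def pick_model(task: Task,
               workhorse: str = "workhorse-model",   # placeholder ID
               flagship: str = "flagship-model") -> str:  # placeholder ID
    """Send hard task kinds to the flagship and everything else to the workhorse."""
    return flagship if task.kind in HARD_KINDS else workhorse


print(pick_model(Task("refactor", "rename a helper across the package")))
print(pick_model(Task("architecture", "split the monolith into services")))
```

A static rule like this is the cheapest place to start; teams that want finer control can replace the set lookup with a classifier or a retry-on-failure escalation.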
