SWE-Lancer (IC Diamond)
Coding
OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.
Only OpenAI models have published SWE-Lancer scores so far; Anthropic, Google and open-weight vendors do not report it, so this board is narrow until third parties run it.
SWE-Lancer contains over 1,400 real Upwork freelance tasks worth $1 million in actual client payouts, ranging from $50 bug fixes to $32,000 feature builds, mostly from the Expensify open-source app. The public Diamond split is worth $500,800 and has two task types: individual-contributor (IC) tasks, graded by triple-verified end-to-end tests run with internet access disabled, and manager tasks, graded against the choices of the engineering managers originally hired for the work. A model earns a task's dollar value only if it fully passes; the leaderboard reports the dollars earned as a percentage of the total pool value on the IC Diamond subset.
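The payout-weighted scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's grading harness: the `TaskResult` type, the `percent_of_pool_earned` function, the task identifiers and the dollar values are all hypothetical, chosen only to show how all-or-nothing payouts turn into a percent-of-pool score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str        # hypothetical identifier, not a real SWE-Lancer task ID
    payout_usd: float   # dollar value attached to the task
    passed: bool        # True only if the task's tests passed in full


def percent_of_pool_earned(results):
    """Return dollars earned as a percentage of the total pool value.

    A task contributes its full payout if it passed and nothing otherwise;
    there is no partial credit in this sketch.
    """
    pool = sum(r.payout_usd for r in results)
    if pool == 0:
        raise ValueError("task pool has no value")
    earned = sum(r.payout_usd for r in results if r.passed)
    return 100.0 * earned / pool


# Made-up example pool: four tasks worth 2,000 dollars in total.
example = [
    TaskResult("task-a", 50, True),
    TaskResult("task-b", 1000, False),
    TaskResult("task-c", 250, True),
    TaskResult("task-d", 700, True),
]

score = percent_of_pool_earned(example)
pass_rate = 100.0 * sum(r.passed for r in example) / len(example)
print(f"percent of pool earned: {score:.1f}%")
print(f"plain pass rate:        {pass_rate:.1f}%")
```

In this made-up pool the model passes three of four tasks but earns only half of the money, because the one failed task carries half of the pool's value. Weighting by payout means that expensive tasks dominate the score in a way a plain pass rate would hide.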
Details:
Category: Coding
Created by: OpenAI
Models tested: 3
Leader: GPT-5.3-Codex
Top score: 81.4%
Updated June 2026