DeepSWE

[Figure: Pass@1 (0% to 80%) plotted against average cost per task, one labeled point per model configuration, with an arrow marking the most efficient direction.]

| Model | Pass@1 | Avg cost | Avg time | Out tok |
|---|---|---|---|---|
| gpt-5.5[xhigh] | 70%±3% | $6.61 | 21m | 47k |
| claude-opus-4.8[max] | 58%±2% | $12.58 | 43m | 136k |
| gpt-5.4[xhigh] | 56%±2% | $4.38 | 27m | 71k |
| claude-opus-4.7[max] | 54%±5% | $18.19 | 39m | 103k |
| claude-sonnet-4.6[high] | 32%±2% | $5.52 | 42m | 76k |
| gemini-3.5-flash[medium] | 28%±4% | $7.42 | 17m | 189k |
| claude-opus-4.6[max] | 28%±4% | $5.39 | 30m | 44k |
| gpt-5.4-mini[xhigh] | 24%±3% | $2.08 | 33m | 135k |
| kimi-k2.6 | 24%±2% | $3.16 | 56m | 84k |
| minimax-m3 | 20%±4% | $5.57 | 57m | 98k |
| mimo-v2.5-pro | 19%±2% | $1.99 | 28m | 49k |
| glm-5.1 | 18%±1% | $7.46 | 35m | 49k |
| grok-build-0.1 | 13%±2% | $6.60 | 44m | 52k |
| gemini-3.1-pro | 10%±3% | $1.84 | 36m | 53k |
| deepseek-v4-pro | 8%±3% | $4.22 | 37m | 50k |
| gemini-3-flash | 5%±2% | $1.53 | 39m | 233k |
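
One way to read the cost and Pass@1 columns together is to divide average cost by the pass rate, which gives a rough expected spend per solved task. The page does not define this metric; `cost_per_solved_task` and `rank_by_cost_per_solve` are hypothetical names used only for this sketch, and the rows are a subset copied from the leaderboard above.

```python
# Subset of leaderboard rows: (model, Pass@1 as a fraction, average cost per task in USD).
ROWS = [
    ("gpt-5.5[xhigh]", 0.70, 6.61),
    ("claude-opus-4.8[max]", 0.58, 12.58),
    ("gpt-5.4[xhigh]", 0.56, 4.38),
    ("gpt-5.4-mini[xhigh]", 0.24, 2.08),
    ("gemini-3-flash", 0.05, 1.53),
]


def cost_per_solved_task(pass_at_1: float, avg_cost: float) -> float:
    """Assumed metric (not from the source): average spend divided by the solve rate."""
    if pass_at_1 <= 0:
        raise ValueError("pass@1 must be positive")
    return avg_cost / pass_at_1


def rank_by_cost_per_solve(rows):
    """Return (model, cost per solved task) pairs, cheapest first."""
    scored = [(model, cost_per_solved_task(p, cost)) for model, p, cost in rows]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, usd in rank_by_cost_per_solve(ROWS):
        print(f"{model:24s} ${usd:6.2f} per solved task")
```

Under this assumed metric, a cheap model with a low pass rate can cost more per solved task than a pricier model that solves most tasks, which is why cost and Pass@1 are best read together.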


All models run on mini-swe-agent for consistency.

Today’s leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro’s, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today’s frontier coding agents actually perform in software engineering work.
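
To make the idea of overlapping confidence intervals concrete, the sketch below checks adjacent rows among the top five leaderboard entries. It assumes the ± value is the half-width of a symmetric interval around Pass@1, which the page does not state; `interval`, `overlaps`, and `adjacent_overlaps` are hypothetical helper names.

```python
# Top five leaderboard rows: (model, Pass@1 in percentage points, ± margin in points).
TOP_ROWS = [
    ("gpt-5.5[xhigh]", 70, 3),
    ("claude-opus-4.8[max]", 58, 2),
    ("gpt-5.4[xhigh]", 56, 2),
    ("claude-opus-4.7[max]", 54, 5),
    ("claude-sonnet-4.6[high]", 32, 2),
]


def interval(score, margin):
    """Assumption: the reported ± is the half-width of a symmetric interval."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """Two closed intervals overlap when each one starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def adjacent_overlaps(rows):
    """For each pair of neighbouring rows, report whether their intervals overlap."""
    results = []
    for (m1, s1, e1), (m2, s2, e2) in zip(rows, rows[1:]):
        results.append((m1, m2, overlaps(interval(s1, e1), interval(s2, e2))))
    return results


if __name__ == "__main__":
    for first, second, hit in adjacent_overlaps(TOP_ROWS):
        verdict = "overlap" if hit else "separated"
        print(f"{first} vs {second}: {verdict}")
```

Under that assumption, the top entry is separated from the second, while the second, third, and fourth entries overlap pairwise with their neighbours.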

All 113 tasks

Task Examples

Abort pending body reads on shutdown

Fix PromQL label sorting across typed and untyped values

Add config file parsing to Cliffy commands

Add deterministic map conflict detection to Y.Map writes

Add trap coredump generation to wasmi

Add XML diff, patch, and merge operations to etree