[Figure: DeepSWE score (0% to 80%) plotted against average cost per task, one point per model configuration, with an arrow marking the "most efficient" direction.]

Model                       Pass@1    Avg cost   Avg time   Out tok
gpt-5.5 [xhigh]             70%±3%    $6.61      21m        47k
claude-opus-4.8 [max]       58%±2%    $12.58     43m        136k
gpt-5.4 [xhigh]             56%±2%    $4.38      27m        71k
claude-opus-4.7 [max]       54%±5%    $18.19     39m        103k
claude-sonnet-4.6 [high]    32%±2%    $5.52      42m        76k
gemini-3.5-flash [medium]   28%±4%    $7.42      17m        189k
claude-opus-4.6 [max]       28%±4%    $5.39      30m        44k
gpt-5.4-mini [xhigh]        24%±3%    $2.08      33m        135k
kimi-k2.6                   24%±2%    $3.16      56m        84k
minimax-m3                  20%±4%    $5.57      57m        98k
mimo-v2.5-pro               19%±2%    $1.99      28m        49k
glm-5.1                     18%±1%    $7.46      35m        49k
grok-build-0.1              13%±2%    $6.60      44m        52k
gemini-3.1-pro              10%±3%    $1.84      36m        53k
deepseek-v4-pro             8%±3%     $4.22      37m        50k
gemini-3-flash              5%±2%     $1.53      39m        233k


All models run on mini-swe-agent for consistency.
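One way to read cost and accuracy together is the expected spend per solved task. The sketch below assumes that "Avg cost" is averaged over every attempted task, solved or not, so dividing it by Pass@1 gives a rough cost per success. The metric and the function names are illustrative and are not part of DeepSWE's own reporting; only the four top rows of the leaderboard are included.

```python
# Rows copied from the leaderboard: (model, Pass@1 as a fraction, avg cost per task in USD).
ROWS = [
    ("gpt-5.5 [xhigh]", 0.70, 6.61),
    ("claude-opus-4.8 [max]", 0.58, 12.58),
    ("gpt-5.4 [xhigh]", 0.56, 4.38),
    ("claude-opus-4.7 [max]", 0.54, 18.19),
]


def cost_per_solved_task(pass_at_1, avg_cost):
    """Hypothetical metric: average spend divided by the fraction of tasks solved."""
    if pass_at_1 <= 0:
        raise ValueError("pass@1 must be positive")
    return avg_cost / pass_at_1


def rank_by_cost_per_solve(rows):
    """Return (model, dollars per solved task) pairs, cheapest first."""
    scored = [(m, round(cost_per_solved_task(p, c), 2)) for m, p, c in rows]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, dollars in rank_by_cost_per_solve(ROWS):
        print(f"{model:24s} ${dollars:.2f} per solved task")
```

Under this assumption, gpt-5.4 [xhigh] comes out cheapest per solved task among these four rows, ahead of gpt-5.5 [xhigh], even though its Pass@1 is lower.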

Today’s leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and adjacent configurations often have overlapping confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

  • Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
  • High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
  • Real-world complexity: Prompts are ~half the length of SWE-bench Pro’s, yet solutions require 5.5x more code and ~2x more output tokens.
  • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
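To illustrate the distinction in the last bullet, here is a toy sketch of a behavior-level check. Everything in it is hypothetical and not taken from any DeepSWE task or verifier: the check only inspects what a function returns, so two differently written implementations both pass, while a wrong one fails.

```python
def sort_labels_builtin(labels):
    """Toy implementation 1: case-insensitive sort using the built-in sorted()."""
    return sorted(labels, key=str.lower)


def sort_labels_insertion(labels):
    """Toy implementation 2: the same ordering via a hand-written insertion sort."""
    out = []
    for label in labels:
        i = 0
        # Advance past every element that should come before (or tie with) label.
        while i < len(out) and out[i].lower() <= label.lower():
            i += 1
        out.insert(i, label)
    return out


def behavior_check(fn):
    """Passes for any implementation that returns the expected ordering.

    It does not care which helper functions or algorithms were used.
    """
    return fn(["b", "A", "c"]) == ["A", "b", "c"]


if __name__ == "__main__":
    print(behavior_check(sort_labels_builtin))
    print(behavior_check(sort_labels_insertion))
    print(behavior_check(lambda labels: list(labels)))  # unsorted copy
```

A check tied to implementation details, such as asserting that `sorted` was called, would reject the second implementation even though its behavior is correct.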

The result is a benchmark that reflects how today’s frontier coding agents actually perform in software engineering work.
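The overlap criterion mentioned in the opening paragraph can be checked mechanically. The sketch below treats each ± value from the leaderboard as a symmetric interval around the score. That is an assumption: the page does not say how the margins are computed. The helper names are illustrative.

```python
def interval(score, margin):
    """Return (low, high) for a score reported as score ± margin, in percentage points."""
    return (score - margin, score + margin)


def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


# Values copied from the leaderboard.
gpt_5_5 = interval(70, 3)    # (67, 73)
opus_4_8 = interval(58, 2)   # (56, 60)
gpt_5_4 = interval(56, 2)    # (54, 58)

if __name__ == "__main__":
    print(intervals_overlap(gpt_5_5, opus_4_8))  # False: the top score stands apart
    print(intervals_overlap(opus_4_8, gpt_5_4))  # True: second and third place overlap
```

Read this way, the top entry is separated from the rest, while some adjacent entries further down still overlap.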

Task Examples

Abort pending body reads on shutdown

Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

capricorn86/happy-dom (TypeScript)

Fix PromQL label sorting across typed and untyped values

PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

prometheus/prometheus (Go)

Add config file parsing to Cliffy commands

Add command-level config file loading, parsing, merging, and precedence handling.

c4spar/cliffy (TypeScript)

Add deterministic map conflict detection to Y.Map writes

Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

yjs/yjs (JavaScript)

Add trap coredump generation to wasmi

Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

wasmi-labs/wasmi (Rust)

Add XML diff, patch, and merge operations to etree

Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

beevik/etree (Go)

All 113 tasks