| Model | Pass@1 | Avg cost | Avg time | Output tokens |
| --- | --- | --- | --- | --- |
| gpt-5.5[xhigh] | 70%±3% | $6.61 | 21m | 47k |
| claude-opus-4.8[max] | 58%±2% | $12.58 | 43m | 136k |
| gpt-5.4[xhigh] | 56%±2% | $4.38 | 27m | 71k |
| claude-opus-4.7[max] | 54%±5% | $18.19 | 39m | 103k |
| claude-sonnet-4.6[high] | 32%±2% | $5.52 | 42m | 76k |
| gemini-3.5-flash[medium] | 28%±4% | $7.42 | 17m | 189k |
| claude-opus-4.6[max] | 28%±4% | $5.39 | 30m | 44k |
| gpt-5.4-mini[xhigh] | 24%±3% | $2.08 | 33m | 135k |
| kimi-k2.6 | 24%±2% | $3.16 | 56m | 84k |
| minimax-m3 | 20%±4% | $5.57 | 57m | 98k |
| mimo-v2.5-pro | 19%±2% | $1.99 | 28m | 49k |
| glm-5.1 | 18%±1% | $7.46 | 35m | 49k |
| grok-build-0.1 | 13%±2% | $6.60 | 44m | 52k |
| gemini-3.1-pro | 10%±3% | $1.84 | 36m | 53k |
| deepseek-v4-pro | 8%±3% | $4.22 | 37m | 50k |
| gemini-3-flash | 5%±2% | $1.53 | 39m | 233k |
All models are run on mini-swe-agent for consistency.
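One way to read the cost and Pass@1 columns together is to estimate the spend per solved task. The sketch below is illustrative only: it assumes "Avg cost" is the average spend per attempted task (the page does not say), and the function names `cost_per_solved_task` and `rank_by_cost_per_solve` are hypothetical helpers, not part of any DeepSWE tooling. Only the top four leaderboard rows are included.

```python
# Rows copied from the leaderboard: (model, Pass@1 in percent, average cost in USD).
LEADERBOARD = [
    ("gpt-5.5[xhigh]", 70, 6.61),
    ("claude-opus-4.8[max]", 58, 12.58),
    ("gpt-5.4[xhigh]", 56, 4.38),
    ("claude-opus-4.7[max]", 54, 18.19),
]


def cost_per_solved_task(avg_cost_usd: float, pass_at_1_percent: float) -> float:
    """Estimated spend per solved task.

    ASSUMPTION: avg_cost_usd is the mean cost per attempted task, so dividing
    by the pass rate gives the expected cost of one successful solution.
    """
    if not 0 < pass_at_1_percent <= 100:
        raise ValueError("pass@1 must be in (0, 100]")
    return avg_cost_usd / (pass_at_1_percent / 100)


def rank_by_cost_per_solve(rows):
    """Return (model, cost per solved task) pairs, cheapest first."""
    scored = [(model, cost_per_solved_task(cost, p)) for model, p, cost in rows]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, cost in rank_by_cost_per_solve(LEADERBOARD):
        print(f"{model}: ${cost:.2f} per solved task")
```

Under this assumption, a cheaper model with a slightly lower pass rate can rank ahead of the top scorer, which is why the two columns are worth reading together rather than separately.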
Today’s leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and the confidence intervals of adjacent configurations often overlap. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:
- Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
- High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
- Real-world complexity: Prompts are ~half the length of SWE-bench Pro’s, yet solutions require 5.5x more code and ~2x more output tokens.
- Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
The result is a benchmark that reflects how today’s frontier coding agents actually perform in software engineering work.
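To make the separation claim concrete, the sketch below checks whether the reported intervals of neighboring leaderboard rows intersect. It assumes the ± value is a symmetric half-width around Pass@1; the page does not state whether it is a standard error or a confidence interval, and interval overlap is only a rough heuristic, not a significance test. `Row`, `intervals_overlap`, and `adjacent_overlaps` are hypothetical names used for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    pass_at_1: float  # percent
    margin: float     # the reported ± value, in percentage points


# Top four rows copied from the leaderboard.
ROWS = [
    Row("gpt-5.5[xhigh]", 70, 3),
    Row("claude-opus-4.8[max]", 58, 2),
    Row("gpt-5.4[xhigh]", 56, 2),
    Row("claude-opus-4.7[max]", 54, 5),
]


def intervals_overlap(a: Row, b: Row) -> bool:
    """True if [score - margin, score + margin] for a and b intersect.

    ASSUMPTION: the ± value is a symmetric half-width around Pass@1.
    """
    return abs(a.pass_at_1 - b.pass_at_1) <= a.margin + b.margin


def adjacent_overlaps(rows):
    """Pair each row with the next one and report whether they overlap."""
    return [(a.model, b.model, intervals_overlap(a, b)) for a, b in zip(rows, rows[1:])]


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(ROWS):
        verdict = "overlap" if overlap else "separated"
        print(f"{upper} vs {lower}: {verdict}")
```

On these four rows the leader is separated from second place, while the next two adjacent pairs still overlap, so the check is best read row by row rather than as a blanket statement about the whole table.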
Task Examples
Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
capricorn86/happy-dom (TypeScript)
Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
prometheus/prometheus (Go)
Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.
c4spar/cliffy (TypeScript)
Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
yjs/yjs (JavaScript)
Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
wasmi-labs/wasmi (Rust)
Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.
beevik/etree (Go)