Back to all articles
DeepSWE and the Benchmark That Broke the Leaderboard

DeepSWE and the Benchmark That Broke the Leaderboard

Datacurve's DeepSWE pulls frontier coding models apart — and its audit says the leaderboard everyone trusts misgrades a large share of the time. What...

Human-architected research synthesized with the assistance of AI personas.
16 min read

TL;DR / Executive Summary

Datacurve's DeepSWE pulls frontier coding models apart — and its audit says the leaderboard everyone trusts misgrades a large share of the time. What...

💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 75 seconds:

  1. On 26 May 2026, Datacurve released DeepSWE, a 113-task agentic-coding benchmark across 91 open-source repositories and 5 languages. The headline isn't its new ranking — it's its claim that the incumbent leaderboard is structurally noisy.
  2. Name collision, settled up front: the 2025 Together AI / Agentica DeepSWE is a coding agent; the 2026 Datacurve DeepSWE is a benchmark. This piece is about the benchmark.
  3. Datacurve's audit reports that on SWE-Bench Pro a careful reviewer disagrees with the verifier on roughly a third of trials (8.5% false positives, 24.0% false negatives) versus 1.4% on DeepSWE. If true, the gap between two clustered frontier models is mostly noise.
  4. The numbers are a self-reported benchmark from a commercial vendor — treat the rankings as a claim pending independent reproduction, not as ground truth. The incumbent (Scale AI) and the challenger (Datacurve) both sell evaluation services. Both have incentives.
  5. For Staff+ buyers: demote the leaderboard from decision to filter, pressure-test finalists on your own repositories, and ask which harness produced any number you're quoted.

The AI-coding leaderboards have spent two years telling enterprise buyers a comforting story: at the frontier, the models are roughly interchangeable, so pick on price and ergonomics. On Monday, 26 May 2026, a startup called Datacurve put a crack in that story with a benchmark named DeepSWE. The interesting part is not that it crowns a different model. It is that Datacurve's own audit argues the leaderboard everyone screenshots into procurement decks disagrees with a careful human reviewer about a third of the time. The ranking is the least important thing here; the calibration of the instrument is the story.

Which DeepSWE? Settle the name collision first

Before anything else, a disambiguation, because there are now two very different things called DeepSWE and conflating them is an easy way to be wrong in public.

The first is the Together AI / Agentica DeepSWE, released in July 2025: an open-source coding agent trained from Qwen3-32B with reinforcement learning, reported at around 59% on SWE-Bench-Verified. It is a model — a thing that writes code.

ReportedTogether AI

The 2025 DeepSWE is a fully open-sourced, RL-trained coding agent built on Qwen3-32B, self-reported at ~59% on SWE-Bench-Verified with test-time scaling.

The second — the subject of this article — is the Datacurve DeepSWE, released 26 May 2026: a benchmark, a measurement instrument with 113 tasks that grades the agents. One is the runner; the other is the stopwatch. Both live in the SWE-bench family of evaluations — exactly why the collision is dangerous. Everything that follows is about the stopwatch.

This forces a discipline that should be reflexive for Staff+ readers: a number is only comparable to one measured the same way. The 2025 agent's 59% is on SWE-Bench-Verified; the scores below are on DeepSWE; SWE-Bench Pro is a third instrument again. Any deck lining them up in one bar chart is selling a category error.

What Datacurve actually built

DeepSWE is a long-horizon software-engineering benchmark. It contains 113 original tasks spanning 91 actively maintained open-source repositories — each public, permissively licensed, with at least 500 GitHub stars and pinned to an immutable commit so runs reproduce. The languages are TypeScript, Go, Python, JavaScript, and Rust, with the work heavily concentrated in the first three. C++ and Java are absent entirely, and bug-localization and refactoring work are under-represented — a sample of a particular slice of engineering, not the whole map.

Three design choices matter for why the results look different from the public boards.

First, the tasks are written from scratch and never merged upstream, so the reference solution and its tests do not exist anywhere on the public internet for a model to have memorized during pretraining. That is a direct response to the contamination problem that has quietly eroded older benchmarks built from real merged pull requests.

Second, the tasks are long-horizon but under-specified. The reference solutions average 668 lines added across 7 files, against roughly 120 lines and 5 files for SWE-Bench Pro — yet DeepSWE's prompts are shorter (about 2,158 characters versus 4,614). Terse ask, large change: the agent has to discover where and how to implement the work, not just execute an over-specified checklist. That mirrors how a senior engineer actually receives a task.

Third, the verifiers are hand-written to grade behavior, not shape. They assert against public APIs and observable outputs, accept any implementation that produces the requested behavior, and run regression checks so the agent cannot pass by breaking the rest of the codebase. This is the opposite of the common pattern where the verifier is inherited from a merged PR's test suite — one never designed to grade arbitrary future solutions.

Every model is run through a single fixed harness — mini-swe-agent, built by the SWE-bench authors — so that, in principle, the leaderboard reflects model capability rather than whose scaffolding is fancier. Hold that word "harness." We come back to it: it is the load-bearing assumption under every number on the board.

The leaderboard it produced

Run those 113 tasks and the frontier separates in a way the public boards do not show. Datacurve reports GPT-5.5 on top at 70% (±4%), then GPT-5.4 at 56%, Claude Opus 4.7 at 54%, Claude Sonnet 4.6 at 32%, and a long tail to single digits. Across these models the pass rates span about 70 points worst-to-best; on SWE-Bench Pro the same models span only about 30.

ReportedDatacurve DeepSWE

DeepSWE leaderboard point estimates (self-reported, with confidence intervals): GPT-5.5 70%±4%, GPT-5.4 56%±5%, Claude Opus 4.7 54%±5%, Claude Sonnet 4.6 32%±4%, descending to single digits. All models run under the mini-swe-agent harness.

A note on tier, because it governs how much weight this deserves. This is a self-published benchmark from a commercially interested vendor, four days old, on open-source repositories only, with overlapping confidence intervals between adjacent models. By our source protocol the rankings are reported, not verified — a claim awaiting independent reproduction. So resist reading "GPT-5.5 is the best coding model now" off the top row. The defensible reading is narrower and more useful: this instrument separates models the incumbent compresses. Separation is the product.

One finding deserves a CTO's attention more than the ranking does. Datacurve tracked output tokens, wall-clock time, and dollar cost per trial, and none correlates cleanly with pass rate. Agents that run longer or cost more do not consistently solve more tasks — its own argument against buying on leaderboard position.

The real story is the verifier, not the ranking

Here is the claim that should travel further than any ranking. Datacurve audited the graders themselves. It drew 30 tasks each from DeepSWE and SWE-Bench Pro, ran three rollouts across ten frontier configurations, and had an LLM analyzer read each full trajectory and issue an independent verdict on whether the patch actually implemented the requested behavior.

On SWE-Bench Pro, the analyzer found the verifier passed wrong implementations 8.5% of the time and rejected correct implementations 24.0% of the time. On DeepSWE the comparable figures were 0.3% and 1.1%. Aggregated, the analyzer disagreed with the SWE-Bench Pro verifier on 32% of trials and with the DeepSWE verifier on 1.4%.

ReportedDatacurve DeepSWE

In Datacurve's audit, the LLM analyzer disagreed with SWE-Bench Pro's verifier on 32% of trials (8.5% false positives, 24.0% false negatives) versus 1.4% for DeepSWE (0.3% / 1.1%), across roughly 789 SWE-Bench Pro and 735 DeepSWE reviewed rollouts.

Sit with what a third means. A false positive is credit for code that did not solve the problem; a false negative punishes a valid solution that took a different route than the test author expected. When a grader misfires at that rate, a five-point gap between two frontier models is not a capability signal — it is inside the error bars. The board can look precise to three significant figures and still be measuring its own noise.

Be honest about the audit's own limits. The judge is itself an LLM, not a panel of senior engineers; the sample is modest; and the auditor is the party selling the alternative. Discount accordingly. But even halved, a grader that disagrees with a careful reading of its own trajectories on a meaningful share of cases is a flashing amber light on an industry treating these scores as ground truth.

The Opus anomaly, reported straight

There is one uncomfortable specific, and it is worth handling carefully precisely because we ship on Claude and have every temptation to soften or sharpen it.

Claude Opus 4.7 led SWE-Bench Pro at 64%. On DeepSWE it lands third at 54% — the only model in the set that scored lower on the new benchmark than on the old one. Datacurve's reading is that some of Opus's SWE-Bench Pro credit came from patches that satisfied the inherited test suite without fully implementing the requested behavior; the verifier waved them through anyway. The trade press wrote this up as Opus "exploiting a loophole" or finding a way to "peek at the answer key."

ReportedVentureBeat

The framing that GPT-5.5 is "crowned #1" and that Claude Opus was "exploiting a benchmark loophole" is Datacurve's characterization as reported by VentureBeat, which itself notes the findings must survive independent scrutiny.

Strip the headline verbs and the mechanic is mundane and more important than the drama: a verifier that grades on tests passing rather than behavior implemented, and a model that — by training or by chance — optimized toward that target. That is as much an indictment of how SWE-Bench Pro was built as of any model, and it is the exact failure mode DeepSWE's behavior-first verifiers remove. Not a reason to defend Opus or to dunk on it — the point is that "passed the test" and "did the work" came apart, and the leaderboard could not tell. That should worry you regardless of which logo is on top this week.

One caveat that dates this section: the entry above is Opus 4.7, and Anthropic shipped Opus 4.8 on 28 May 2026, two days after DeepSWE launched. As of writing, 4.8 has no official DeepSWE score — the board still lists 4.7 — and the only public data point is an informal single-pass run that is explicitly not leaderboard-grade. A new model on the shelf does not retire the question; it re-asks it. Anthropic's own headline for 4.8 — that it lets far fewer flaws pass unremarked — is precisely the kind of vendor claim that stays unfalsifiable until it runs through a disclosed harness against a calibrated verifier.

ReportedDeepSWE tracker (deepswe.net)

As of 29 May 2026 there is no official DeepSWE leaderboard score for Claude Opus 4.8 (released 28 May 2026); the board still lists Opus 4.7 at 54%. An informal single-pass third-party run placed 4.8 in roughly the same range as 4.7 but is not leaderboard-grade.

"But you didn't run it in Claude Code"

The strongest objection to all of this is a good one, so let's meet it head-on. Nobody runs Opus through a bare bash loop in production. They run it inside Claude Code, GPT inside Codex CLI, Gemini inside the Gemini CLI — each shipping native editing primitives the model was trained on and a system prompt tuned for it. DeepSWE strips that away by standardizing on mini-swe-agent, which hands every model the same generic bash tool. So how do we know the harness isn't just handicapping whichever model's native tooling it happens to mismatch?

Datacurve anticipated the question and ran a pilot. On the same ten SWE-Bench Pro tasks, run under both the standardized harness and each model's native one, Opus scored 50% under mini-swe-agent versus 40% under Claude Code; GPT-5.5 scored 40% under mini-swe-agent and 40% under Codex CLI; Gemini 3.1 Pro scored 40% under mini-swe-agent versus 20% under the Gemini CLI. Their conclusion: the standardized harness matches or beats the native ones at comparable token cost, so it is not meaningfully disadvantaging any single family.

ReportedDatacurve DeepSWE

In Datacurve's small harness pilot (10 SWE-Bench Pro tasks), mini-swe-agent matched or beat native harnesses: Opus 50% vs 40% (Claude Code), GPT-5.5 40% vs 40% (Codex CLI), Gemini 3.1 Pro 40% vs 20% (Gemini CLI).

Now the fair caveats: the pilot is real evidence, not a closed case. Ten tasks, measured on SWE-Bench Pro rather than DeepSWE, and Datacurve concedes that part of mini-swe-agent's edge is a system prompt whose workflow maps almost one-to-one onto how the tasks are graded — a reasonable normalization choice, not a settled one. The durable Staff+ lesson stands on its own: the harness is a first-class variable in any agent comparison, and a number quoted without a named harness is half a number.

The eval-industrial complex

Now the part the press releases skip. SWE-Bench Pro — the incumbent board the industry clusters models against — is published by Scale AI, whose business is selling data and evaluation services to the very labs it ranks. An independent benchmark that reshuffles that board is, structurally, a competitive act that will and should invite scrutiny.

ReportedVentureBeat

SWE-Bench Pro is maintained by Scale AI, which also sells evaluation services to the labs it ranks — a structural conflict of interest noted in VentureBeat's coverage.

And here is the symmetry that keeps the analysis honest rather than partisan: Datacurve is also a commercial vendor. It is a 2024 Y Combinator company that sells code-data and evaluation, and a benchmark whose central finding is "the public leaderboards are broken and ours is cleaner" is precisely its product thesis. That does not make the work wrong. It makes the incentive real, on both sides of the table.

So the lesson is not "stop trusting Scale and start trusting Datacurve." It is that the measurement layer of this entire industry is owned by parties with a direct financial stake in the result. A leaderboard from any of them is a marketing surface until someone with no stake reproduces it — the buyer-facing version of the harness-opacity argument gsstk has circled for months: if you cannot see how the number was made, you cannot price it.

What Staff+ engineers and CTOs do on Monday

Translate all of this into procurement reality. Five moves.

First, demote the leaderboard from a decision to a filter. Use it to pick two or three finalists, then run the only evaluation that predicts your outcome — those finalists on your repositories, with your harness, on your tasks. Datacurve says the same in its own conclusions, a point in its favor.

Second, put verifier reliability and contamination-resistance into your evaluation due-diligence, not just headline pass rate. Ask how a benchmark grades correctness and whether its tasks could have leaked into pretraining. A high score on a leaky, brittle grader is worse than no score — it is confidently wrong.

Third, insist on harness provenance. Whenever a vendor quotes a number, the next question is which scaffold produced it. The same model can move ten points across harnesses; a number without that context is not actionable.

Fourth, watch the spread, not the rank. A benchmark where every frontier model clusters within ten points is saturated and is telling you almost nothing. Dispersion is information; compression is the absence of it.

Fifth, reward transparency and demand it uniformly. Datacurve published the full dataset, every agent trajectory, and the evaluation harness on GitHub. That is the correct response to "trust us," and the bar every benchmark — the incumbent emphatically included — should meet. Reproduce before you rely.

The convenient story, retired

The "frontier models are interchangeable" narrative was always too convenient — the version of reality that makes a multi-hundred-billion-dollar capex bet feel safe and reduces model selection to a price negotiation. DeepSWE's real contribution is not a new name atop a chart that will be stale within a quarter. It is the reminder that when you spend real engineering budget on an instrument's output, its calibration is not a footnote — it is the entire game.

The leaderboards measured the models. Almost nobody was measuring the leaderboards. That, not the ranking, is what changed this week. Measure your measurers.

External Sources


This article was human-architected and synthesized with AI assistance under the Hephaestus (AI) persona.


Receive new articles

Subscribe to receive notifications about new articles directly to your email

We won't send spam. You can unsubscribe at any time.