
What Is a Harness, Really? A Regression Tester for LLM Dev Tools
The harness — system prompts, defaults, tool routing, caching — is the hidden product surface of LLM dev tools. Build a regression tester to detect drift.
💡 TL;DR (Too Long; Didn't Read)
Key takeaways in 60 seconds:
- The harness is the orchestration layer between model weights and the user — system prompts, defaults, context compaction, tool routing, caching, redaction, telemetry. Vendors change it without changelogs.
- "Claude got worse" almost never means the weights changed. It means one of seven harness components shifted. The April 2026 Claude Code regression made this distinction publicly legible for the first time.
- Your existing monitoring won't catch it. APM watches latency and errors. Harness drift shows up as tool routing shifts, token economics changes, and retry pattern mutations — none of which are anomalies in any traditional sense.
- Build harness-canary: an 8-scenario, ~320-line Python regression tester that replays a captured corpus, extracts seven observable metrics per scenario, and tiers diffs as 🟢 within noise, 🟡 watch, 🔴 regression.
- Real numbers from a captured corpus: 35 metrics 🟢, 8 🟡, 6 🔴. The three regressions that matter aren't latency: they are tool-call-count inflation, distribution shifts, and retry pattern mutations.
- This piece defines the vocabulary. Part 2 will demand vendor accountability for it.
The Number That Forced a Definition
Six thousand eight hundred and fifty-two sessions. Two hundred thirty-four thousand seven hundred sixty tool calls. AMD ran the audit; their forensic methodology gave the industry, for the first time, a way to talk about what changed inside Claude Code in April 2026 without conflating it with model behavior. The conclusion was unambiguous and, for anyone trying to operate AI dev tools at scale, deeply uncomfortable: the part that broke is the part that has no name in your monitoring stack, no entry in your vendor's changelog, and no place in the mental model most engineering organizations use to think about LLM products.
It needs a name. The ecosystem has been calling it many things — runtime, scaffolding, agent layer, the tool framework. None of these stuck because none of them captured the specific thing that broke. So let's name it precisely, build the tooling to measure it, and treat that as ground floor for everything that follows.
This is Part 1 of a four-piece series on what I will call the harness. Part 2 will argue that vendors owe you a changelog SLA for it. Part 3 will reverse-engineer what it looks like inside Claude Code. Part 4 will argue you should build your own. This piece is the foundation: a definition rigorous enough to support the next three, and a tool you can run today.
If you want the trigger context — what happened, what AMD found, what Anthropic admitted — that lives in a0107. I'll assume it as background here.
What a Harness Is
A harness is everything between the model weights and the user that the vendor controls and can change without touching the weights. That is the working definition. It has seven components, and almost every "the model got worse" complaint you have ever fielded is, statistically, one of them.
The first is the pre-prompt layer — system prompts, persona priming, default instructions injected before your input ever reaches the model. The second is default sampling parameters, including reasoning effort or thinking budget for models that expose those knobs. The third is context compaction — the summarization, truncation, and rolling-window strategies that decide what your model sees when sessions exceed effective context. The fourth is the tool router — the list of tools exposed to the model, the descriptions attached to them, and the heuristics that bias the model toward calling some over others. The fifth is the cache layer — KV cache, prompt cache, session cache, all the machinery that makes the second call cheaper than the first. The sixth is the redaction and safety pipeline, which silently rewrites or refuses content. The seventh is the telemetry channel, which captures user interactions and feeds them back to vendor evaluation pipelines that influence the next iteration of all six other layers.
The model weights are the small box in the middle. They are also, by far, the most expensive thing for the vendor to change — retraining is enormously costly and politically visible. Everything else around them can be adjusted in production with a config rollout. This asymmetry is the entire point. Vendors don't tune the weights when something is wrong. They tune the harness. The harness is the elastic part of the product. The harness is the part that drifts.
Why the Distinction Matters
Three short examples make the category concrete, all from a single seventy-two-hour window in April 2026.
The first was a default reasoning effort downgrade. Anthropic shipped a configuration change that reduced the default thinking budget for Claude Code sessions. No model version changed. No checkpoint was retrained. Users observed degraded multi-step reasoning. The fix was a config rollback. The diagnostic burden — figuring out that this was the cause and not, say, model degradation — fell entirely on customers.
The second was a session-cache bug. A change in how the cache layer hashed conversation context caused stale entries to be served for prompts that should have been fresh. Same model, same prompt, different output. Users reported "Claude is forgetting things." It was not Claude forgetting. It was the cache returning yesterday.
The third was a system-prompt verbosity limiter. A new instruction added to the pre-prompt layer told the model to be terse in certain output paths. The model complied. Output that was previously thorough became truncated. No one had announced that a system-prompt change had shipped, and the symptom — shorter outputs — was indistinguishable from a dozen other possible causes.
Three changes. Three customer-visible regressions. Zero model retrainings. This is the harness category in operation.
The financial incentive is real. The pricing economics in a0085 push vendors toward more aggressive cost controls in the harness — token diet, faster compaction, conservative reasoning defaults. None are abuses. All shift product behavior.
Verified Source: Anthropic news (engineering announcements index). Anthropic publishes engineering announcements through this channel; harness-layer changes have historically appeared either as small notes inside larger releases or, in the April 2026 case, as a dedicated post-mortem.
The Drift Problem
Here is why your existing monitoring will not catch this.
Application performance monitoring — Datadog, New Relic, Honeycomb — is built to detect anomalies in latency, error rates, and throughput. Harness drift is none of those. When the tool router becomes slightly more conservative and the model issues four Read calls before the first Edit instead of two, your latency looks fine. Your error rate looks fine. Your throughput looks fine. The only thing that changed is the structure of how work gets done, and structure is not what APM measures.
Vendor changelogs will not catch it either. Vendors publish model version bumps with fanfare. Harness changes show up, when they show up at all, as "stability improvements," "performance enhancements," or — most often — silence. The April 2026 incident was unusual not because the changes happened but because the post-mortem named all three of them. The norm is: it shipped, you noticed, you complained, you got a non-answer.
Personal A/B testing does not scale. You can re-run a single prompt to see if output quality changed. You cannot do that for fifty workflows on a Tuesday afternoon, and even if you could, you have no baseline to compare against beyond your own memory, which is the worst possible baseline. The measurement gap a0101 describes — engineers feeling fast while being slow — gets worse, not better, when the harness drifts silently underneath them.
What you need is a canary. A small, repeatable, scoped corpus of tasks you re-run on a schedule, with seven metrics extracted per task, a baseline saved on day one, and a tiered diff against that baseline every time you re-run. Build it once, run it weekly, and the harness can no longer drift on you in silence. That is the rest of this article.
Building harness-canary
The tool I will walk you through is small on purpose: roughly three hundred and twenty lines of Python, six modules, eight scenarios, seven metrics, three tiers. It is not an evaluation framework. It is not a benchmark. It is a regression tester — the same shape of tool you write for any system whose internals you cannot inspect but whose behavior you need to keep accountable.
The architecture has six modules with one responsibility each. models.py defines the schemas — every input and every output is a Pydantic model, validated at the boundary. scenarios.py loads canonical task definitions from YAML. runner.py ingests captured transcripts from disk and hands them to extractors.py, which reduces each transcript to a fixed-shape metrics record. comparators.py is where the regression logic lives — it diffs a current run against a saved baseline and assigns a tier to every delta. reporter.py emits Markdown and JSON. A thin cli.py ties it together with two commands: baseline and compare.
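To make the input side of that pipeline concrete, here is a minimal sketch of what a transcript schema could look like. The field names are assumptions, chosen to match the attributes the extractor reads later in this article (`tool_calls`, `session.tokens_in`, `retry_count`, and so on). Standard-library dataclasses stand in for Pydantic models here so the sketch runs without dependencies, and `transcript_from_dict` is a hypothetical helper, not part of the tool as described.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str            # tool name as recorded in the transcript, e.g. "Read"
    duration_ms: float   # duration of this single tool call


@dataclass(frozen=True)
class SessionUsage:
    tokens_in: int
    tokens_out: int


@dataclass(frozen=True)
class Transcript:
    scenario_id: str
    target: str          # label for the tool or build under test (assumed meaning)
    session: SessionUsage
    success: bool
    retry_count: int = 0
    tool_calls: list[ToolCall] = field(default_factory=list)


def transcript_from_dict(raw: dict) -> Transcript:
    """Build a Transcript from a plain dict, e.g. one parsed from a JSON file."""
    return Transcript(
        scenario_id=raw["scenario_id"],
        target=raw["target"],
        session=SessionUsage(**raw["session"]),
        success=bool(raw["success"]),
        retry_count=int(raw.get("retry_count", 0)),
        tool_calls=[ToolCall(**tc) for tc in raw.get("tool_calls", [])],
    )
```

Whatever capture format you actually have, the point of a normalizing step like this is that everything downstream sees one fixed shape.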
The Eight Canonical Scenarios
Scenarios are the load-bearing decision in any regression tester for code-generation tools. They have to be diverse enough to surface routing changes but small enough to maintain. The eight I picked map to the eight things AI dev tools actually do, drawn from the categories that academic benchmarks settled on for solid reasons.
Verified Source: Chen et al., Evaluating Large Language Models Trained on Code (HumanEval). HumanEval established the canonical scenario-based evaluation pattern for code-generation models. The scenarios in harness-canary do not replicate HumanEval; they extend the pattern to multi-tool agentic workflows.
The eight are: a single-function implementation, a bug fix against a failing test, a same-file refactor, a multi-file edit that propagates a parameter, a stack-trace explanation, an error recovery from a deliberately failing tool, a long-context navigation, and a multi-step orchestration that requires four or more tool types.
Each scenario is a small YAML record with a prompt, a success criterion, and an expected tool set. Here is what one looks like:
- id: multi_file_edit
  prompt: "Add a `currency` parameter to OrderRequest, propagate through OrderService and Invoice."
  success_criteria: "Three files modified consistently; tests updated; pytest green."
  expected_tools: [Read, Edit, Bash]
The expected_tools field is not validated against actual usage, because that would over-constrain. It is a sanity check for review and a hint to the comparator about which tool families to track.
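One way to use `expected_tools` as a review aid rather than a hard check is to list what differs and leave the judgment to a human. The sketch below assumes two hypothetical helpers, `unexpected_tools` and `unused_tools`; neither name comes from the tool itself, and the `Grep` entry in the usage example is invented purely for illustration.

```python
def unexpected_tools(expected: list[str], distribution: dict[str, int]) -> dict[str, int]:
    """Tools that were called but are not listed in expected_tools."""
    allowed = set(expected)
    return {name: count for name, count in distribution.items() if name not in allowed}


def unused_tools(expected: list[str], distribution: dict[str, int]) -> list[str]:
    """Tools listed in expected_tools that were never called."""
    return sorted(name for name in expected if distribution.get(name, 0) == 0)


# Illustrative data: the expected set from the YAML record above, and an
# observed distribution that includes one tool outside that set.
expected = ["Read", "Edit", "Bash"]
observed = {"Read": 5, "Edit": 3, "Grep": 2}

review_notes = {
    "unexpected": unexpected_tools(expected, observed),
    "unused": unused_tools(expected, observed),
}
```

Nothing here fails a run. The output is meant to land in the report as a note for whoever reads it.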
The Metrics Layer
Seven metrics, extracted per scenario, structured as Pydantic models. Three of them are obvious: tool_call_count, tokens_in, tokens_out. Two of them are latency percentiles, latency_ms_p50 and latency_ms_p95, computed across the per-tool-call durations within the session. One is a boolean, success, against the scenario's oracle. And the last, the one that matters most for harness drift detection, is tool_distribution, a dictionary mapping tool name to invocation count. The record also carries a retry_count field, which the comparator consumes for its retry-delta rule.
from pydantic import BaseModel


class ScenarioMetrics(BaseModel):
    """Fixed-shape metrics record extracted from one scenario transcript."""
    scenario_id: str
    target: str
    tool_call_count: int
    tokens_in: int
    tokens_out: int
    latency_ms_p50: float  # median per-tool-call duration
    latency_ms_p95: float  # 95th-percentile per-tool-call duration
    success: bool  # result of the scenario's oracle
    tool_distribution: dict[str, int]  # tool name -> invocation count
    retry_count: int
The extractor itself is short. It walks the tool calls in the transcript, computes percentiles using stdlib statistics.quantiles, builds a Counter over tool names, and returns a fully-typed record. The interesting question is not how to compute these metrics but why this particular set.
tool_distribution is the centerpiece because it is the most sensitive observable signal of harness change. When a vendor adjusts the system prompt to encourage more conservative reading before editing, the distribution moves. When the tool router's defaults shift to prefer one tool family over another, the distribution moves. When context compaction triggers earlier and the model makes more lookups to recover, the distribution moves. Latency is noisier; tokens fluctuate with prompt phrasing; success is binary and coarse. Tool distribution is the metric that catches harness drift before any of the others do, because the routing layer is what most harness changes touch. The extractor therefore reduces each transcript to its observable signature in fewer than forty lines:
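from collections import Counter

# Transcript and ScenarioMetrics are assumed to come from models.py;
# _percentile is a small module-level helper.
def extract_metrics(t: Transcript) -> ScenarioMetrics:
    # Per-call durations feed the latency percentiles.
    durations = [tc.duration_ms for tc in t.tool_calls]
    # Tool name -> invocation count, the signal most sensitive to harness drift.
    distribution = Counter(tc.name for tc in t.tool_calls)
    return ScenarioMetrics(
        scenario_id=t.scenario_id, target=t.target,
        tool_call_count=len(t.tool_calls),
        tokens_in=t.session.tokens_in,
        tokens_out=t.session.tokens_out,
        latency_ms_p50=_percentile(durations, 50),
        latency_ms_p95=_percentile(durations, 95),
        success=t.success,
        tool_distribution=dict(distribution),
        retry_count=t.retry_count,
    )
The Tiering Logic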
The comparator does one thing: it produces a DiffEntry for every metric of every scenario, and assigns each entry to one of three tiers. The default thresholds are deliberate, not arbitrary, and they should be the first thing you tune for your own context after two or three baselines.
🟢 within noise is |Δ| < 5% and the success boolean preserved. 🟡 watch is 5% ≤ |Δ| < 15%, or a retry-count delta greater than two, or a tool-distribution shift between five and thirty percentage points. 🔴 regression is |Δ| ≥ 15%, or any drop from success to failure, or a distribution shift of thirty percentage points or more.
Distribution shift is computed as the sum of absolute share differences across the union of tool names, expressed in percentage points. That is the L1 distance between the two share vectors, which is twice the total-variation distance. The tier function for the scalar metrics is the piece of the comparator you will adjust first:
def _tier_from_delta(delta_pct: float, success_drop: bool) -> Tier:
    # A success-to-failure flip is always a regression, whatever the delta.
    if success_drop:
        return "red"
    abs_d = abs(delta_pct)
    # YELLOW_THRESHOLD is the upper edge of the watch band (15% by default).
    if abs_d >= YELLOW_THRESHOLD:
        return "red"
    # GREEN_THRESHOLD is the upper edge of the noise band (5% by default).
    if abs_d >= GREEN_THRESHOLD:
        return "yellow"
    return "green"
Two constants, three checks, three possible tiers. This is the entire policy. You will make it more sophisticated. You will add per-metric thresholds (latency probably deserves a wider noise band than tool counts). You will add hysteresis to avoid flapping. You will integrate it with on-call. None of that belongs in version one.
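The tier function takes a percentage delta as given, so it helps to see where that number comes from. The following self-contained sketch assumes the default thresholds of 5 and 15 percent stated above. `pct_delta` and `tier_scalar` are hypothetical names, and treating a zero baseline as an infinite delta is a choice made for this sketch, not something the article specifies.

```python
from typing import Literal

Tier = Literal["green", "yellow", "red"]

GREEN_THRESHOLD = 5.0    # percent; below this a delta counts as noise
YELLOW_THRESHOLD = 15.0  # percent; at or above this a delta is a regression


def pct_delta(baseline: float, current: float) -> float:
    """Signed percentage change from baseline to current."""
    if baseline == 0:
        # Assumption: any movement off a zero baseline is treated as unbounded.
        return 0.0 if current == 0 else float("inf")
    return (current - baseline) / baseline * 100.0


def tier_scalar(baseline: float, current: float, success_drop: bool = False) -> Tier:
    """Apply the same three-way policy as the comparator's tier function."""
    if success_drop:
        return "red"
    abs_d = abs(pct_delta(baseline, current))
    if abs_d >= YELLOW_THRESHOLD:
        return "red"
    if abs_d >= GREEN_THRESHOLD:
        return "yellow"
    return "green"


examples = {
    "call count 7 -> 9": tier_scalar(7, 9),                 # about +28.6%
    "tokens_out 1100 -> 1240": tier_scalar(1100, 1240),     # about +12.7%
    "tokens_in 1000 -> 1030": tier_scalar(1000, 1030),      # +3.0%
}
```

Because the delta is taken as an absolute value, a large improvement trips the same tier as a large regression. That is deliberate in a drift detector: any big movement deserves a look.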
Putting It Together
Two commands. The first establishes a baseline from a transcript directory; the second compares a current transcript directory against that baseline and emits a tiered report:
$ canary baseline transcripts/baseline --out baseline.json
wrote 8 scenarios to baseline.json
$ canary compare baseline.json transcripts/current
diff: 🟢 35 · 🟡 8 · 🔴 6
wrote report.md and report.json
Roughly three hundred and twenty lines across six modules. Point it at your own captured transcripts and you have a regression harness for the harness.
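The article does not say which CLI library cli.py uses, so the following is only a sketch of how the two commands could be wired with the standard library's `argparse`. The argument names (`transcripts`, `baseline_file`) and the default for `--out` are assumptions inferred from the invocations shown above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two subcommands shown above: baseline and compare."""
    parser = argparse.ArgumentParser(prog="canary")
    sub = parser.add_subparsers(dest="command", required=True)

    base = sub.add_parser("baseline", help="extract metrics and save them as a baseline")
    base.add_argument("transcripts", help="directory of captured transcripts")
    base.add_argument("--out", default="baseline.json", help="where to write the baseline")

    comp = sub.add_parser("compare", help="diff a current run against a saved baseline")
    comp.add_argument("baseline_file", help="baseline JSON written by the baseline command")
    comp.add_argument("transcripts", help="directory of current transcripts")
    return parser


# Parse the same invocations shown in the shell session above.
baseline_args = build_parser().parse_args(
    ["baseline", "transcripts/baseline", "--out", "baseline.json"]
)
compare_args = build_parser().parse_args(
    ["compare", "baseline.json", "transcripts/current"]
)
```

Keeping the CLI this thin means the weekly scheduled run is a two-line shell script, and all the policy stays in the comparator.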
Running It Against a Captured Corpus
The numbers in this section come from a captured transcript corpus, not from a live Claude Code session. The distinction matters for honesty: harness-canary is meant to run against your live transcripts, but for an article that needs to be reproducible by readers without Anthropic API access, I built a structurally realistic corpus — eight baseline transcripts and eight current transcripts, each shaped like a real Claude Code session — and ran the tool against that. Every number you see below came out of the comparator. The drift in the current corpus is intentional and known, designed to demonstrate four detection patterns.
Top-line: 🟢 35 · 🟡 8 · 🔴 6. Across forty-nine metric comparisons, six rose to regression. None of those six are latency. That fact alone is the lesson.
| Scenario | Metric | Baseline | Current | Δ% | Tier |
|---|---|---|---|---|---|
| multi_file_edit | tool_call_count | 7 | 9 | +28.6% | 🔴 |
| multi_file_edit | tokens_in | 11200 | 13800 | +23.2% | 🔴 |
| multi_file_edit | tool_distribution | Read:3, Edit:3, Bash:1 | Read:5, Edit:3, Bash:1 | +25.4% | 🟡 |
| tool_orchestration | tool_call_count | 6 | 8 | +33.3% | 🔴 |
| tool_orchestration | tokens_in | 12500 | 14800 | +18.4% | 🔴 |
| tool_orchestration | tool_distribution | Read:2, Edit:2, Write:1, Bash:1 | Read:3, Edit:2, Write:1, Bash:2 | +25.0% | 🟡 |
| error_recovery | tool_distribution | Read:2, Edit:2, Bash:2 | Read:2, Edit:3, Bash:4 | +22.2% | 🟡 |
| error_recovery | retry_count | 1 | 4 | — | 🟡 |
| bug_fix | tokens_out | 1100 | 1240 | +12.7% | 🟡 |
Look at multi_file_edit. The tool count went from seven to nine. Tokens in jumped twenty-three percent. The distribution shifted by twenty-five points: the model now issues five Read calls before editing, where it used to issue three. There is one more anomaly worth pausing on: the latency_ms_p50 for that scenario reads as negative eighty-seven percent in the full report. That looks alarming. It is not. The new fast Read calls dragged the median down even as the total session got slower. A monitoring stack that watched only median latency would have flagged this scenario as having improved. That is the trap. Tool distribution caught the truth that latency hid.
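The median effect is easy to reproduce with a toy example. The per-call durations below are invented for illustration and are not taken from the corpus. They only mirror the shape of the scenario: three fast Read calls in the baseline, five in the current run, and the same four slower calls in both.

```python
import statistics

# Invented durations in milliseconds: fast Read calls first, then slower Edit/Bash calls.
baseline_ms = [100, 110, 120, 1100, 1200, 1500, 4000]           # 7 calls, 3 fast Reads
current_ms = [100, 110, 120, 130, 140, 1100, 1200, 1500, 4000]  # 9 calls, 5 fast Reads


def pct_delta(baseline: float, current: float) -> float:
    """Signed percentage change from baseline to current."""
    return (current - baseline) / baseline * 100.0


median_delta = pct_delta(statistics.median(baseline_ms), statistics.median(current_ms))
total_delta = pct_delta(sum(baseline_ms), sum(current_ms))

# The median falls from a slow call (1100 ms) to a fast Read (140 ms),
# while the total time spent in tool calls goes up.
summary = {
    "median_delta_pct": round(median_delta, 1),
    "total_delta_pct": round(total_delta, 1),
}
```

Two extra cheap calls are enough to move the middle of the sorted list from a slow call to a fast one. The median reports a dramatic improvement while the session as a whole takes longer.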
tool_orchestration shows the call-count inflation pattern in its purest form: six tool calls became eight to accomplish the same task. Tokens followed. This is what a more conservative router looks like from outside the harness — every step gets verified before the next one starts.
error_recovery is subtler. The tool count rose from six to nine, and the distribution drift is concentrated in Bash: two invocations became four. Pair that with the retry-count change (one to four) and a pattern emerges: the harness is now running test commands more eagerly between edits, recovering more visibly but at higher cost. Eligible for a 🟡 watch in your weekly review, not a page at 2 AM.
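The three tool_distribution deltas in the table can be reproduced from the raw counts. This sketch implements the shift as the sum of absolute share differences in percentage points, and tiers it with the five-point and thirty-point bounds from the tiering section. The function names are hypothetical, and the handling of an empty distribution is an assumption.

```python
def _shares(dist: dict[str, int]) -> dict[str, float]:
    """Convert invocation counts to fractional shares (all zero if there were no calls)."""
    total = sum(dist.values())
    return {name: (count / total if total else 0.0) for name, count in dist.items()}


def distribution_shift_pp(baseline: dict[str, int], current: dict[str, int]) -> float:
    """Sum of absolute share differences over the union of tools, in percentage points."""
    b, c = _shares(baseline), _shares(current)
    tools = set(b) | set(c)
    return 100.0 * sum(abs(b.get(t, 0.0) - c.get(t, 0.0)) for t in tools)


def tier_distribution(shift_pp: float) -> str:
    """Below 5 points is noise, 5 to under 30 is watch, 30 or more is a regression."""
    if shift_pp >= 30.0:
        return "red"
    if shift_pp >= 5.0:
        return "yellow"
    return "green"


# Raw counts copied from the table above.
rows = {
    "multi_file_edit": ({"Read": 3, "Edit": 3, "Bash": 1}, {"Read": 5, "Edit": 3, "Bash": 1}),
    "tool_orchestration": (
        {"Read": 2, "Edit": 2, "Write": 1, "Bash": 1},
        {"Read": 3, "Edit": 2, "Write": 1, "Bash": 2},
    ),
    "error_recovery": ({"Read": 2, "Edit": 2, "Bash": 2}, {"Read": 2, "Edit": 3, "Bash": 4}),
}
shifts = {name: round(distribution_shift_pp(b, c), 1) for name, (b, c) in rows.items()}
```

Note that a tool whose count did not change can still contribute to the shift. In multi_file_edit, Edit stays at three calls but its share drops from three of seven to three of nine.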
bug_fix is the lightest signal: nearly thirteen percent more output tokens, no other movement. By itself, normal phrasing variation. In context with the other three patterns, it fits a story: this build is, on average, more verbose and more conservative. None of that is a bug. All of it is a product decision you were not told about.
What This Gives You, What Comes Next
Four things change once you have a working canary. You gain observability you own, independent of vendor telemetry, that will outlast any single dev tool you depend on. You gain evidence for vendor conversations — when the next regression debate happens on internal Slack, you can produce a tiered diff instead of a feeling. You gain a comparison primitive that generalizes: nothing in harness-canary is specific to Claude Code. The pipeline runs against any transcript format you can normalize, which means you can baseline Cursor, Cline, Aider, Continue, or your own internal agentic tools the same way. And finally you gain a defense against silent regression — the harness can no longer drift on you in a category your monitoring does not see.
There are three things this tool does not give you, and each one frames the next piece in this series. It does not look inside the harness. The metrics are observable from the edge, and that is honest engineering — but the model of what is causing each shift is inferred, not observed. Reverse-engineering what is actually inside the orchestration layer is the subject of Part 3 of this series, written by Daedalus. It does not give you accountability. Detecting drift is not the same as having a vendor commit to disclosing it. The argument that vendors owe you a published changelog SLA — what shipped, what changed, what the boundary is between weights and harness — is the subject of Part 2, written by Icarus. And it does not give you sovereignty. The case for owning your own harness instead of consuming a vendor's, made through the same lens of observability, accountability, and operational risk, is the subject of the closing manifesto by Prometheus.
The pattern this whole series rests on is not new. The argument I made in a0097 — that durable execution made reliability into a first-class concern instead of plumbing — is structurally the same argument here. Harness observability turns vendor opacity into a measurable, contestable concern. You name the layer. You measure the layer. You argue about the layer.
Part 1 names it. Part 2 makes the argument.
This article was human-architected and synthesized with AI assistance under the Athena (AI) persona.
External Sources
- HumanEval: Evaluating Large Language Models Trained on Code (Chen et al., arXiv:2107.03374). Canonical scenario-based eval for code generation models.
- SWE-bench: real-world GitHub issue-and-PR evaluation harness.
- Anthropic news, engineering announcements: channel where Anthropic publishes harness-layer changes when they publish them at all.
- Pydantic documentation: schema validation library used throughout harness-canary.
Related Reading on gsstk
- a0107 — The Claude Code Shrinkflation: 234,760 Tool Calls That Forced a $380B Apology — the trigger event for this series
- a0101 — The Productivity Lie: Why Your AI Tools Make You Feel Fast — But Actually Make You Slow — the measurement gap that harness opacity creates
- a0085 — The Flagship Tax Is Dead: How 72 Hours and Two 'Mid-Tier' Models Killed the $75/MTok Premium — vendor pricing economics that pressure harness-layer changes
- a0097 — You're Still Writing Retry Logic in 2026. Netflix Stopped Years Ago. — adjacent infrastructure primitive making operational concerns first-class
- a0098 — The Alignment Tax: ASI09 & ASI10 — Your Agent IS the Threat — Athena's prior framework piece on observability of agentic systems