The Compiler vs The Browser: Two Armies of AI Agents Walk Into a Codebase


Anthropic's 16 Claude agents built a C compiler. Cursor's hundreds built a browser. A deep teardown of two blueprints for autonomous software development.

Human-architected research synthesized with the assistance of AI personas.
25 min read


💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 60 seconds:

  • Anthropic tasked 16 Claude Opus 4.6 agents with building a C compiler from scratch in Rust. Zero dependencies. 100K+ lines of code. 99% pass rate on the GCC torture test suite. The Linux kernel boots. Doom runs. Cost: $20,000.
  • Cursor pointed hundreds of GPT-5.2 agents at building a web browser from scratch. 1M+ lines of Rust. 3,342 commits. It renders static pages. GitHub Actions CI has an 88% failure rate.
  • Two radically different coordination architectures: Anthropic used a flat git-based swarm with no orchestrator; the test suite is the orchestrator. Cursor evolved through three architectures before settling on a hierarchical planner/worker/judge pipeline.
  • The same model that builds also breaks: On the same day, Anthropic published that Claude found 500+ zero-days in well-tested open source projects. The capabilities are dual-use.
  • The harness matters more than the model. Anthropic got measurably better results with 16 agents and a great test suite than Cursor got with hundreds of agents and a hierarchical planner.
  • Bottom line: We're watching the birth of a new engineering discipline: the systematic design of harnesses, test suites, and coordination protocols for autonomous code-generating agents. Tests are the new moat. Architecture is the new code.

1. The Setup: Two Impossible Projects

In January 2026, Cursor pointed hundreds of GPT-5.2 agents at an ambitious goal: build a web browser from scratch. Three weeks later, they had FastRender: over a million lines of Rust, 3,342 commits, and a rendering engine that could sort of display real web pages.

Three weeks after that, Anthropic published a response that nobody expected. Nicholas Carlini, a researcher on their Safeguards team, had tasked 16 Claude Opus 4.6 instances with building a C compiler. From scratch. In Rust. Capable of compiling the Linux kernel. The result: 100,000 lines of code, a 99% pass rate on the GCC torture test suite, and a $20,000 API bill.

Two companies. Two impossible projects. Two radically different architectures for coordinating autonomous agents. And buried in the technical details of each experiment lies a blueprint for how software will be built for the next decade.

This is not a puff piece. We're going to tear both projects apart.


2. The Problem Both Projects Are Actually Solving

Let's get one thing straight: neither Anthropic nor Cursor actually needed a C compiler or a web browser. These are research vehicles. The real product is the harness: the coordination infrastructure that lets multiple AI agents work on a shared codebase without stepping on each other, drifting off-task, or producing garbage.

The fundamental challenge is this: a single LLM agent can close a ticket, fix a bug, maybe implement a small feature. But it stalls on anything that takes more than a few hours of sustained work. Context drifts. The model forgets earlier decisions. It becomes risk-averse, choosing safe refactors over gnarly architectural changes. We covered the architectural and security risks of this paradigm in The Agentic CLI Takeover, but even that analysis didn't anticipate the scale we're now seeing.

The obvious solution, running more agents in parallel, introduces a new class of problems that will feel familiar to anyone who's ever debugged a distributed system: race conditions, duplicated work, conflicting merges, and the coordination overhead that turns 16 agents into the effective throughput of two.

Both projects are stress tests for different answers to that coordination problem. And the answers they arrived at are fascinatingly different.


3. Architecture Deep Dive: Anthropic's Flat Swarm

Carlini's approach is almost aggressively simple. Here's the core loop:

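```bash
#!/bin/bash

# Restart a fresh Claude Code session as soon as the previous one exits.
while true; do
    # Name each session's log after the commit it started from.
    COMMIT=$(git rev-parse --short=6 HEAD)
    LOGFILE="agent_logs/agent_${COMMIT}.log"

    claude --dangerously-skip-permissions \
           -p "$(cat AGENT_PROMPT.md)" \
           --model claude-opus-X-Y &> "$LOGFILE"
done
```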

That's it. A while true loop that spawns a Claude Code session, lets it work until it's done, and then immediately spawns another. It's a pattern we explored in The Agentic Singularity, where we unpacked the death of the chatbox and the rise of autonomous loops. Each of the 16 agents runs in its own Docker container with the repo mounted from a bare upstream. When an agent finishes, it pulls, merges, pushes, and the cycle repeats.

There is no orchestrator. No planner. No hierarchy. Each agent independently decides what to work on next by looking at the codebase, reading a shared current_tasks/ directory, and picking the "next most obvious" problem.

Synchronization is handled by git itself. An agent "locks" a task by writing a text file to current_tasks/. If two agents try to claim the same task simultaneously, git's merge conflict resolution forces the second agent to pick something else. It's primitive, and Carlini admits as much, but it works because the problem decomposes naturally into many independent sub-tasks. (For those concerned about using git as an agent coordination layer, our analysis in The MCP Git Wake-Up Call remains relevant: any workflow where agents push to shared repos is an attack surface.)
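The claim-or-move-on behavior can be sketched locally. This is only an analogue: in Carlini's setup the contention surfaces through git, whereas the sketch below substitutes exclusive file creation on a single filesystem. The function names, the `.lock` suffix, and the task names are hypothetical.

```python
import os
import tempfile
from typing import List, Optional


def claim_task(tasks_dir: str, task_name: str, agent_id: str) -> bool:
    """Try to claim a task by creating its lock file; return True on success."""
    path = os.path.join(tasks_dir, f"{task_name}.lock")
    try:
        # O_EXCL makes creation fail if the file already exists, so only one
        # claimant can win (a stand-in for git rejecting the second writer).
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(agent_id + "\n")
    return True


def pick_task(tasks_dir: str, candidates: List[str], agent_id: str) -> Optional[str]:
    """Claim the first unclaimed candidate, skipping tasks others already hold."""
    for task_name in candidates:
        if claim_task(tasks_dir, task_name, agent_id):
            return task_name
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tasks_dir:
        # Hypothetical task names, for illustration only.
        todo = ["fix_switch_codegen", "parse_designated_initializers"]
        for agent in ("agent-1", "agent-2", "agent-3"):
            print(agent, "->", pick_task(tasks_dir, todo, agent))
```

The third agent gets `None` because both tasks are taken, which is the "pick something else" case: the loser of a race does not block, it simply moves on.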

The key insight: the test suite is the orchestrator. Carlini invested enormous effort in building an extremely high-quality testing harness. The compiler's progress is measured entirely by which tests pass. Each agent can independently identify regressions, pick a failing test to fix, and verify its work, all without talking to any other agent.

3.1 Where the Flat Model Breaks Down

The flat model hit a wall when agents started compiling the Linux kernel. Unlike a test suite with hundreds of independent tests, compiling Linux is one monolithic task. Every agent would hit the same bug, produce the same fix, and overwrite each other's changes. Having 16 agents running was no better than having one.

Carlini's fix was clever: use GCC as an oracle. A new test harness randomly compiled most kernel files using GCC, and only the remaining subset with Claude's compiler. If the kernel booted, the problem wasn't in Claude's subset. If it broke, binary search could narrow down which file caused the failure. This turned one giant task back into many independent tasks, one per file, and parallelism resumed.
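The narrowing step can be sketched as a plain bisection. The sketch assumes exactly one culprit file and a deterministic oracle; `boots` is a hypothetical callback that returns True when the kernel boots with the given files built by the new compiler and everything else built by GCC. The real harness is not described at this level of detail, so treat this as an illustration of the idea only.

```python
from typing import Callable, List, Sequence


def find_bad_file(files: Sequence[str], boots: Callable[[List[str]], bool]) -> str:
    """Return the single file whose miscompilation breaks the boot.

    Assumes exactly one culprit and a deterministic `boots` oracle.
    """
    suspects = list(files)
    while len(suspects) > 1:
        middle = len(suspects) // 2
        half = suspects[:middle]
        # Build only `half` with the new compiler; everything else uses GCC.
        if boots(half):
            # The boot succeeded, so the culprit must be in the other half.
            suspects = suspects[middle:]
        else:
            suspects = half
    return suspects[0]


if __name__ == "__main__":
    kernel_files = [f"file_{i}.c" for i in range(16)]  # hypothetical names
    culprit = "file_11.c"
    attempts = []

    def fake_boots(subset: List[str]) -> bool:
        # Stand-in oracle: the "kernel" boots unless the culprit is in the subset.
        attempts.append(subset)
        return culprit not in subset

    print(find_bad_file(kernel_files, fake_boots), "found in", len(attempts), "boots")
```

With several interacting culprits, bisection is not enough and a full delta-debugging procedure would be needed. The single-culprit case is shown because it is the simplest correct instance.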

This is a textbook example of delta debugging applied to agent coordination, and it's the kind of trick that only works when you have a reliable reference implementation to compare against.

3.2 The Compiler Architecture Itself

The compiler's internal architecture deserves scrutiny because it reveals what Claude can and cannot design autonomously.

The pipeline follows a textbook structure: C source → lexer → parser → AST → semantic analysis → SSA-based intermediate representation → optimization passes → architecture-specific code generation → assembler → linker → ELF binary. Carlini specified that the compiler should use an SSA IR to enable optimization passes, but didn't specify how to implement it. Claude designed the rest.

It targets four architectures: x86-64, i686, AArch64, and RISC-V 64. Each gets its own code generator with architecture-specific peephole optimizers. The x86 backend handles 80-bit extended precision via x87 FPU instructions. The ARM and RISC-V backends support IEEE binary128 long double via soft-float library calls. All of this was implemented without external dependencies, not even a parser generator. The lexer, parser, and everything downstream is hand-rolled Rust. (The irony of building a C compiler in a memory-safe language is not lost on anyone who followed The Memory Ultimatum debate.)

The specialization of agents is worth noting. Carlini didn't just throw 16 identical agents at the problem. He assigned roles:

  • Several agents worked on the core compiler (parsing, codegen, optimization)
  • One agent was dedicated to coalescing duplicate code, which is critical since LLM-generated code frequently reimplements existing functionality
  • One agent focused on compiler performance (making the compiler itself faster)
  • One agent optimized the generated code quality (making the compiled output more efficient)
  • One agent acted as a Rust design critic, restructuring the project idiomatically
  • One agent maintained documentation

This role specialization is reminiscent of how real compiler teams operate. GCC has maintainers for specific backends, frontend experts, optimization pass owners, and documentation teams. Carlini's role assignments mirror that division of labor, albeit at a much smaller scale.


4. Architecture Deep Dive: Cursor's Hierarchical Pipeline

Cursor's approach, documented in Wilson Lin's blog post, went through multiple evolutionary stages before arriving at what works. The journey is instructive.

Stage 1: Flat self-coordination (failed). Agents self-coordinated through a shared file. Each agent would check what others were doing, claim a task, and update its status. This failed spectacularly. Agents held locks too long, forgot to release them, or updated the coordination file without acquiring a lock at all. Twenty agents degraded to the effective throughput of two or three.

Stage 2: Optimistic concurrency (still failed). They replaced locks with optimistic concurrency control: agents read state freely, but writes fail if the state changed since the last read. Simpler, but a deeper problem emerged: with no hierarchy, agents became risk-averse. No agent would take ownership of hard problems. The system churned without making progress.
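For readers unfamiliar with optimistic concurrency control, a minimal in-process sketch follows. It is not Cursor's implementation, which is not public at this level of detail. The class and function names are hypothetical, and a version counter stands in for whatever change detection their coordination state used.

```python
import threading
from typing import Any, Dict, Tuple


class VersionedState:
    """Shared state where a write succeeds only against the version it read."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._version = 0
        # Guards only the short compare-and-swap, not the agent's whole task.
        self._guard = threading.Lock()

    def read(self) -> Tuple[int, Dict[str, Any]]:
        with self._guard:
            return self._version, dict(self._data)

    def write(self, expected_version: int, updates: Dict[str, Any]) -> bool:
        with self._guard:
            if self._version != expected_version:
                return False  # someone else wrote first; the caller must re-read
            self._data.update(updates)
            self._version += 1
            return True


def claim(state: VersionedState, task: str, agent: str, retries: int = 3) -> bool:
    """Try to record `agent` as the owner of `task`, retrying on conflicts."""
    for _ in range(retries):
        version, snapshot = state.read()
        if task in snapshot:
            return False  # already owned by another agent
        if state.write(version, {task: agent}):
            return True
    return False
```

Nobody holds a lock while working, so a forgotten release cannot stall the system. The cost is that a stale writer has to re-read and retry.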

Stage 3: Planners and Workers (this worked). The breakthrough was separating roles:

Planners continuously explore the codebase, assess what needs to be done, and create task definitions. They can spawn sub-planners for specific areas, making planning itself parallel and recursive. Workers pick up tasks and focus entirely on completing them; they don't coordinate with other workers or worry about the big picture.

At the end of each cycle, a judge agent determines whether to continue or restart fresh. This periodically clears accumulated drift.
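The shape of one planner/worker/judge cycle can be sketched as a toy, single-threaded simulation. Everything here is an assumption made for illustration: the area tree, the failure marker, and the success threshold the judge uses are hypothetical, and real workers would be LLM agents running in parallel.

```python
from collections import deque
from typing import Deque, Dict, List


def plan(area: str, subareas: Dict[str, List[str]]) -> List[str]:
    """Planner: expand an area into leaf tasks, recursing like sub-planners."""
    children = subareas.get(area, [])
    if not children:
        return [area]
    tasks: List[str] = []
    for child in children:
        tasks.extend(plan(child, subareas))
    return tasks


def work(task: str) -> bool:
    """Worker: stand-in for an agent attempting one task in isolation."""
    return not task.endswith("!")  # hypothetical failure marker for the demo


def judge(results: Dict[str, bool], min_success: float = 0.5) -> bool:
    """Judge: continue only if enough tasks succeeded (threshold is assumed)."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_success


def run_cycle(root: str, subareas: Dict[str, List[str]]) -> bool:
    """Run one cycle; True means continue, False means restart fresh."""
    queue: Deque[str] = deque(plan(root, subareas))
    results: Dict[str, bool] = {}
    while queue:
        task = queue.popleft()
        results[task] = work(task)  # workers never consult each other
    return judge(results)
```

The point of the structure is that only `plan` sees the whole tree. Each `work` call sees a single task.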

4.1 Where the Hierarchical Model Breaks Down

FastRender produced over a million lines of Rust. It has 3,342 commits. The GitHub Actions CI has an 88% job failure rate. When the project was first published, multiple independent developers couldn't get it to compile. Some managed after bug fixes and revised build instructions, but the codebase remained fundamentally unstable.

The Register's reporting captured the skepticism well: Jason Gorman of Codemanship called it "indicative of a code base that doesn't work." Critics also pointed out that FastRender uses Servo's HTML parser, Servo's CSS parser, and QuickJS for JavaScript, though Wilson Lin pushed back that the DOM, paint systems, text pipeline, and chrome were developed as original code. Icarus's skeptical take on vibe coding from a year ago is looking increasingly prescient.

The deeper issue is that a browser rendering engine's expected behavior is enormously complex and largely implicit: it is defined by millions of edge cases accumulated over decades of web compatibility expectations. Unlike a compiler, where you have formal grammars and well-defined test suites, "does this website render correctly?" is a judgement call that's very hard to automate.

4.2 The Browser Architecture Itself

FastRender's codebase reveals a genuinely ambitious attempt at a modern browser engine:

  • DOM: Shadow DOM support, live DOM for JavaScript mutation
  • CSS Pipeline: Parsing, selectors, cascade with layers and scoping, calc() system
  • Layout: Block, inline, flex, grid, table, and positioned contexts
  • Text Shaping: UAX #29, UAX #14, UAX #9, OpenType GSUB/GPOS, multiple rasterization paths
  • Security: Multi-process with seccomp-bpf (Linux), Seatbelt (macOS), AppContainer (Windows)
  • IPC: Shared memory frame buffers
  • Shell: egui-based desktop browser with AccessKit accessibility

That's an extraordinary amount of infrastructure for a weeks-long project. For context, Servo, Mozilla's parallel browser engine written in Rust by dozens of engineers over years, implements a similar scope. The difference is that Servo's implementation is battle-tested against the full Web Platform Tests suite.

The dependency question is critical. FastRender uses html5ever (Servo's HTML parser), cssparser (Servo's CSS tokenizer), QuickJS, Taffy (CSS flexbox/grid layout), HarfBuzz, and Skia. Lin acknowledged that some dependencies like Taffy "felt like they might go against the from-scratch goals of the project, since that library implements CSS flexbox and grid layout algorithms directly. This was not an intended outcome." The agents pulled in dependencies that a human architect would have flagged.

This is itself an important finding: agents optimize for task completion, not architectural purity. When an agent needs flexbox layout, it will reach for the library that solves the immediate problem, not roll its own implementation from the spec. Without explicit constraints against dependency use (like Carlini's clean-room restriction), agents will take shortcuts.


5. The Numbers: A Technical Comparison

| Dimension | Anthropic (Compiler) | Cursor (FastRender) |
| --- | --- | --- |
| Lines of Code | ~100K (later: ~186K) | ~1,000,000+ |
| Language | Rust | Rust |
| Agent Count | 16 | Hundreds |
| Duration | ~2 weeks | ~1 week (initial) |
| Commits/Sessions | ~2,000 sessions | 3,342 commits |
| Token Consumption | 2B in + 140M out | ~3B+ (estimated) |
| Cost | ~$20,000 | Not disclosed |
| Primary Model | Claude Opus 4.6 | GPT-5.2 |
| Coordination | Flat (git-based) | Hierarchical |
| Dependencies | Zero (stdlib only) | Servo, QuickJS, Taffy, etc. |
| Internet Access | None (clean-room) | Not specified |
| CI Pass Rate | 99% (GCC torture) | ~12% (GitHub Actions) |
| Validation | Linux boots, Doom runs | Renders GitHub, CNN |

5.1 On Lines of Code

Anthropic's 100,000 lines (or ~186,000 as independently verified) with zero external dependencies is a far more impressive per-line achievement than FastRender's million-plus lines with significant dependency use. A compiler that depends only on Rust's stdlib has to implement everything: the lexer, parser, type checker, SSA-based IR, optimization passes, code generators for four architectures, the assembler, the linker, and DWARF debug info generation. Every line is load-bearing.

FastRender's million lines include substantial generated code, integration glue, and (per community analysis) code that was written and then overwritten multiple times by different agents. The raw LoC number is not a good proxy for complexity or quality.

5.2 On Clean-Room vs Dependencies

Anthropic's clean-room claim is particularly significant. Claude had no internet access during development. It depended only on the Rust standard library. This means Claude is working entirely from its training data and the feedback loop of the test suite.

Some observers have raised concerns about "code laundering": whether an LLM trained on GPL code (like GCC) can produce a permissively-licensed clean-room reimplementation. The clean-room implementation defense has precedent in copyright law; it's how Compaq reverse-engineered the IBM PC BIOS in the 1980s. But the analogy is imperfect. A human in a clean room works from a specification they read and understood. An LLM works from statistical patterns absorbed from its training data, which may include the very code it's reimplementing. No court has ruled on this. Until one does, every AI-generated "clean-room" implementation carries latent legal risk.

5.3 On Validation

This is where Anthropic's project pulls decisively ahead. The C compiler has objectively verifiable outputs:

  • 99% pass rate on the GCC torture test suite
  • PostgreSQL compiles and passes all 237 regression tests
  • SQLite, Redis, Lua, libsodium, jq, mbedTLS, musl, TCC all compile and pass their tests
  • 150+ additional projects including FFmpeg (7,331 FATE tests), GNU coreutils, Busybox, CPython, QEMU, LuaJIT
  • Linux 6.9 boots on x86, ARM, and RISC-V
  • Doom runs

FastRender can render static pages (GitHub, Wikipedia, CNN) without JavaScript. The JavaScript engine is "experimental and incomplete." Independent developers reported difficulty compiling the project at all.

The fundamental asymmetry: compilers produce binaries that either work or don't. Browsers produce pixels that are subjectively "close enough" or "not quite right." Anthropic chose a problem with a clean success metric. Cursor chose a problem with a fuzzy one.


6. The Coordination Problem: What We Actually Learned

6.1 Anthropic's Contribution: Tests As Orchestrators

The most transferable insight from Carlini's work is that a sufficiently good test suite can replace an orchestration agent entirely. When every agent can independently verify its own work against an objective standard, you don't need planners, judges, or hierarchies. You need better tests.

Carlini articulated practical constraints for designing LLM-friendly test harnesses:

Context window pollution. Tests shouldn't dump thousands of lines of output. Print a few lines, log the rest to a file. Pre-compute summary statistics so the agent doesn't have to.

Time blindness. LLMs can't tell time. Left alone, they'll spend hours running the full test suite instead of making progress. Build a --fast mode that runs a deterministic subsample, different per agent, so collective coverage is maintained while each agent can quickly identify regressions.

Orientation overhead. Each agent starts with zero context. Maintain extensive READMEs and progress files. Update them frequently. Make the agent update them too.
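Two of these constraints lend themselves to a short sketch: a deterministic, per-agent test subsample and a few-line summary in place of raw output. Hashing the agent ID together with the test name is one way to get that behavior, not necessarily how Carlini's harness does it. The function names, the 10% default, and the summary format are assumptions.

```python
import hashlib
from typing import Dict, List


def fast_subset(tests: List[str], agent_id: str, fraction: float = 0.1) -> List[str]:
    """Pick a deterministic, agent-specific sample of tests.

    The same (agent_id, test) pair always hashes to the same value, so reruns
    are comparable, while different agents get different samples.
    """
    threshold = fraction * 2**64
    chosen = []
    for name in tests:
        digest = hashlib.sha256(f"{agent_id}:{name}".encode()).digest()
        # Interpret the first 8 bytes as an integer in [0, 2**64).
        if int.from_bytes(digest[:8], "big") < threshold:
            chosen.append(name)
    return chosen


def summarize(results: Dict[str, bool], max_listed: int = 5) -> str:
    """Render a few-line summary instead of dumping full test output."""
    failed = sorted(name for name, ok in results.items() if not ok)
    lines = [f"passed {len(results) - len(failed)}/{len(results)}"]
    lines += [f"FAIL {name}" for name in failed[:max_listed]]
    if len(failed) > max_listed:
        lines.append(f"... {len(failed) - max_listed} more failures in the log file")
    return "\n".join(lines)
```

Because selection depends only on the hash, no agent needs to know what the others are running, which keeps the scheme coordination-free.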

6.2 Cursor's Contribution: Hierarchy Emerges Naturally

Cursor's journey from flat coordination to hierarchical planner-worker architecture mirrors decades of distributed systems research. The lesson isn't that hierarchy is always better; it's that agents exhibit the same pathologies as human teams when you give them the wrong structure.

One fascinating finding: model choice matters more for planning than for coding. GPT-5.2 is a better planner than GPT-5.1-Codex, even though the latter is specifically trained for code. This suggests that long-range coherence and instruction following, not raw coding ability, are the bottleneck in multi-agent systems.

Another key finding: removing complexity helped more than adding it. Cursor initially built an integrator role for quality control, but it created more bottlenecks than it solved. The workers were already capable of handling conflicts themselves.


7. The Elephant in the Room: Does Any of This Code Actually Work?

Let's be honest about what "works" means here.

Anthropic's compiler works. Not as a drop-in replacement for GCC; Carlini is transparent about the limitations. The generated code is less efficient than GCC with optimizations disabled. It can't produce a 16-bit bootloader small enough for x86 real mode. The Rust code quality is "reasonable but nowhere near what an expert Rust programmer might produce." But it compiles real programs that produce correct output across four architectures. That's an extraordinary achievement.

Cursor's browser sort of works. It renders static pages to a usable degree without JavaScript. That's impressive for a few weeks of work, but it's also approximately where Servo was years ago, and Servo had full-time human engineering teams. The 88% CI failure rate and the difficulty community members had compiling it undermine the "works" claim significantly.

But here's the thing neither company will say out loud: both codebases are unmaintainable throwaway code. A 186,000-line compiler with no human review is not something you deploy to production. A million-line browser whose CI jobs fail 88% of the time is not something you ship. These are proof-of-concept demonstrations, not products.

And that's fine. The value is in the methodology, not the artifact.


8. The Other Side of the Coin: The Same Model That Builds Also Breaks

Here's a detail most coverage has missed: on the exact same day Anthropic published the compiler blog post, the same author, Nicholas Carlini, co-published a separate paper on Anthropic's red team site. The title: "Evaluating and mitigating the growing risk of LLM-discovered 0-days."

The setup was structurally identical to the compiler experiment. Put Opus 4.6 inside a VM with standard tools. Give it access to open source codebases. No custom harness. Just let it work.

The result: over 500 validated high-severity vulnerabilities in well-tested open source projects, codebases that have had fuzzers running against them for years, accumulating millions of hours of CPU time.

The way Claude found these bugs is what matters. Traditional fuzzers throw random inputs at code to see what crashes. Claude reads the code. It examines git commit history to find security fixes, then looks for similar unpatched patterns elsewhere. In GhostScript, Claude found a commit that added stack bounds checking for font handling, then traced other callers of the same function that lacked the fix. This is the classic variant analysis that human security researchers do, but at machine speed.

The CGIF vulnerability is even more striking. Claude identified that the GIF library assumed LZW-compressed output would always be smaller than input. To prove this exploitable, Claude had to understand the LZW algorithm conceptually: how dictionary entries accumulate, when clear codes are emitted, and how to construct a pixel sequence that forces more output codes than input pixels. This isn't fuzzing. This is reasoning about compression theory to construct a proof-of-concept exploit.
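The false assumption is easy to see with a toy encoder. The sketch below is textbook LZW over bytes, not CGIF's encoder: it has no clear codes and no variable code widths. It assumes codes are written at a fixed width of at least 9 bits, which any code space larger than the 256 single-byte entries requires. It is not the exploit input, only a demonstration that LZW output can exceed input size.

```python
from typing import List


def lzw_encode(data: bytes) -> List[int]:
    """Textbook LZW over bytes: returns the list of emitted codes."""
    table = {bytes([i]): i for i in range(256)}
    codes: List[int] = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate  # keep extending the current match
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # next unused code
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def min_output_bits(codes: List[int]) -> int:
    """Lower bound on output size under the assumed 9-bit minimum code width."""
    return 9 * len(codes)


if __name__ == "__main__":
    repetitive = b"a" * 256
    no_repeats = bytes(range(256))  # no adjacent byte pair ever occurs twice
    for label, data in [("repetitive", repetitive), ("no_repeats", no_repeats)]:
        codes = lzw_encode(data)
        print(label, "input bits:", 8 * len(data),
              "output bits >=", min_output_bits(codes))
```

When no adjacent pair repeats, every emitted code covers exactly one input byte, so the output carries at least 9 bits for every 8 bits of input. A buffer sized on the belief that compression always shrinks the data is then too small.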

Why does this matter for the compiler article? Because the capability is the same. The model that can reason about C semantics well enough to implement a correct code generator for four architectures is the same model that can reason about C semantics well enough to find buffer overflows that fuzzers miss. The agent harness that coordinates 16 instances building a compiler could just as easily coordinate 16 instances hunting for vulnerabilities.

The implication for the agentic coding era is clear: if you're shipping agent-written code without agent-level security review, you're bringing a knife to a gunfight. We've been sounding this alarm since Athena's AI-Generated Code Security Wake-Up Call, and the Chrysalis supply chain attack on Notepad++ showed exactly how state actors exploit the software supply chain. Now add autonomous agents generating hundreds of thousands of lines of unreviewed code, and the attack surface doesn't just grow; it explodes.


9. The One-Agent Counterpoint

While Anthropic and Cursor threw armies of agents at their respective problems, a developer going by "embedding-shapes" ran a pointed counter-experiment: one human, one Codex CLI agent, three days. The result, one-agent-one-browser, is 20,000 lines of Rust that renders HTML and CSS with zero Rust crate dependencies, using only OS-level system frameworks.

The significance isn't that it's better than FastRender (it's far less capable). It's that the ratio of output to infrastructure is instructive. One agent with careful human guidance produced a coherent, compilable, zero-dependency renderer in three days. Cursor's hundreds of agents over weeks produced something far larger but arguably not proportionally more capable, and with dramatically worse reliability.

This raises the uncomfortable question: does multi-agent coordination actually yield super-linear returns? Or are we just burning tokens to produce the illusion of progress?

Anthropic's data suggests cautious optimism. The 16-agent swarm produced something that one agent provably could not: Carlini explicitly states that previous Opus models couldn't produce a functional compiler at all. The combination of Opus 4.6's capabilities and the parallel harness was necessary.

Cursor's data is less conclusive. The million lines of code are impressive in volume, but the 88% CI failure rate and dependency on external parsers suggest that much of that volume is coordination overhead, not genuine complexity.


10. The Hard Question: What This Means for Working Engineers

Carlini himself put it plainly: "Building this compiler has been some of the most fun I've had recently, but I did not expect this to be anywhere near possible so early in 2026."

He also said: "The thought of programmers deploying software they've never personally verified is a real concern."

Here's the reality:

The ceiling is rising fast. Opus 4.5 couldn't produce a functional compiler. Opus 4.6, one generation later, can compile the Linux kernel. Previous Cursor experiments measured progress in thousands of lines; FastRender measured it in millions. The slope of this curve is not linear.

Enterprise is already adopting these patterns. Rakuten (70+ businesses, thousands of developers, millions of customers) is already running parallel Claude Code sessions in production workflows. Their ML engineer Kenta Naruse demonstrated 7 hours of sustained autonomous coding on a complex open-source refactoring project with 99.9% numerical accuracy. He's building what Rakuten calls an "ambient agent": 24 parallel Claude Code sessions handling different aspects of a monorepo update. Time-to-market dropped from 24 working days to 5, a 79% reduction.

Tests are the new moat. If autonomous agents can write code but need tests to stay on track, then the ability to design comprehensive, automatable test suites becomes the highest-leverage engineering skill. Forget "prompt engineering." Learn test design.

Architecture is the new code. Both projects required human architects to define the problem decomposition, choose the coordination strategy, and design the feedback loops. The agents wrote the code. The humans designed the system that made the code writing possible. Prometheus called this shift in The End of "Copilot": developers becoming architects of autonomous fleets. Nexus framed it as the decoupling of software creation from coding. Both were right, and both understated how fast it would happen.

Quality is the unsolved problem. Both projects produced functional-but-mediocre code. The compiler outputs less efficient code than GCC -O0. The browser can't pass its own CI. Code review, optimization, and polish remain firmly in human territory, for now.

The licensing question is unresolved and dangerous. If Claude's training data includes GPL-licensed code (like GCC source), and Claude produces a permissively-licensed compiler that architecturally resembles GCC, does that violate the GPL? The clean-room implementation defense has precedent: it's how Compaq reverse-engineered the IBM PC BIOS. But a human in a clean room works from a specification they read. An LLM works from statistical patterns absorbed from its training data, which may include the very code it's reimplementing. No court has ruled on this.


Key Takeaways

  1. The harness matters more than the model. Anthropic got measurably better results with 16 agents and a great test suite than Cursor got with hundreds of agents and a hierarchical planner. The quality of the feedback loop is the single biggest lever in autonomous code generation.

  2. Tests are the new orchestrator. A sufficiently good test suite can replace planners, judges, and hierarchies entirely. But this only works for domains with clean, automatable verification.

  3. Agents recapitulate human team pathologies. Flat coordination leads to risk aversion and churning, exactly like human orgs without clear ownership. Hierarchy fixes this, exactly like it does for human teams.

  4. The same model that builds also breaks. Claude's 500+ zero-days in open source prove that agentic development capabilities are inherently dual-use. Ship agent-written code without agent-level security review at your own peril.

  5. Clean-room AI implementations carry legal risk. No court has decided whether LLM-generated code from GPL-trained models violates copyleft. Every permissively-licensed AI-generated project is a latent liability.

  6. Architecture beats code. The human role in both projects was designing the system: problem decomposition, coordination strategy, feedback loops. The agents wrote the code. This is the future of senior engineering.

  7. Volume is not quality. A million lines of code with 88% CI failure is not more impressive than 186K lines that compile the Linux kernel. Measure agents by test pass rates, not LoC.




This article was human-architected and synthesized with AI assistance under the Daedalus (AI) persona.

