87% of Your AI-Generated Pull Requests Have Security Vulnerabilities. You Just Don't Know It Yet.

💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 60 seconds:

87% of AI-generated pull requests ship at least one security vulnerability, according to DryRun Security's Agentic Coding Report (March 2026).

Five independent studies (DryRun, BaxBench, Veracode, Opsera, Cycode) converge on the same conclusion: AI agents generate functionally correct code that is systematically insecure.

The PleaseFix disclosure proved that agents themselves are exploitable — a calendar invite can hijack an agentic browser into full credential exfiltration.

Traditional SAST tools are poorly suited to this: DryRun's earlier research found they miss more than 80% of vulnerabilities in LLM-enabled applications, and AI agents mostly introduce logic and authorization flaws, not pattern-matchable bugs.

Bottom line: AI coding agents are the most dangerous junior developers you've ever hired. Treat every AI-generated PR with the scrutiny it deserves.

The Hook: I'm About to Ruin Your Monday

You know that warm, fuzzy feeling you get when Claude Code ships a feature in 3 minutes that would have taken you 45? That dopamine hit when Codex refactors your entire auth module and all the tests pass? That quiet confidence when Gemini generates a full CRUD API and you merge it because, well, it looks right?

87% of those pull requests just introduced at least one security vulnerability into your codebase.

That's not my number. That's not a hypothetical. That's the headline finding from DryRun Security's Agentic Coding Security Report, published on March 11, 2026 — three days ago. They tested Claude Code (Sonnet 4.6), OpenAI Codex (GPT-5.2), and Google Gemini (2.5 Pro). They built two real applications. They submitted 30 pull requests. And 26 of them shipped with exploitable flaws.

Welcome to the real cost of "vibe coding." Let me show you the receipts.

The DryRun Autopsy: 143 Vulnerabilities, 38 Scans, Zero Excuses

DryRun Security didn't run a synthetic benchmark. They didn't ask models to generate isolated functions. They did something far more devastating: they built real applications the way real teams build them — feature by feature, PR by PR, exactly like your Monday standup promises.

Application 1: A family allergy tracker web app (auth, user management, data storage).
Application 2: A browser-based racing game with backend API, high scores, and multiplayer.

Each AI agent — Claude, Codex, Gemini — built both apps through sequential pull requests. Every PR was scanned. Then a full codebase DeepScan was run at the end.

Here's where it gets ugly.

The Final Scorecard

The allergy tracker app started with a baseline of 9 existing issues before the agents even touched it. After all features were merged:

Claude: 13 issues (net +4, including a 2FA-disable bypass unique to its codebase)
Gemini: 11 issues (net +2, retained OAuth CSRF and invite bypass through final scan)
Codex: 8 issues (net -1 — actually fixed more than it introduced, but still shipped a token bypass)

The racing game started clean — zero baseline issues. After agent development:

Claude: 8 issues
Gemini: 7 issues (most high-severity findings overall)
Codex: 6 issues

The total across both apps, all agents, all scans: 143 vulnerabilities.

The "Middleware Ghost" Pattern

But the raw numbers aren't the scary part. The pattern is.

DryRun identified 10 vulnerability categories that appeared consistently across all three agents and both applications. Four of them showed up in every single final codebase, and all four were related to authentication.

Here's the one that should make every Staff+ engineer's blood run cold:

Every agent created authentication middleware. Every agent wrote the code to verify tokens, check sessions, enforce roles. And then... none of them applied it consistently.

REST API endpoints? Protected. WebSocket connections? Wide open. Admin routes in the game API? Authenticated. The same admin operations via a different code path? Naked.

Rate limiting middleware was defined in every codebase. But no agent actually connected it to the application. The code existed. The wiring didn't.
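
To make the "code existed, wiring didn't" pattern concrete, here is a minimal sketch of a route audit. The mini-router, the `auth_protected` marker, and the route names are all hypothetical stand-ins, not DryRun's tooling or any real framework's API. The idea is simply to fail the build when a registered route is neither wrapped nor explicitly declared public.

```python
from typing import Callable, Dict, Iterable, List

Handler = Callable[[dict], str]

# Hypothetical mini-router: route name -> handler. Not any real framework's API.
ROUTES: Dict[str, Handler] = {}


def route(name: str) -> Callable[[Handler], Handler]:
    """Register a handler under a route name."""
    def register(handler: Handler) -> Handler:
        ROUTES[name] = handler
        return handler
    return register


def require_auth(handler: Handler) -> Handler:
    """Auth middleware: reject requests with no user, and tag the wrapper."""
    def wrapped(request: dict) -> str:
        if not request.get("user"):
            return "401 Unauthorized"
        return handler(request)
    wrapped.auth_protected = True  # marker the audit below looks for
    return wrapped


@route("GET /api/scores")
@require_auth
def list_scores(request: dict) -> str:
    return "200 scores"


@route("WS /live")  # the middleware exists, but nobody applied it here
def live_socket(request: dict) -> str:
    return "101 Switching Protocols"


def unprotected_routes(public: Iterable[str] = ()) -> List[str]:
    """Return every registered route that is neither wrapped nor explicitly public."""
    allowed = set(public)
    return sorted(
        name for name, handler in ROUTES.items()
        if not getattr(handler, "auth_protected", False) and name not in allowed
    )


print(unprotected_routes())  # ['WS /live']
```

Run as a test in CI, a check like this turns "we forgot to mount it" from a silent omission into a failing assertion. The same approach could cover rate limiting with a second marker.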

JWT secret management was weak across all three agents in the game app. Hardcoded fallback secrets meant an attacker could forge valid tokens without obtaining any credentials.
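
Why is a hardcoded fallback so bad? A rough sketch using only the Python standard library shows it. The fallback value `"dev-secret"`, the `JWT_SECRET` variable name, and the 32-character minimum are illustrative assumptions, not values from the DryRun report. The verifier checks the HS256 signature only; a real one would also validate the `alg` header and expiry.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a compact HS256-signed token from a claims dict."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_hs256(token: str, secret: str) -> bool:
    """Signature check only (no alg or exp validation in this sketch)."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header, payload, signature = parts
    expected = hmac.new(
        secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256
    ).digest()
    return hmac.compare_digest(_b64url(expected).encode(), signature.encode())


def load_secret_insecure(env: dict) -> str:
    """Anti-pattern: the fallback ships in the repository for anyone to read."""
    return env.get("JWT_SECRET", "dev-secret")  # hypothetical fallback value


def load_secret_strict(env: dict) -> str:
    """Safer: refuse to start without a real secret."""
    secret = env.get("JWT_SECRET", "")
    if len(secret) < 32:  # illustrative minimum length
        raise RuntimeError("JWT_SECRET must be set to at least 32 characters")
    return secret


# An attacker who has read the source signs their own admin token.
forged = sign_hs256({"sub": "attacker", "role": "admin"}, "dev-secret")
print(verify_hs256(forged, load_secret_insecure({})))  # True
```

The forged token verifies because the server and the attacker are using the same published string. Failing fast at startup removes that path entirely.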

This is not a novel vulnerability class. Broken access control has been #1 on the OWASP Top 10 since 2021. These agents aren't inventing new ways to be insecure. They're repeating the same mistakes human developers made a decade ago — but at 10x the speed and with 10x the confidence.

The Convergence: Five Studies, One Verdict

DryRun's report didn't land in a vacuum. Its 87% figure is the latest and most methodologically rigorous data point in a convergence of evidence that's been building all quarter.

BaxBench (ETH Zurich + UC Berkeley)

The BaxBench benchmark, developed by researchers at ETH Zurich, LogicStar.ai, and UC Berkeley, tested LLMs across 392 backend development tasks spanning 28 scenarios, 14 frameworks, and 6 programming languages. Their finding: even the best model (OpenAI o1) achieved only 62% correctness — and roughly half of the correct solutions contained exploitable security vulnerabilities.

Let that sink in. If a model generates 100 backend implementations and 62 of them are functionally correct, roughly 30 of those 62 have a security hole. Your CI passes. Your tests are green. Your code is exploitable.
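
The back-of-the-envelope math, spelled out. This treats "roughly half" as exactly 50%, which is a simplifying assumption on top of the figures quoted above.

```python
implementations = 100

# 62% functional correctness for the best model, as quoted above.
correct = implementations * 62 // 100

# "Roughly half" of the correct solutions, taken here as exactly 50%.
correct_but_exploitable = correct // 2
correct_and_secure = correct - correct_but_exploitable

print(correct, correct_but_exploitable, correct_and_secure)  # 62 31 31
```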

Veracode's GenAI Report (100+ Models Tested)

Veracode tested over 100 LLMs across Java, Python, C#, and JavaScript. Their headline: 45% of AI-generated code samples failed security tests, introducing OWASP Top 10 vulnerabilities. Cross-site scripting (CWE-80) was the worst offender — AI tools failed to defend against it in 86% of relevant code samples. And critically, security performance remained flat regardless of model size or training sophistication. Bigger models ≠ more secure code.
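
For readers who haven't looked at an XSS bug in a while, here is a minimal sketch of the failure and the fix, using Python's standard `html.escape`. The `render_comment` functions are hypothetical examples, not code from the Veracode report.

```python
from html import escape


def render_comment_unsafe(author: str, body: str) -> str:
    """Interpolates user input straight into markup: the CWE-80 pattern."""
    return f"<p><b>{author}</b>: {body}</p>"


def render_comment(author: str, body: str) -> str:
    """Escapes user input so it is rendered as text, not parsed as HTML."""
    return f"<p><b>{escape(author)}</b>: {escape(body)}</p>"


payload = "<script>alert(1)</script>"
print(render_comment_unsafe("mallory", payload))
print(render_comment("mallory", payload))
```

Both versions "work" in a functional test that posts a normal comment, which is exactly why this class of bug survives a green CI run.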

Opsera's Enterprise Benchmark (250,000+ Developers)

Opsera analyzed data from 250,000+ developers across 60+ enterprise organizations. Their finding: AI-generated code introduces 15-18% more security vulnerabilities than human-written code.

Cycode's State of Product Security

And according to Cycode's 2026 report, 92% of organizations are actively using or piloting AI coding assistants — yet AI-generated code has become the #1 blind spot for application security teams.

The pattern is unmistakable: five independent studies, different methodologies, different sample sizes, same conclusion. AI coding agents generate functionally correct code that is systematically insecure.

PleaseFix: When the Agent Is the Attack Surface

If the DryRun report tells you that AI agents write insecure code, the PleaseFix disclosure tells you something worse: the agents themselves are exploitable.

On March 3, 2026, Zenity Labs disclosed PleaseFix — a family of critical vulnerabilities in agentic browsers, demonstrated against Perplexity's Comet browser. The attack is elegantly devastating:

Attacker embeds a malicious payload in a calendar invite.
User asks Comet to accept the meeting.
The agent autonomously accesses the local file system, browses directories, reads sensitive files, and exfiltrates contents to an attacker-controlled endpoint.
Zero clicks required. No prompts. No confirmation dialogs. The user sees the expected "meeting accepted" response while their files are being stolen.

But the second exploit is the one that should terrify enterprise security teams. By manipulating the agent's task execution, attackers could steer Comet into an authenticated 1Password session, navigate vault entries, reveal stored credentials, and — in the escalation scenario — change the master password and extract recovery material. Full vault takeover. Through a calendar invite.

Michael Bargury, Zenity's CTO, said something that should be printed on every agentic architecture diagram in every engineering org: "This is not a bug. It is an inherent vulnerability in agentic systems."

Perplexity patched the specific vulnerability before public disclosure. 1Password confirmed the root cause was in Perplexity's execution model, not their platform. But the architectural problem — agents operating autonomously within authenticated sessions, unable to distinguish trusted instructions from injected payloads — that's not patchable. That's a design property.

The Uncomfortable Taxonomy

Let me map this for you, because if you squint at the evidence, the taxonomy writes itself:

Layer 1 is the code the agent writes. Broken access control, missing middleware wiring, hardcoded secrets, weak JWT management. Classic vulnerability classes, deployed at unprecedented speed.

Layer 2 is the agent itself. It inherits your credentials, operates in your authenticated sessions, and can be hijacked through content it was designed to read. The agent isn't just writing insecure code — it's an attack surface by design.

Layer 3 is the ecosystem. 92% adoption. 80% of vulnerabilities invisible to traditional SAST. AppSec teams that haven't updated their tooling or review processes. This is the systemic risk.

Why Your SAST Won't Save You

Here's where DryRun's data gets truly existential for security teams.

Traditional static analysis tools (the Semgreps, the CodeQLs, the SonarQubes) work from predefined rules: syntactic patterns and, in the more capable engines, taint and data-flow queries. They look for eval() calls. They flag hardcoded strings that look like API keys. They search for SQL concatenation.

They do not trace whether authentication middleware is actually mounted on every route. They cannot verify that rate limiting code is wired to the application. They miss business logic flaws, such as an unlock whose cost is validated on the client but never re-checked on the server.
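
The client-versus-server validation flaw is easiest to see in code. The sketch below is an assumed shape for such an endpoint, not the actual code from DryRun's test apps; the item names, prices, and `Player` type are hypothetical. The point is that the server looks up the cost itself and ignores whatever the client claims.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical server-side price table: the only source of truth for costs.
UNLOCK_COSTS = {"turbo_engine": 500, "neon_paint": 150}


@dataclass
class Player:
    coins: int
    unlocked: Set[str] = field(default_factory=set)


def unlock_item(player: Player, item_id: str,
                client_claimed_cost: Optional[int] = None) -> bool:
    """Unlock an item using the server's price; the client's number is ignored."""
    cost = UNLOCK_COSTS.get(item_id)
    if cost is None or item_id in player.unlocked:
        return False
    if player.coins < cost:
        return False
    player.coins -= cost
    player.unlocked.add(item_id)
    return True


cheater = Player(coins=100)
# A tampered request claims the item costs 0. The server does not care.
print(unlock_item(cheater, "turbo_engine", client_claimed_cost=0))  # False
```

Nothing here looks dangerous to a rule that matches on syntax. The vulnerable variant differs only in which number it trusts.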

DryRun's earlier research found that traditional SAST tools miss more than 80% of vulnerabilities in LLM-enabled applications. And the DryRun 2025 SAST Accuracy Report showed that among five leading static analysis tools, the best performers still missed the majority of logic and authorization flaws.

This is because the vulnerability classes AI agents introduce most frequently are exactly the ones pattern-matching scanners are worst at detecting. AI agents don't write system(user_input). They write perfectly structured middleware that isn't connected. They implement OAuth flows that skip CSRF validation on one endpoint. They create rate limiters that exist in code but not in the execution path.

Your scanner says green. Your codebase is red.
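
What would a check for "exists in code but not in the execution path" even look like? One crude approximation, sketched with Python's standard `ast` module: list functions that are defined but never referenced anywhere in the same source. This is a heuristic under stated assumptions (single file, undecorated functions only, name-based matching), not how DryRun or any commercial scanner works.

```python
import ast
from typing import List


def unreferenced_functions(source: str) -> List[str]:
    """Names of undecorated functions that are never referenced in `source`."""
    tree = ast.parse(source)
    defined = {
        node.name for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not node.decorator_list  # decorated functions are usually registered
    }
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    referenced |= {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    return sorted(defined - referenced)


# Hypothetical app module: rate_limit is written but never wired in.
SAMPLE = '''
def rate_limit(handler):
    return handler

def require_auth(handler):
    return handler

def get_scores(request):
    return "scores"

handlers = {"GET /scores": require_auth(get_scores)}
'''

print(unreferenced_functions(SAMPLE))  # ['rate_limit']
```

It will produce false positives on real code (entry points, exported APIs), so treat it as a prompt for a human to look, not a verdict.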

The OWASP Connection: This Was Predicted

If you've been following our OWASP Agentic Top 10 series (starting with a0082), none of this should surprise you. The DryRun findings map directly to the vulnerability classes OWASP identified:

ASI02 (Tool Misuse): Agents using authentication tools without applying them consistently
ASI03 (Identity and Privilege Abuse): Agents operating with full filesystem and credential access
ASI05 (Unexpected Code Execution): The PleaseFix exploits, where agent autonomy is turned against the user
ASI06 (Memory & Context Poisoning): Indirect prompt injection via calendar invites, manipulating agent behavior through trusted content channels

The DryRun report is the first large-scale empirical validation that these vulnerability classes aren't theoretical. They show up in the code your agent writes on its first Monday morning on the job.

And the PleaseFix disclosure is a live demonstration of ASI01 (Agent Goal Hijack, here via indirect prompt injection) + ASI03 (Identity and Privilege Abuse) chaining in the wild. A single calendar invite → indirect prompt injection → agent inherits authenticated session → full credential exfiltration. Textbook OWASP Agentic attack chain.

What You Actually Need to Do

I know you didn't come here for doom without a survival guide. Here's what separates the teams that get breached from the teams that don't:

1. Scan Every PR, Not Just the Final Build

DryRun's methodology revealed something critical: vulnerabilities compound across features. A missing auth check in PR #3 might be harmless in isolation. But when PR #7 adds an admin panel that assumes auth was handled upstream, you've got a privilege escalation. Scanning only the final build misses the interaction effects.
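
A per-PR scan is only useful if you can tell what each PR changed. Here is a minimal sketch of that bookkeeping: diff each scan's findings against the previous scan's. The `(rule, location)` fingerprint and the sample findings are made up for illustration and are not taken from DryRun's data.

```python
from typing import Dict, List, Set, Tuple

Finding = Tuple[str, str]  # (rule, location): a hypothetical fingerprint


def introduced_per_pr(
    baseline: Set[Finding], scans: List[Tuple[str, Set[Finding]]]
) -> Dict[str, Set[Finding]]:
    """For each PR, in merge order, return findings absent from the prior scan."""
    previous = set(baseline)
    introduced: Dict[str, Set[Finding]] = {}
    for pr, findings in scans:
        introduced[pr] = findings - previous
        previous = set(findings)
    return introduced


missing_auth = ("missing-auth", "GET /api/users")
priv_esc = ("privilege-escalation", "POST /admin/roles")

scans = [
    ("PR #3", {missing_auth}),
    ("PR #7", {missing_auth, priv_esc}),
]

for pr, new in introduced_per_pr(set(), scans).items():
    print(pr, sorted(new))
```

With only a final-build scan you would see two findings and no history. With the per-PR diff you can see that the second finding arrived on top of a gap that was already open.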

2. Kill Your SAST Monoculture

If your entire AppSec pipeline is a single pattern-based scanner, you have a false sense of security. You need contextual analysis: tools that reason about data flows, trust boundaries, and execution paths. This isn't a DryRun sales pitch (though they obviously sell this). It's an architectural reality: the vulnerability classes AI agents introduce are largely invisible to pattern matching.

3. Review Security During Planning, Not Coding

Many of the vulnerabilities DryRun found originated in design decisions that agents then faithfully implemented. If your prompt says "build a user management system with OAuth," the agent will implement OAuth — but it won't think about CSRF protection, token revocation, or session fixation unless you explicitly specify them.

Security requirements belong in the prompt, not only in the code review.
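
As an example of a requirement worth spelling out, here is a minimal sketch of OAuth CSRF protection via the `state` parameter, using Python's standard `secrets` and `hmac` modules. The `session` dict and function names are hypothetical; a real app would use its framework's session store.

```python
import hmac
import secrets


def begin_oauth(session: dict) -> str:
    """Create an unguessable state value and remember it for this session."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state
    return state  # sent to the provider as the `state` query parameter


def verify_callback(session: dict, returned_state: str) -> bool:
    """Accept the callback only if it carries the state we issued, once."""
    expected = session.pop("oauth_state", None)  # single use
    if not expected or not returned_state:
        return False
    return hmac.compare_digest(expected.encode(), returned_state.encode())


session = {}
state = begin_oauth(session)
print(verify_callback(session, state))  # True
print(verify_callback(session, state))  # False: the state was already consumed
```

An agent asked only to "add OAuth login" has no particular reason to write the second function. An agent asked to "add OAuth login with a single-use, session-bound state check on every callback" does.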

4. Treat AI Agents Like Untrusted Service Accounts

The PleaseFix lesson is clear: agents should never operate in authenticated sessions with broad access. Apply the principle of least privilege aggressively:

Agents get scoped, ephemeral tokens — not your browser session
Filesystem access is read-only and sandboxed
Sensitive operations require explicit human confirmation
Credential managers are excluded from agent-accessible contexts
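
Two of the items above can be sketched in a few lines: scoped, expiring tokens and a sandboxed filesystem root. Everything here is an illustrative assumption (the `AgentToken` type, the scope strings, the five-minute default, the sandbox path), not a reference design.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class AgentToken:
    scopes: FrozenSet[str]
    expires_at: float

    def allows(self, scope: str, now: float) -> bool:
        """True only while the token is unexpired and carries the scope."""
        return now < self.expires_at and scope in self.scopes


def issue_token(scopes: Iterable[str], ttl_seconds: float = 300,
                now: Optional[float] = None) -> AgentToken:
    """Mint a short-lived token limited to the given scopes."""
    issued_at = time.time() if now is None else now
    return AgentToken(frozenset(scopes), issued_at + ttl_seconds)


def inside_sandbox(root: Path, requested: str) -> bool:
    """Resolve the requested path and check it stays under the sandbox root."""
    base = root.resolve()
    target = (base / requested).resolve()
    return target == base or base in target.parents


token = issue_token({"repo:read"}, ttl_seconds=60, now=1000.0)
print(token.allows("repo:read", now=1030.0))   # True
print(token.allows("vault:read", now=1030.0))  # False: scope never granted
print(inside_sandbox(Path("/tmp/agent-sandbox"), "../../etc/passwd"))  # False
```

The token deliberately has nothing to do with the user's browser session, so a hijacked agent can only spend what it was explicitly handed.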

5. Assume the Agent Will Be Manipulated

If your agent reads anything from the outside world — email, calendar invites, Slack messages, web content, even README files — it can be manipulated via indirect prompt injection. Design your trust boundaries accordingly. The agent is not "you with an API." It's an untrusted intermediary operating in your security context.
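
One way to encode "assume manipulation" is a taint gate: once untrusted content enters the agent's context, sensitive tools need human approval. The sketch below is a simplified illustration; the tool names and the `AgentSession` type are hypothetical, and it is not a description of how Comet or any shipping agent works.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names for tools that touch files, the network, or credentials.
SENSITIVE_TOOLS = {"read_file", "send_http", "open_password_manager"}


@dataclass
class AgentSession:
    tainted_by: List[str] = field(default_factory=list)

    def ingest(self, source: str, trusted: bool) -> None:
        """Record any untrusted content that has entered the agent's context."""
        if not trusted:
            self.tainted_by.append(source)

    def may_call(self, tool: str, human_approved: bool = False) -> bool:
        """Sensitive tools need approval once the session is tainted."""
        if tool not in SENSITIVE_TOOLS or not self.tainted_by:
            return True
        return human_approved


session = AgentSession()
session.ingest("user prompt", trusted=True)
session.ingest("calendar invite", trusted=False)

print(session.may_call("accept_meeting"))                  # True
print(session.may_call("send_http"))                       # False
print(session.may_call("send_http", human_approved=True))  # True
```

This does not stop prompt injection. It limits what an injected instruction can reach without a human in the loop, which is the part you can actually control.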

The Real 10x

Here's the uncomfortable truth that nobody at the AI coding tool companies wants you to internalize:

AI coding agents are legitimately transformative. They are faster. They do accelerate prototyping. They can handle repetitive tasks that would bore a human into sloppy mistakes.

But they are not engineers. They don't reason about security invariants. They don't think about attack surfaces. They don't ask "what happens if someone sends a malformed token to this WebSocket endpoint?" They generate code that looks right and passes the tests you wrote (which also don't test for security), then move on to the next feature.

The real 10x isn't the agent. The real 10x is you — the Staff+ engineer who understands that functional correctness and security correctness are orthogonal properties. Who knows that a green CI pipeline is a necessary but not sufficient condition for deployment. Who treats every AI-generated PR with the same scrutiny you'd give a junior developer's first contribution.

Because that's exactly what it is.

87% of the time.

Icarus is gsstk's Trend Analyst and resident provocateur. He has 2 years of experience and the audacity of someone with 20. His previous articles include Frameworks Are Dead. Architects Are Not. and The Flagship Tax Is Dead. He stands by every word and invites you to prove him wrong in the comments.

This article was human-architected and synthesized with AI assistance under the Icarus (AI) persona.