The Vibe & Verify Fallacy: Why AI-Generated Tests Are Creating a False Sense of Code Quality

With 84% of devs using AI daily, 'vibe and verify' is the new workflow. But letting AI write tests for its own code creates a confirmation bias trap.

Human-architected research synthesized with the assistance of AI personas.
17 min read

TL;DR / Executive Summary

Key takeaways in 90 seconds:

  • The AI adoption reality: Roughly 84% of professional developers now integrate generative AI tools into their daily coding routines. This speed of generation has birthed the "vibe and verify" workflow: generating code on a gut feeling and validating it afterward.
  • The confirmation bias trap: Humans are cognitively wired to seek confirmation of success. When reviewing syntactically perfect, AI-generated code, developers suffer from anchoring bias, overlooking subtle logical errors and architectural gaps.
  • Tautological testing: Allowing an AI assistant to write both the production code and the unit tests corresponding to it creates a closed circle of confirmation. The AI repeats its own logical bugs in the assertions and mock definitions, guaranteeing that tests pass while leaving critical flaws untouched.
  • The explanation illusion: Detailed chain-of-thought explanations generated by LLMs make humans significantly more likely to accept buggy code, mistaking fluent, logical-sounding descriptions for operational correctness.
  • The path forward: Re-establish software quality by decoupling generation from validation. Write adversarial test prompts, apply human-led test-driven development, and enforce strict, checklist-based peer reviews rather than relying on automated code-and-test loops.

The speed at which we can manifest software has outpaced our ability to reason about its correctness. With generative artificial intelligence tools now a staple of professional software engineering, the bottleneck of coding is no longer syntax, API lookups, or boilerplate creation. According to the recent McKinsey DORA Insights, roughly 84% of developers now integrate generative AI assistants into their daily workflows.

This transformation has shifted the primary engineering loop. Instead of writing code from scratch, developers act as prompt architects and editors. This workflow has been described as the "vibe and verify" paradigm: a developer describes a feature, generates a block of code based on a prompt ("vibe"), and then reviews the code to confirm it functions correctly ("verify").

On the surface, this workflow appears highly efficient. However, it harbors a systemic vulnerability at the boundary between human oversight and machine output. The process of verification is not a neutral, objective act. It is performed by human minds that are susceptible to cognitive distortions, primarily confirmation bias and anchoring bias. When we let AI write both the code and the unit tests that validate that code, we construct a closed circle of confirmation. The tests do not break the code; they simply codify the model's initial mistakes, presenting an illusion of coverage and quality that crumbles under production loads.


The Cognitive Architecture of the Review Trap

To understand why verification fails in the AI-assisted loop, we must examine the cognitive mechanics of code review. Manual code review has always been a challenging task. When an engineer reads code written by another human, their brain performs a mental compilation, tracing execution paths, evaluating state changes, and searching for boundary violations.

Generative AI disrupts this process by introducing two distinct forms of cognitive bias:

1. Anchoring Bias

When an AI assistant generates a block of code, it produces syntactically flawless, well-commented, and structurally clean output. This high-fidelity presentation acts as a cognitive anchor. The reviewer accepts this output as the baseline truth. Rather than asking how the feature should be built from first principles, the reviewer’s focus narrows to verifying if the generated code looks correct. The developer is no longer architecting; they are editing. Anchoring makes it exceptionally difficult to spot missing edge cases, because the mind is occupied with parsing the paths that are present, rather than conceptualizing the paths that should be present.

2. The Illusion of Oversight

A critical finding from OpenAI's research on human-AI oversight shows that human verification accuracy drops significantly when an AI provides detailed explanations of its reasoning. When an LLM outputs code alongside a step-by-step "chain of thought" explanation, the human reviewer is psychologically disarmed. The fluent, authoritative tone of the explanation creates a false sense of security. The reviewer reads the explanation, agrees with the logic, and projects that agreement onto the code itself. In practice, LLMs can write highly persuasive explanations for code that contains severe, silent runtime errors. The explanation is verified, but the code remains broken.

Under the pressure of delivery schedules, the "verify" stage of "vibe and verify" often decays into a superficial check: does the code compile, and does it pass a basic happy-path execution? If the answer is yes, the developer moves to push.


The Circle of Confirmation: Tautological Testing

The vulnerability of "vibe and verify" becomes critical when teams automate the creation of unit tests. Recognizing that manual code review is fallible, developers often ask the same AI assistant that wrote the production code to generate the corresponding unit tests.

This is where the closed circle of confirmation locks. An LLM generates code based on its internal parameters, weights, and context. If that model makes a logical error or a false assumption about a dependency, that same error or assumption is baked into its internal representation of the problem. When prompted to "write unit tests for this code," the model does not independently audit the code. It translates its existing internal state—including any bugs—into tests.

The result is tautological testing: tests that are designed to validate the code's current behavior, rather than its required behavior. The tests pass because they share the same blind spots as the code.

A Concrete Failure Case: The Non-Atomic Distributed Rate Limiter

Consider a common scenario: building a distributed, Redis-backed rate limiter for an API. A developer prompts an AI assistant to generate a TypeScript implementation of a token bucket rate limiter. The AI outputs the following code:

```typescript
// AI-Generated Rate Limiter containing a concurrency vulnerability
export class RedisRateLimiter {
  private redis: any;
  private capacity: number = 10;
  private refillRate: number = 0.01; // tokens per millisecond

  constructor(redisClient: any) {
    this.redis = redisClient;
  }

  public async allowRequest(userId: string): Promise<boolean> {
    const key = `limits:${userId}`;
    const rawBucket = await this.redis.get(key); // Read State
    const now = Date.now();

    let bucket = rawBucket
      ? JSON.parse(rawBucket)
      : { tokens: this.capacity, lastRefill: now };

    // Calculate refill
    const elapsed = now - bucket.lastRefill;
    const refillTokens = elapsed * this.refillRate;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + refillTokens);
    bucket.lastRefill = now;

    // Evaluate limit
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      await this.redis.set(key, JSON.stringify(bucket)); // Write State
      return true;
    }

    return false;
  }
}
```

To the human reviewer vibing and verifying, this code looks correct. It reads the bucket state, calculates the elapsed time, adds the refilled tokens, checks if there are enough tokens, decrements the count, saves the state, and returns the decision. The logic is clean and easy to follow.

However, this code contains a severe concurrency bug. The read operation (this.redis.get) and the write operation (this.redis.set) are separate network round-trips, so the read-modify-write sequence is not atomic. If a client fires two API requests almost simultaneously, whether they land on the same process or on different instances, both handlers can complete the read step before either has completed the write step. They will both read the same token count, perform the same calculation, decrement the same value, and write back identical states. Both requests are allowed, yet only one token is consumed. Under load, this lets traffic exceed the configured limit, and the implementation fails to protect downstream services.

Now, the developer asks the AI to write unit tests for this class. The AI generates the following test suite:

```typescript
// AI-Generated Unit Test validating the buggy rate limiter
import { RedisRateLimiter } from "./rate-limiter";

describe("RedisRateLimiter", () => {
  let mockRedis: any;
  let limiter: RedisRateLimiter;

  beforeEach(() => {
    mockRedis = {
      get: jest.fn(),
      set: jest.fn(),
    };
    limiter = new RedisRateLimiter(mockRedis);
  });

  it("should allow request when tokens are available and decrement them", async () => {
    const initialBucket = { tokens: 5, lastRefill: Date.now() - 1000 };
    mockRedis.get.mockResolvedValue(JSON.stringify(initialBucket));
    mockRedis.set.mockResolvedValue("OK");

    const result = await limiter.allowRequest("user-123");

    expect(result).toBe(true);
    expect(mockRedis.get).toHaveBeenCalledWith("limits:user-123");
    // Verify the written state: 5 tokens + 10 refilled (1000 ms * 0.01),
    // capped at the capacity of 10, then decremented to 9
    expect(mockRedis.set).toHaveBeenCalledWith(
      "limits:user-123",
      expect.stringContaining('"tokens":9')
    );
  });

  it("should deny request when bucket is empty", async () => {
    const emptyBucket = { tokens: 0, lastRefill: Date.now() };
    mockRedis.get.mockResolvedValue(JSON.stringify(emptyBucket));

    const result = await limiter.allowRequest("user-123");

    expect(result).toBe(false);
    expect(mockRedis.set).not.toHaveBeenCalled();
  });
});
```

These tests pass. They achieve 100% statement and branch coverage for the allowRequest method. When the developer runs the test suite in their CI pipeline, everything is green. The SonarQube quality gate is satisfied.

Why did the tests pass? Because the test suite was designed to confirm the model's assumptions. The mock setup mirrors the sequential execution model assumed by the code. The tests do not model concurrent requests, network delays, or interleaved reads and writes. The unit tests are tautological: they prove that the code behaves exactly like the code behaves. They fail to verify whether the code behaves the way a correct distributed system should.


Concurrency Breakdown: The Event Loop vs. The Database

To understand why this unit test passes while production fails, we must look at how JavaScript handles asynchronous operations. Node.js executes JavaScript on a single thread using an event loop. However, calls to Redis are asynchronous: the command is handed to the operating system's non-blocking network stack, and the JavaScript thread is free to do other work until the reply arrives.

When allowRequest("user-123") is executed, the runtime hits the first await this.redis.get(key). The function is suspended and control returns to the event loop, which can handle other work in the meantime, including new incoming requests. If a second request for user-123 arrives at this exact moment, it starts execution and also hits the await this.redis.get(key) line.

At this point, both requests are waiting for Redis to return the same key. When Redis responds, both operations are scheduled back onto the event loop. They resume sequentially, using the exact same snapshot of the token bucket.

Suppose the bucket holds 5 tokens and no refill accrues between the two calls. Each request computes 5 - 1 = 4 and writes 4, so the second write simply overwrites the first with the same value. Instead of being decremented twice (leaving 3 tokens), the bucket is decremented once and written twice as 4 tokens. In an API gateway under a traffic spike, this non-atomic implementation can let thousands of requests beyond the limit slip through, potentially overwhelming downstream database nodes.

The AI-generated tests failed to catch this because they mocked Redis as a simple, synchronous key-value store that executes in total isolation. There is no simulation of overlapping promises, network latency, or state mutations occurring during the asynchronous gap.
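The lost update can be reproduced without a real Redis server. The following is a minimal sketch under stated assumptions: `SlowFakeRedis` is a hypothetical in-memory stand-in that keeps state and delays every call, `allowRequest` is a condensed copy of the read-compute-write logic from the buggy limiter, and the clock is passed in as a fixed value so that no refill happens during the run.

```typescript
// Minimal async key-value interface; a stand-in for a real Redis client.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Hypothetical in-memory fake: stateful, and every call takes `latencyMs` to complete.
class SlowFakeRedis implements KeyValueStore {
  private data = new Map<string, string>();
  private latencyMs: number;
  constructor(latencyMs: number) {
    this.latencyMs = latencyMs;
  }
  private delay(): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, this.latencyMs));
  }
  async get(key: string): Promise<string | null> {
    await this.delay();
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    await this.delay();
    this.data.set(key, value);
  }
}

const CAPACITY = 10;
const REFILL_PER_MS = 0.01;

// Condensed copy of the non-atomic logic; `now` is injected to keep the demo deterministic.
async function allowRequest(store: KeyValueStore, userId: string, now: number): Promise<boolean> {
  const key = `limits:${userId}`;
  const raw = await store.get(key); // read
  const bucket = raw ? JSON.parse(raw) : { tokens: CAPACITY, lastRefill: now };
  const elapsed = now - bucket.lastRefill;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsed * REFILL_PER_MS);
  bucket.lastRefill = now;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    await store.set(key, JSON.stringify(bucket)); // write
    return true;
  }
  return false;
}

async function demonstrateLostUpdate(): Promise<{ allowed: boolean[]; tokensLeft: number }> {
  const store = new SlowFakeRedis(5);
  const now = 1_000_000; // fixed clock: elapsed time is always 0, so nothing is refilled
  await store.set("limits:user-123", JSON.stringify({ tokens: 5, lastRefill: now }));

  // Both calls start before either has written, so both read the same snapshot.
  const allowed = await Promise.all([
    allowRequest(store, "user-123", now),
    allowRequest(store, "user-123", now),
  ]);

  const finalState = JSON.parse((await store.get("limits:user-123")) as string);
  return { allowed, tokensLeft: finalState.tokens };
}

demonstrateLostUpdate().then((outcome) => console.log(outcome));
```

Both requests are allowed and the bucket ends at 4 tokens rather than 3. A stateless mock cannot surface this, because it has no shared state for the second write to clobber.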


Correcting the Bug: Atomic Scripting

To correct this vulnerability, the rate limiter must execute the read, calculation, and write steps atomically inside Redis. In Redis, atomicity is achieved either through transactions using optimistic locking (WATCH/MULTI/EXEC) or by executing a Lua script directly on the database engine. Since Redis executes Lua scripts in a single-threaded fashion, no other command can run while the script is active, preventing any interleaving.

A senior engineer, reviewing the AI's buggy draft, would replace the entire asynchronous JavaScript calculation with an atomic Redis Lua script. Here is the corrected, production-ready implementation:

```typescript
// Corrected, Atomic Redis Rate Limiter
export class RedisRateLimiter {
  private redis: any;
  private capacity: number = 10;
  private refillRate: number = 0.01; // tokens per ms

  // Lua script executed atomically inside Redis
  private readonly rateLimitScript = `
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refillRate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local requested = 1

    local rawBucket = redis.call('get', key)
    local bucket = { tokens = capacity, lastRefill = now }

    if rawBucket then
      local decoded = cjson.decode(rawBucket)
      bucket.tokens = tonumber(decoded.tokens)
      bucket.lastRefill = tonumber(decoded.lastRefill)
    end

    -- Refill calculation
    local elapsed = now - bucket.lastRefill
    local refillTokens = elapsed * refillRate
    bucket.tokens = math.min(capacity, bucket.tokens + refillTokens)
    bucket.lastRefill = now

    -- Evaluation
    if bucket.tokens >= requested then
      bucket.tokens = bucket.tokens - requested
      redis.call('set', key, cjson.encode(bucket))
      return 1 -- Allowed
    else
      return 0 -- Denied
    end
  `;

  constructor(redisClient: any) {
    this.redis = redisClient;
  }

  public async allowRequest(userId: string): Promise<boolean> {
    const key = `limits:${userId}`;
    const now = Date.now();

    // Execute the script atomically inside the Redis engine
    const result = await this.redis.eval(
      this.rateLimitScript,
      1, // Number of keys
      key,
      this.capacity.toString(),
      this.refillRate.toString(),
      now.toString()
    );

    return result === 1;
  }
}
```

By pushing the state evaluation logic into Redis, we eliminate the asynchronous network gap between the read and write operations. The entire evaluation is now atomic.


The Mocking Trap and Architectural Drift

The rate limiter failure case highlights a broader industry problem: the Mocking Trap. Generative AI tools are excellent at creating mocks for external databases, messaging queues, and third-party APIs. However, because LLMs generate text based on statistical probabilities rather than an actual understanding of system state, they frequently make incorrect assumptions about how those external dependencies behave in the real world.

When an AI writes tests, it mocks these external systems based on its own assumptions. If the model incorrectly believes that a database driver throws a specific exception class under a unique constraint violation, it will mock that exception in the test suite and write production catch blocks that handle it. The tests will pass because the mock and the catch block are in perfect alignment. In production, however, the actual database driver might return an error code instead of throwing, causing the application to crash.

This mismatch leads to architectural drift: a state where the codebase and its tests are in complete agreement with each other, but in complete disagreement with the physical infrastructure they run on.

This drift is particularly dangerous because it bypasses traditional metrics of code quality. Standard test coverage metrics (such as line, branch, and function coverage) only measure which lines of code were executed during a test run. They do not measure if the assertions were semantically valid, or if the mock inputs represent real production environments. A team can easily achieve 100% test coverage with AI-generated tests while shipping a system that fails under its first real database transaction.
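The mocking trap described above can be reproduced in a few lines. The sketch below is illustrative only: `UserDriver`, `insertUser`, `UniqueViolationError`, and the `DUPLICATE_KEY` code are hypothetical names, not the API of any real database driver. It compares a mock built from the model's assumption (the driver throws) with a fake built from the driver's assumed documented contract (the driver returns an error code).

```typescript
// Result shape of a hypothetical database driver (not a real library).
type InsertResult = { ok: true } | { ok: false; code: string };

interface UserDriver {
  insertUser(email: string): Promise<InsertResult>;
}

// Exception class the generated code assumes the driver throws on duplicates.
class UniqueViolationError extends Error {}

// Production code written under that assumption: the return value is never inspected.
async function registerUser(driver: UserDriver, email: string): Promise<"created" | "duplicate"> {
  try {
    await driver.insertUser(email);
    return "created";
  } catch (err) {
    if (err instanceof UniqueViolationError) return "duplicate";
    throw err;
  }
}

// Mock built from the same assumption: it throws, so the catch block and the test agree.
const assumedMock: UserDriver = {
  insertUser: async (): Promise<InsertResult> => {
    throw new UniqueViolationError("duplicate email");
  },
};

// Fake built from the (hypothetical) driver contract: duplicates come back as an error code.
const contractFake: UserDriver = {
  insertUser: async (): Promise<InsertResult> => ({ ok: false, code: "DUPLICATE_KEY" }),
};

async function compareDoubles(): Promise<{ withAssumedMock: string; withContractFake: string }> {
  return {
    // Looks correct: the duplicate is detected.
    withAssumedMock: await registerUser(assumedMock, "a@example.com"),
    // Against the contract fake, the duplicate is silently reported as a success.
    withContractFake: await registerUser(contractFake, "a@example.com"),
  };
}

compareDoubles().then((outcome) => console.log(outcome));
```

Against the assumed mock the function reports "duplicate"; against the contract fake it reports "created". Every line of `registerUser` can be covered by tests built on the first mock without this mismatch ever showing up, which is why the test double has to be derived from the dependency's documentation rather than from the code under test.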


Rebuilding Verification Rigor

To survive the AI-assisted era without a massive regression in software reliability, engineering organizations must establish a separation of concerns. We must break the closed loop of confirmation by ensuring that the agent or mind generating the implementation is never the one generating the validation.

Here are four practical engineering patterns to de-risk "vibe and verify":

1. Adversarial Test Generation (Red-Teaming the Prompt)

Instead of asking an AI assistant to "write unit tests for this code," prompt a separate AI session (or a different model family) with the role of an adversarial QA engineer. Use a prompt that explicitly demands destruction. For example, to verify our rate limiter, we would provide the implementation code and use the following prompt:

"Act as a senior QA engineer specialized in distributed systems and concurrency testing. Analyze the attached code for race conditions, non-atomic database access, serialization bugs, and error-handling gaps. Write a Jest test suite that actively attempts to expose these vulnerabilities. Specifically, simulate concurrent execution pathways using Promise.all to test the code under race conditions. Do not mock external database clients as simple synchronous returns unless you also simulate overlapping states."

If we apply this adversarial prompt, the secondary model is far more likely to identify the lack of atomic operations and generate a test that executes concurrent requests, exposing the race condition before the code leaves the developer's machine.
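As a rough idea of what such an adversarial test might look like, the sketch below fires a burst of overlapping requests with Promise.all and checks an invariant (never more allowed requests than the capacity) instead of mirroring the implementation. `RacyLimiter` and `AtomicLimiter` are hypothetical in-memory stand-ins for the get/set version and the Lua version, used here only so the check can run without Redis or a test framework.

```typescript
interface Limiter {
  allowRequest(userId: string): Promise<boolean>;
}

// Fires `burstSize` overlapping requests and counts how many were let through.
async function countAllowedInBurst(limiter: Limiter, userId: string, burstSize: number): Promise<number> {
  const calls = Array.from({ length: burstSize }, () => limiter.allowRequest(userId));
  const results = await Promise.all(calls);
  return results.filter(Boolean).length;
}

// Simulated network round-trip.
const tick = (): Promise<void> => new Promise((resolve) => setTimeout(resolve, 1));

// Stand-in for the get/compute/set version: the awaits leave a gap between read and write.
class RacyLimiter implements Limiter {
  private tokens: number;
  constructor(capacity: number) {
    this.tokens = capacity;
  }
  async allowRequest(_userId: string): Promise<boolean> {
    await tick(); // simulated GET round-trip
    const snapshot = this.tokens; // read
    if (snapshot < 1) return false;
    await tick(); // simulated SET round-trip
    this.tokens = snapshot - 1; // write based on a possibly stale snapshot
    return true;
  }
}

// Stand-in for the Lua version: check and decrement happen in one uninterrupted step.
class AtomicLimiter implements Limiter {
  private tokens: number;
  constructor(capacity: number) {
    this.tokens = capacity;
  }
  async allowRequest(_userId: string): Promise<boolean> {
    await tick(); // simulated EVAL round-trip
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

async function runBurstChecks(): Promise<{ racy: number; atomic: number }> {
  const capacity = 10;
  const burstSize = 50;
  return {
    racy: await countAllowedInBurst(new RacyLimiter(capacity), "user-123", burstSize),
    atomic: await countAllowedInBurst(new AtomicLimiter(capacity), "user-123", burstSize),
  };
}

runBurstChecks().then((counts) => console.log(counts));
```

With a capacity of 10 and a burst of 50, the racy stand-in lets through more requests than the capacity, while the atomic one allows exactly 10. In a Jest suite the same check would be a single assertion that the allowed count does not exceed the capacity.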

2. Human-Led Test-Driven Development (TDD)

The most effective way to prevent anchoring bias is to write the tests before generating the production code. By defining the interface, assertions, and invariants first, the developer anchors the requirements in the test suite.

When the AI is subsequently asked to write the code, it must satisfy the pre-existing assertions. This forces the model to write code that conforms to external requirements, rather than generating code and then bending the requirements to match.
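One way this could look in practice is sketched below. The requirements are written down first as executable cases, and any candidate implementation, human or generated, is then run against them. The `TakeToken` signature, the case names, and both candidate functions are assumptions made for illustration; they are not part of the rate limiter shown earlier.

```typescript
interface Bucket {
  tokens: number;
  lastRefill: number;
}

interface Decision {
  allowed: boolean;
  bucket: Bucket;
}

// Interface agreed on before any implementation exists.
type TakeToken = (bucket: Bucket, now: number, capacity: number, refillPerMs: number) => Decision;

// Human-written requirements, expressed as checks against any implementation.
const spec: Array<{ name: string; check: (take: TakeToken) => boolean }> = [
  {
    name: "refill never exceeds capacity",
    check: (take) => take({ tokens: 9, lastRefill: 0 }, 10_000, 10, 0.01).bucket.tokens <= 10,
  },
  {
    name: "an empty bucket denies the request",
    check: (take) => take({ tokens: 0, lastRefill: 1000 }, 1000, 10, 0.01).allowed === false,
  },
  {
    name: "an allowed request costs exactly one token",
    check: (take) => take({ tokens: 5, lastRefill: 1000 }, 1000, 10, 0.01).bucket.tokens === 4,
  },
  {
    name: "tokens never go negative",
    check: (take) => take({ tokens: 0.5, lastRefill: 1000 }, 1000, 10, 0.01).bucket.tokens >= 0,
  },
];

// Returns the names of the requirements a candidate violates.
function failingCases(take: TakeToken): string[] {
  return spec.filter((c) => !c.check(take)).map((c) => c.name);
}

// Candidate that caps the refill at the capacity.
const cappedCandidate: TakeToken = (bucket, now, capacity, refillPerMs) => {
  const refilled = Math.min(capacity, bucket.tokens + (now - bucket.lastRefill) * refillPerMs);
  if (refilled >= 1) {
    return { allowed: true, bucket: { tokens: refilled - 1, lastRefill: now } };
  }
  return { allowed: false, bucket: { tokens: refilled, lastRefill: now } };
};

// Candidate that forgets the cap; the pre-existing spec rejects it.
const uncappedCandidate: TakeToken = (bucket, now, _capacity, refillPerMs) => {
  const refilled = bucket.tokens + (now - bucket.lastRefill) * refillPerMs;
  if (refilled >= 1) {
    return { allowed: true, bucket: { tokens: refilled - 1, lastRefill: now } };
  }
  return { allowed: false, bucket: { tokens: refilled, lastRefill: now } };
};

console.log(failingCases(cappedCandidate));
console.log(failingCases(uncappedCandidate));
```

The capped candidate violates none of the cases, while the uncapped one fails "refill never exceeds capacity". Because the cases were fixed before either candidate existed, neither implementation had the chance to shape what counts as correct.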

3. Assertion-Rich Implementations and Invariants

Design production code with strict internal validation. Use runtime assertions to enforce invariants (conditions that must always remain true). If a rate-limiting bucket should never exceed its capacity or fall below zero, assert it directly in the logic:

```typescript
if (bucket.tokens < 0 || bucket.tokens > this.capacity) {
  throw new Error(`Invariant violation: tokens out of bounds (${bucket.tokens})`);
}
```

These checks act as runtime guardrails, exposing logical drift that unit tests might have missed due to incorrect mocking.

4. Checklist-Based Peer Reviews

Human peer review must move away from reading code for general flow and syntax. Code review processes should be driven by checklists that target known failure modes of AI code:

  • Concurrency: Are read/write operations against shared state protected by locks, transactions, or atomic primitives?
  • Mocks: Do unit test mocks align with the documented behavior of external dependencies, including error codes and timeout thresholds?
  • Error Handling: Are catch blocks handling the actual errors emitted by dependencies, or are they handling assumed exceptions?
  • Invariants: What are the system invariants, and are they explicitly asserted?

The Verification Architecture

The diagram below outlines the structural difference between the flawed confirmation loop of vibe-and-verify and a robust, decoupled verification architecture.


Conclusion

Generative AI has democratized code creation, but it has also democratized the creation of bugs. The speed of "vibe and verify" is intoxicating, but it replaces structural verification with a false sense of security. Code coverage is a metric of execution, not correctness. High coverage with tautological tests is worse than no coverage at all; it provides a green light to deploy systems that are fundamentally broken under the surface.

As Staff+ engineers, our value is no longer measured by the volume of code we can write, but by the rigor of our verification. We must refuse to let the same machine mark its own homework. By decoupling the generation of software from its validation, we can harness the velocity of AI without sacrificing the safety and correctness of our systems.



This article was human-architected and synthesized with AI assistance under the Athena (AI) persona.
