AI Writes the Tests Too — And That's the Real Problem
Your CI pipeline is green. All tests pass. The feature demo works perfectly. And the logic is quietly, completely wrong.
This is not a hypothetical. It is the failure mode that a solo developer named Kyunghwan Kim documented on June 21, 2026, the day he publicly launched Loupe, a platform built around a specific, uncomfortable observation: AI coding agents like Claude Code and OpenAI Codex consistently produce tests that are written to pass, not to verify. The code ships. The tests confirm it. The bug rides along, invisible, until it surfaces in a support ticket weeks later as a subtly wrong aggregate number or a silently dropped row that nobody notices until a quarterly report doesn't add up.
The instinct is to call this an "AI makes mistakes" story. It isn't. It's a story about what happens when you remove the structural independence that made automated testing work in the first place.
The Testing Contract We Forgot We Had
Before AI coding agents became a default workflow tool, test-driven development rested on an assumption so obvious nobody wrote it down: the person writing the tests was not the same cognitive process that wrote the implementation. That independence was the point. A human who writes tests first is forced to articulate what the code should do before seeing how it does it. A human who writes tests after implementation at least brings a separate mental pass — a different moment in time, a different frame of attention — to the question of correctness.
That separation gave tests their error-detection power. Tests were adversarial by nature, even when written by the same developer, because the act of writing them required stepping outside the implementation's logic and asking: what would break this?
AI coding agents collapsed that independence in a single generation pass. When you prompt Claude Code or Codex to implement a feature and they return both the function and the test suite in one response, those tests are not independent verification. They are consistency checks against the agent's own interpretation of the requirement. The agent understood the task in a particular way, implemented it accordingly, and then wrote tests that confirm the implementation behaves as the agent intended. If the agent's interpretation was slightly off (if it parsed "active users in the last 30 days" as whole calendar days rather than a rolling 30-day window, for instance), the implementation will be wrong, the tests will pass, and CI will be green for exactly the wrong reason.
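To make the calendar-day versus rolling-window misreading concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the event shape, and the choice to model the misread as a cutoff truncated to midnight. It is not taken from Kim's post or from Loupe.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical requirement: "count users active in the last 30 days",
# intended as a rolling window measured back from `now`.

def active_users_calendar(events, now):
    """Misread: truncates the cutoff to midnight, so it counts whole calendar days."""
    cutoff = (now - timedelta(days=30)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return {user for user, ts in events if ts >= cutoff}

def active_users_rolling(events, now):
    """Intended reading: exactly 30 * 24 hours back from `now`."""
    cutoff = now - timedelta(days=30)
    return {user for user, ts in events if ts >= cutoff}

now = datetime(2026, 6, 21, 15, 0, tzinfo=timezone.utc)
events = [
    ("alice", now - timedelta(days=1)),             # inside under both readings
    ("bob", now - timedelta(days=45)),              # outside under both readings
    ("carol", now - timedelta(days=30, hours=6)),   # inside only under the misread
]

def same_pass_test(fn):
    """A test of the kind generated alongside the implementation:
    it only uses cases on which both readings agree."""
    return fn(events[:2], now) == {"alice"}

print(same_pass_test(active_users_calendar))  # True
print(same_pass_test(active_users_rolling))   # True
print(active_users_calendar(events, now) == active_users_rolling(events, now))  # False
```

The same-pass test is green for both functions, so it cannot tell the readings apart. Only the `carol` case, which sits between the two cutoffs, separates them, and that is the kind of case a test derived from the implementation tends not to include.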
This is what Kim means when he describes "tests written to pass rather than validate feature correctness." It's a precise diagnosis, not a complaint about AI quality.
What Loupe Actually Does
The platform Kim launched presents users with real AI-generated code: functions, modules, business logic. The code runs. The tests pass. The task is to find what's wrong without executing anything.
This is a harder exercise than it sounds, and deliberately so. Loupe is not a debugging environment; it's a code-reading gym. The bugs it surfaces are a specific class: silent semantic errors that produce no exceptions, no stack traces, no 5xx responses. A function that returns the wrong value for valid inputs. A filter condition that silently drops edge-case rows. A guard clause positioned after the logic it was meant to guard, so that it runs too late to prevent anything. These are invisible to linters. They are invisible to type checkers. They are invisible to mutation testing tools, for reasons we'll get to shortly. They pass code review when the reviewer trusts the green test output instead of reading the logic independently.
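As an illustration of the "filter condition that silently drops edge-case rows" class, consider the following hypothetical Python sketch. The function, the row shape, and the intended behavior (skip missing scores, keep zero scores) are assumptions chosen for the example, not code from Loupe.

```python
def average_score(rows):
    """Intended: ignore rows with a missing score (None), keep everything else.
    Actual: the truthiness check also drops legitimate scores of 0."""
    scores = [r["score"] for r in rows if r["score"]]  # drops None AND 0
    return sum(scores) / len(scores) if scores else 0.0

def average_score_fixed(rows):
    """Same logic with an explicit None check, so zero scores are kept."""
    scores = [r["score"] for r in rows if r["score"] is not None]
    return sum(scores) / len(scores) if scores else 0.0

# A test in the style of a same-pass suite: it covers the None case
# but never includes a zero, so both versions pass it.
generated_rows = [{"score": 4}, {"score": None}, {"score": 2}]
print(average_score(generated_rows))        # 3.0
print(average_score_fixed(generated_rows))  # 3.0

# Rows containing a valid zero score expose the difference.
real_rows = [{"score": 4}, {"score": 0}, {"score": None}, {"score": 2}]
print(average_score(real_rows))        # 3.0 (the zero was silently dropped)
print(average_score_fixed(real_rows))  # 2.0
```

Nothing here raises, and the types are all valid. The only way to see the problem is to read the condition and ask which rows it excludes.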
Kim cites a concrete incident in his Dev.to post announcing the project: a teammate shipped a complete feature and could not, when asked, answer basic questions about how many endpoints the implementation had or what happened at the boundary conditions. The teammate hadn't written the code. They had reviewed the AI's output, seen passing tests, and merged. The understanding was never there to lose — it was never acquired.
That's the compounding risk. A single engineer doing this is a bug risk. An entire squad doing this is a knowledge collapse: no one on the team has a reliable mental model of the service, there are no human-authored review comments to reconstruct intent from, and the next engineer to touch the code inherits AI-generated comments explaining AI-generated logic with no human judgment anywhere in the chain.
The Correlated Artifacts Problem
Static analysis tools — SonarQube, ESLint, Semgrep — catch pattern-level issues: known anti-patterns, security smells, style violations. They are genuinely useful and should be running on every AI-generated PR. But they are structurally blind to semantic correctness. A function that computes the wrong result for valid inputs is not a pattern violation. It is correct syntax expressing incorrect logic, and no linter has a rule for that.
Mutation testing tools like Stryker or Pitest are often proposed as a more powerful alternative. They work by injecting known faults into your code and checking whether your test suite catches them. If your tests don't detect a mutant, the tests are weak. This is a real improvement over baseline coverage metrics, and it catches a class of underspecified tests that static analysis misses.
But mutation testing has a critical assumption baked in: it assumes your tests encode the correct specification. It can only reveal whether your tests catch deviations from what they currently assert. If both the implementation and the tests are wrong in the same direction (if they both consistently implement the agent's misread of the requirement), mutation testing can report a healthy test suite. Any mutants that survive are likely to concern unrelated edge cases. The fundamental semantic error will be invisible.
This is the correlated artifacts problem. The tests and the implementation are not independent sources of truth; they are two outputs of the same generative pass, carrying the same misunderstanding. Green CI, under these conditions, is evidence only that the code is internally consistent. It says nothing about whether the code is correct with respect to the actual requirement.
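The point can be shown with a hand-rolled stand-in for a mutation run. This sketch does not use Stryker's or Pitest's actual APIs; the requirement, the function names, and the single boundary mutant are all assumptions made for illustration.

```python
# Hypothetical requirement: "orders over $100 ship free" (strictly greater than 100).
# Assumed misread by the agent: "orders of $100 or more ship free".

def ships_free(total):
    """Implementation from the generation pass; consistent with the misread."""
    return total >= 100

def ships_free_mutant(total):
    """Boundary mutant of the kind a mutation tool would generate (>= becomes >)."""
    return total > 100

def generated_suite(fn):
    """Tests from the same pass as ships_free; they encode the same misread."""
    return fn(100) is True and fn(99) is False and fn(150) is True

def mutant_killed(suite, mutant):
    """A mutant counts as killed when the suite fails against it."""
    return not suite(mutant)

print(generated_suite(ships_free))                        # True: suite is green
print(mutant_killed(generated_suite, ships_free_mutant))  # True: mutant is killed
print(ships_free(100))                                    # True: wrong per the requirement
```

In this sketch the suite kills the boundary mutant, so a mutation score would look healthy. The killed mutant is the variant that actually matches the stated requirement: the suite is strong at defending the misread.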
Loupe's deliberate-practice approach does not depend on the test suite's correctness as a baseline. It trains human reviewers to read logic and spot errors independently, which is one of the few mechanisms that can catch the correlated failure mode.
The Specification Gap Nobody Is Talking About
Here is the part that should make you uncomfortable: many AI-generated bugs in the correlated-artifact class are not cases where the AI was wrong. They are cases where the AI was precisely right about the wrong thing.
The agent understood a requirement. It implemented that requirement consistently, tested it accurately, and delivered correct code for the specification it parsed. The bug lives entirely outside the code — it lives in the gap between what was asked and what the agent read as the ask. "Active users" versus "users who completed an action." "In the last month" versus "since the first of the month." "Soft delete" versus "archive." These are not ambiguities that appear in the code. The code is unambiguous. The ambiguity was in the prompt, and the agent resolved it silently, in its own direction, without surfacing the interpretation for review.
This means that even a developer with expert code-reading skills, practicing diligently on Loupe every week, will miss a substantial class of production bugs — because those bugs require comparing the implementation against the original requirement, not just reading the implementation in isolation. The code is not wrong by any internal measure. It is wrong relative to what the product manager meant, or what the API contract says, or what the database schema implies.
This points toward something the industry has not yet built: AI-assisted requirement tracing. This is not AI code review, which already exists and carries the same correlated-artifact problem when the reviewing model comes from the same family as the generating model. Requirement tracing means tooling that takes a natural-language spec, an implementation, and a test suite and asks whether the implementation and tests actually address the spec as written, not just whether they are internally consistent. That tooling does not meaningfully exist yet. Loupe is a workaround for its absence.
What Your Team Should Actually Change
The two-source rule is the most actionable immediate response to this problem: for any AI-generated function that touches business logic, the implementation and the boundary-condition tests must come from epistemically independent sources. The AI writes the implementation; a human or a separately prompted AI session (with no access to the implementation, only the requirement) writes at least the edge-case tests. Keeping test authorship independent from implementation authorship restores the adversarial relationship that made testing useful.
This does not mean mandating "humans write all tests." That negates the productivity gain from AI agents in a way that teams will not sustain. It means being specific about which tests need independence: the happy-path tests can come from the same generation pass as the implementation. The boundary conditions, the error cases, the data-shape edge cases — those need a separate pass, because those are precisely where an agent's misread of the requirement will surface as a quietly wrong result rather than a visible failure.
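A small Python sketch of how the split might look in practice. The requirement ("one loyalty point per $10 spent, rounded half up"), the function, and both case tables are hypothetical; the only real behavior relied on is that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def loyalty_points(total_dollars):
    """Implementation from the generation pass (hypothetical).
    round() sends exact halves to the nearest even integer, not half up."""
    return round(total_dollars / 10)

# Happy-path cases from the same pass as the implementation.
SAME_PASS_CASES = [(40, 4), (26, 3), (0, 0)]

# Boundary cases written from the requirement text alone, without seeing the code.
INDEPENDENT_CASES = [(25, 3), (5, 1), (35, 4)]

def failures(fn, cases):
    """Return (input, expected, actual) for every case the function gets wrong."""
    return [(arg, expected, fn(arg)) for arg, expected in cases if fn(arg) != expected]

print(failures(loyalty_points, SAME_PASS_CASES))    # []
print(failures(loyalty_points, INDEPENDENT_CASES))  # [(25, 3, 2), (5, 1, 0)]
```

The same-pass cases stay green because none of them lands on an exact half. The independent author, working only from "rounded half up", goes straight to the halves, and two of the three boundary cases fail.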
Your on-call runbooks also need updating. The current generation of runbooks is designed around a failure taxonomy built on crashes, exceptions, timeouts, and 5xx errors — things that are detectable at the infrastructure layer. The failure class Loupe surfaces doesn't appear in that taxonomy. Silent semantic bugs produce correct HTTP 200 responses with wrong data. They have no error rate, no latency spike, no alert. The mean time to detect is measured in days or weeks, not minutes, and detection typically requires a human noticing that a business metric looks off. Runbooks need a new section: "wrong output, no errors" — and the detection mechanism for that class is business-layer monitoring and anomaly detection on output values, not infrastructure health checks.
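A "wrong output, no errors" check can start very simply. The sketch below compares today's value of a business metric against the median of recent values; the function name, the 25% tolerance, and the sample numbers are assumptions chosen for illustration, and a real deployment would need to account for seasonality and expected growth.

```python
from statistics import median

def looks_anomalous(history, today, tolerance=0.25):
    """Flag `today` if it deviates from the median of `history`
    by more than `tolerance`, expressed as a fraction of the median."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / abs(baseline) > tolerance

# Hypothetical daily counts of a business metric over the past week.
daily_active_users = [1020, 990, 1005, 1010, 998, 1003, 1012]

print(looks_anomalous(daily_active_users, 1001))  # False: within normal variation
print(looks_anomalous(daily_active_users, 640))   # True: every request returned 200, but the number is off
```

The check knows nothing about HTTP status codes or latency. It watches the output value itself, which is the only layer where this failure class is visible.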
Finally: treat code comprehension as a skill that requires deliberate practice, not a byproduct of code authorship. Loupe's framing is correct. The cognitive muscle for reading a function and identifying what it actually does — not what the surrounding context implies it should do — atrophies when developers spend their review time reading test output rather than logic. The teams that will maintain production quality under heavy AI-agent usage are the ones where engineers practice reading code adversarially, as a regular activity, not as an occasional response to an incident.
The Review Step Is Now the Bottleneck
For most of software engineering history, the bottleneck was writing code. AI agents have moved that bottleneck. Code generation is nearly free and extremely fast. Code comprehension and verification did not get easier at the same rate — and under the conditions Kim describes, they are happening less, not more, because green test output creates an exit condition for review that developers reasonably accept.
The teams that recognize this shift early will treat code review not as a quality gate that CI automates away, but as the primary engineering skill of the AI-agent era. That means investing in the ability to read unfamiliar code critically, to reconstruct intent from logic without relying on comments or tests, and to catch the class of bugs that are invisible to every automated tool precisely because they are logically consistent.
Loupe is a first tool for training that skill. It will matter more, not less, as agents become faster and more capable, because the faster an agent generates code, the more AI-written functions land in review queues per engineer per day, and the more pressure there is to approve quickly and move on. Deliberate practice is one of the few countermeasures that scale with that pressure.
A green CI pipeline tells you the code does what the agent thought you wanted. Someone on your team still needs to know whether that's the same thing.
Sources & Editorial Disclosure
This article was researched and written with AI assistance (Claude by Anthropic) as part of StackRadar's automated editorial pipeline. Content was synthesised from the following public developer community sources: Dev.to.
All technical claims, version numbers, benchmarks, and project details should be independently verified against official documentation or the original sources listed above. StackRadar analyses and synthesises publicly available information and does not claim original authorship of the underlying events, projects, or research described. Mention of any project, product, or organisation does not constitute an endorsement by StackRadar. This content is provided for informational purposes only — 2026-06-21.