The 100-Hour Audit: LLM Code Is Now a Supply Chain Problem

One hundred hours. That is what it cost Joey Hess — maintainer of git-annex, one of the most carefully engineered tools in the Haskell ecosystem — to answer a question that should have a trivial answer: does my project's dependency tree contain LLM-generated code?

It doesn't have a trivial answer. It took a month of sustained effort, produced a public tracking page at git-annex.branchable.com/no_llm_code/, and surfaced findings that reframe the entire conversation around AI-assisted development. The debate until now has been about whether LLM-generated code is good or bad. Hess's audit makes that debate feel like arguing about the nutritional content of food while ignoring whether the supply chain is adulterated.

The Landscape Before This Audit

The open source supply chain conversation has, for the last several years, been dominated by three concerns: license compliance, dependency vulnerability scanning, and provenance verification after the SolarWinds and XZ Utils incidents. Tools like syft, cyclonedx-gomod, and trivy now slot into CI pipelines almost automatically. SBOM generation went from niche to near-mandatory in regulated industries. The implicit assumption underlying all of this tooling is that upstream maintainer reputation — earned over years of public commit history — is a credible trust signal.

That assumption predates the current generation of LLM coding tools. It was reasonable when the main threat model was a compromised maintainer deliberately injecting malicious code. That kind of attack leaves traces: unusual commits, new network calls, obfuscated logic. Security tooling and community review are reasonably good at catching it.

LLM-generated code is a different threat profile entirely. It isn't malicious; it's just unreliable, potentially copyright-entangled, and increasingly invisible behind normal-looking commit workflows. When a maintainer uses an LLM to draft a refactor, lightly edits the output, and commits it with a reasonable message, the result is indistinguishable from any other commit. The heuristics that catch supply chain attacks don't apply.

The Software Freedom Conservancy has already declined to address LLM-generated code in open source. Hess is publicly skeptical the FSF will act either. In the absence of institutional response, one maintainer decided to handle it himself — and the result is the most documented case of a serious engineer treating LLM provenance as a first-class dependency hygiene concern, on par with license compliance.

What the Audit Actually Found

The most striking individual finding was a commit to one dependency that paired a 1,489-line commit message with 10,000 lines of changes to a 26,000-line codebase. Let that ratio sink in: the change was equivalent to roughly 38% of the project's size (10,000 / 26,000 lines), and the message describing it ran to nearly 1,500 lines. The commit message was incoherent, full of the kind of verbose, structurally repetitive, context-free prose that is a consistent LLM tell when it appears in technical writing.

That single commit is alarming on its face. But the more dangerous pattern Hess documented is subtler: silent reversions. A maintainer ships a large LLM-generated change. Discovers, presumably, that it's broken or legally risky or simply wrong. Rolls it back quietly in the next release. From the outside (from your package-lock.json or cabal.project.freeze or go.sum) this looks like two normal releases. Nothing flagged, no CVE filed, no advisory issued.

The dangerous window is the release that contained the LLM code. If your dependency lockfile pinned that version — which is exactly what lockfiles are for — you carried whatever was in that commit indefinitely. No downstream vulnerability scanner will ever flag it, because the upstream maintainer's implicit acknowledgment that something was wrong was a version bump, not a disclosure. This is the most underreported finding in the audit.
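The silent-revert pattern can be approximated as a triage check. The following is a minimal sketch, not Hess's method: it assumes you have already collected per-release line counts for a dependency, and the package name, version numbers, and thresholds are all invented for illustration. Matching line counts do not prove that the same code was removed, so a hit is only a prompt for manual review.

```python
from dataclasses import dataclass


@dataclass
class ReleaseDiff:
    version: str
    added: int     # lines added relative to the previous release
    deleted: int   # lines deleted relative to the previous release


def suspicious_releases(history, min_added=5_000, revert_fraction=0.5):
    """Return versions whose large additions were mostly deleted one release later.

    Both thresholds are illustrative assumptions, not values from the audit.
    """
    flagged = []
    for current, following in zip(history, history[1:]):
        big_change = current.added >= min_added
        mostly_removed = following.deleted >= revert_fraction * current.added
        if big_change and mostly_removed:
            flagged.append(current.version)
    return flagged


# Hypothetical release history for a single dependency.
history = [
    ReleaseDiff("1.4.0", added=300, deleted=120),
    ReleaseDiff("1.5.0", added=10_000, deleted=400),  # large change ships
    ReleaseDiff("1.5.1", added=200, deleted=9_000),   # quietly rolled back
]

# What a lockfile would record for this project (hypothetical pin).
pinned = {"example-dep": "1.5.0"}

flagged = suspicious_releases(history)
exposed = [name for name, version in pinned.items() if version in flagged]
print(flagged, exposed)
```

A project pinned to the middle release is the one carrying the change, which is why the check compares flagged versions against the lockfile rather than against the latest release.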

Hess also found at least one case where an LLM was prompted to copy code from another project. The copyright exposure here was avoided by luck, not design. The LLM happened not to reproduce the original closely enough to constitute infringement, but the intent was there, and the result landed in a transitive dependency that Hess had no direct relationship with.

The fourmolu formatter was among the flagged packages. This is important context: fourmolu is not a fringe package. It is a widely-used, well-regarded Haskell formatting tool — the ecosystem's rough equivalent of prettier or gofmt. If it can carry undisclosed LLM commits, so can any ecosystem's equivalent of eslint or black. The comfortable mental model — "risky stuff lives in obscure packages, my core toolchain is fine" — does not survive contact with this audit.

Why Detection Is Harder Than It Looks

The signals Hess used to flag packages — incoherent commit messages, giant diffs, reversions in subsequent releases — are real and useful. They are also insufficient. They catch the careless cases. They miss everything that passes a basic editorial filter.
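The first two signals lend themselves to a mechanical first pass. Here is a minimal sketch, assuming commit statistics have already been extracted from the repository (for example from `git log --numstat` output); the `CommitStats` type and both thresholds are hypothetical choices for illustration, not values Hess used.

```python
from dataclasses import dataclass


@dataclass
class CommitStats:
    sha: str
    message_lines: int   # lines in the commit message
    lines_changed: int   # added plus deleted lines in the diff


def flag_commit(commit, codebase_lines,
                max_message_lines=200, max_churn_fraction=0.25):
    """Return the reasons (possibly none) a commit deserves manual review.

    Both thresholds are illustrative assumptions, not values from the audit.
    """
    reasons = []
    if commit.message_lines > max_message_lines:
        reasons.append(f"message is {commit.message_lines} lines long")
    if codebase_lines > 0:
        churn = commit.lines_changed / codebase_lines
        if churn > max_churn_fraction:
            reasons.append(f"diff is {churn:.0%} of the codebase size")
    return reasons


# Figures from the commit described above: a 1,489-line message and
# 10,000 changed lines against a roughly 26,000-line codebase.
suspect = CommitStats(sha="(unknown)", message_lines=1489, lines_changed=10_000)
ordinary = CommitStats(sha="(unknown)", message_lines=6, lines_changed=120)

print(flag_commit(suspect, codebase_lines=26_000))
print(flag_commit(ordinary, codebase_lines=26_000))
```

A filter like this only narrows the list a human has to read. It does nothing for the cases discussed next, where the commit looks ordinary.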

Automated LLM detection tools like GPTZero or Originality.ai are trained primarily on prose. Their performance on code is poor, and it degrades further after even light human editing passes. The code that's hardest to detect is exactly the code most likely to survive into production: a competent developer used an LLM to scaffold something, reviewed it, cleaned it up, and committed it with a coherent message. That commit is invisible to every heuristic currently available.

This creates a particularly uncomfortable asymmetry for teams that care about correctness and copyright. The LLM-generated code most likely to cause problems — legally ambiguous training data, subtle logic errors that survive review because they're plausible — is also the least detectable. The easy cases Hess found are the easy cases precisely because the developers in question didn't bother to cover their tracks.

The copyright-adjacent case compounds this. When an LLM is prompted to copy or adapt code from another project, the result may or may not be infringing depending on factors — degree of transformation, jurisdiction, training data licensing — that are currently unresolved in courts. What's certain is that your legal team cannot assess that risk if the provenance is undisclosed. The SBOM you generate with cyclonedx-gomod accurately captures license identifiers. It has no field for "LLM tool was prompted against third-party source." That dimension simply doesn't exist in any current package manifest standard.

The Expert Framing That Changes Everything

What Hess actually demonstrated — even if this wasn't the stated intent — is that LLM provenance is now a dependency hygiene concern in the same category as license compliance. Not the same severity in all cases, but the same kind of concern: something that has legal, correctness, and auditability implications, that upstreams can fail to disclose, and that your toolchain has no mechanism to surface.

That framing shift matters more than the specific bad commits he found. The question is no longer "is AI-assisted coding good or bad?" It's "how do you audit something you can't detect reliably?" And the honest answer, right now, is: with difficulty, manually, and at a cost that scales with the size of your dependency graph.

Hess could do this because git-annex has a relatively bounded Haskell dependency tree. A typical Node.js service with hundreds of transitive dependencies — many of them small packages maintained by individuals who may or may not document their tooling — faces a combinatorially harder problem. The signal-to-noise ratio in a large JavaScript dependency graph, where commit history quality varies enormously, makes the kind of systematic audit Hess conducted practically infeasible at scale.

The only current alternative to manual audit is social trust: relying on maintainer reputation and public commit history, which is exactly what this audit erodes. Neither SPDX nor CycloneDX has a standard field for disclosing that ordinary source code was LLM-generated; the AI-related parts of those specifications (SPDX 3.0's AI profile, CycloneDX's ML-BOM support) describe machine learning models and datasets as components, not the authorship of a library's commits. OSS communities experimenting with AI disclosure policies have stalled because there's no enforcement mechanism and no standard schema to express it in a package manifest. Until toolchains enforce disclosure, manual audit or blanket policy exclusion are the only credible options for teams with hard compliance requirements.
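To make the schema gap concrete, here is a sketch of what a disclosure could look like if a team recorded it in CycloneDX's generic `properties` name/value list on a component. The `example:ai-provenance:` namespace, the property names, and the package names are hypothetical; no standard defines them, which is exactly the problem.

```python
import json

# Hypothetical property namespace; not part of CycloneDX or any other standard.
PREFIX = "example:ai-provenance:"


def component(name, version, **provenance):
    """Build a CycloneDX-style component, optionally with disclosure properties."""
    entry = {"type": "library", "name": name, "version": version}
    props = [{"name": PREFIX + key, "value": str(value)}
             for key, value in provenance.items()]
    if props:
        entry["properties"] = props
    return entry


def without_disclosure(bom):
    """List components that carry no property in the hypothetical namespace."""
    missing = []
    for comp in bom.get("components", []):
        names = [p["name"] for p in comp.get("properties", [])]
        if not any(n.startswith(PREFIX) for n in names):
            missing.append(f'{comp["name"]}@{comp["version"]}')
    return missing


bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [
        component("example-dep", "1.5.0", disclosed="true", tool="unspecified"),
        component("other-dep", "2.0.0"),
    ],
}

print(json.dumps(bom, indent=2))
print(without_disclosure(bom))
```

Note that a missing property means "unknown", not "no LLM involvement". Without an upstream obligation to fill the field in, the second list will contain nearly every dependency.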

What Development Teams Should Actually Do

The practical response depends on your risk tolerance and compliance obligations, but several actions are available right now.

For teams in regulated industries — finance, healthcare, defense — the gap is already audit-relevant. You can demonstrate license compliance. You cannot currently demonstrate that no AI tool with ambiguous training data contributed to your transitive dependency graph. That gap may not matter today in an audit, but the direction of travel in regulatory frameworks is toward more provenance specificity, not less. Getting ahead of it means starting to define what your policy would be, even if you can't enforce it yet.

For everyone else, the silent revert pattern is the most actionable finding. Your current dependency bump workflow almost certainly reviews changes at the version level: a diff of package.json or go.mod or Cargo.toml, maybe a glance at a changelog. That workflow will miss a release that contained 10,000 lines of LLM-generated code and was followed by a cleanup release that rolled most of it back. Adding commit-level review to significant version bumps in critical dependencies — not every transitive dep, but the ones where correctness failures have real consequences — is a tractable change to existing practice.
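One way to keep that review tractable is to queue only the bumps that touch dependencies you have marked as critical. The sketch below assumes you can already read old and new pins out of your lockfile as name-to-version mappings; the package names and the `critical` set are invented for illustration.

```python
def bumps_needing_review(old_pins, new_pins, critical):
    """Return (name, old_version, new_version) for critical deps whose pin changed.

    A dependency that is newly added has old_version None.
    """
    queue = []
    for name in sorted(critical):
        before, after = old_pins.get(name), new_pins.get(name)
        if after is not None and after != before:
            queue.append((name, before, after))
    return queue


# Hypothetical pins before and after a routine dependency update.
old_pins = {"example-formatter": "0.14.0", "example-parser": "2.1.0", "tiny-util": "1.0.0"}
new_pins = {"example-formatter": "0.15.0", "example-parser": "2.1.0", "tiny-util": "1.0.1"}

# Dependencies where a correctness failure would have real consequences.
critical = {"example-formatter", "example-parser"}

for name, before, after in bumps_needing_review(old_pins, new_pins, critical):
    print(f"{name}: review upstream commits between {before} and {after}")
```

For each queued entry, running `git log --stat` over the matching range of upstream tags gives the commit-level view that a changelog skims over. How versions map to tag names varies by project, so that step stays manual here.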

The fourmolu finding should update your priors about which packages are in scope. Don't scope this to packages with poor reputations or obscure maintainers. Widely-used, trusted tools are in scope. Apply that lens to your own dependency graph: which packages, if they turned out to contain silent LLM changes, would actually matter to you?

Watch SPDX and CycloneDX for AI-provenance proposals. Both standards are active and responsive to ecosystem needs. The lack of a field for LLM-authored source code appears to be a gap, not a decision: it is more likely unspecified because no proposal has gathered enough ecosystem support than because the idea was considered and rejected. If your organization uses SBOMs for compliance, contributing to that conversation has leverage disproportionate to the effort.

For maintainers specifically: the disclosure question is yours to answer before a standard forces it. Hess's tracking page is a model — public, specific, and clear about methodology. Maintainers who proactively document their AI tool usage create exactly the kind of trust signal that Hess's audit was trying to recover.

The Audit Is Already Paying It Forward

Hess found no positive engineering benefit from the LLM-generated code he reviewed. What he found was a degraded view of dependency quality that now influences his future adoption decisions. That framing — "this changes which packages I trust going forward, not just which ones I use now" — is the mature response to what is genuinely a new class of supply chain problem.

The hundred hours he spent are a cost already paid, and they now benefit every maintainer who reads his tracking page and every team that uses his findings to update their dependency review process. The work won't scale, since no individual can audit the Node ecosystem or PyPI, but the model scales. What Hess did is demonstrate that LLM provenance auditing is a real practice with real methodology, not a theoretical concern.

The open source supply chain has always depended on a chain of trust that is longer than anyone can fully verify. What's changed is that a new class of unverifiable input — LLM-generated code, with uncertain training data provenance, produced under no disclosure obligation — is entering that chain at every link simultaneously. The tooling hasn't caught up. The standards haven't caught up. The institutions have declined to act. The 100-hour audit is what fills that gap right now, and it is inadequate to the scale of the problem. That inadequacy is itself the finding worth acting on.


Sources & Editorial Disclosure

This article was researched and written with AI assistance (Claude by Anthropic) as part of StackRadar's automated editorial pipeline. Content was synthesised from the following public developer community sources: Lobste.rs · ArXiv CS · Dev.to.

All technical claims, version numbers, benchmarks, and project details should be independently verified against official documentation or the original sources listed above. StackRadar analyses and synthesises publicly available information and does not claim original authorship of the underlying events, projects, or research described. Mention of any project, product, or organisation does not constitute an endorsement by StackRadar. This content is provided for informational purposes only — 2026-07-03.