← Back to VersionGopher
VersionGopher Blog

VersionGopher Looked Back

How a new package-risk scanner found vulnerabilities in the AI development environment building it

Published May 25, 2026 · Public page · Accessed 2026-06-01

I was adding a new capability to VersionGopher: scanning for vulnerable npm and PyPI packages.

Exciting, I know. Another bullet point on a roadmap. Except this one actually matters, because package-level supply-chain risk has quietly become one of the more serious problems in software — and almost nobody treats it that way because the fix usually involves looking at dependency trees, which is the intellectual equivalent of reading the phone book in a language you almost but don't quite speak.

Modern applications are not just the code in a repository. They are the package managers, virtual environments, dependency caches, build tools, bootstrap scripts, transitive dependencies, and — increasingly — the AI coding agents that stitch it all together while everyone politely agrees the development environment is "temporary" and therefore not their problem. It's Schrödinger's attack surface: it doesn't exist until it does, at which point it's been there for six months.

That is the environment VersionGopher is built to inspect.

Most environments look clean from a distance. The app runs. The tests pass. The agent says the work is done. Everyone high-fives, merges to main, and moves on with their lives. VersionGopher is built for the part after that: the part where you actually look. The glitch in the Matrix, the tear in the simulation, the little technical artifact that gives away what's actually installed, cached, bundled, or quietly rotting underneath the surface.

While building this package-risk capability, VersionGopher found exactly that kind of glitch — inside the development environment being used to build the feature.

I want to be clear: this was not some contrived test fixture or vulnerable-by-design lab where I already knew the answer before clicking scan. It was not a handpicked demo folder designed to make the tool look impressive at a conference booth. As it turns out, modern development environments do not need any assistance generating their own security findings. They're remarkably self-sufficient in that regard. This was the actual agent-created development environment — the packages, virtual environments, caches, and tooling that got pulled down during normal development. The agent created the environment, VersionGopher scanned it, and the findings were right there, like a cat sitting next to something it had just knocked off the counter, completely unrepentant.

That's when the feature stopped being an abstract "package scanning" capability and became a practical feedback loop. VersionGopher was doing what it was built to do: looking past the surface presentation of a system and finding the technical tells that reveal what's really there.

Think of it as the software equivalent of noticing the sixth finger in an AI-generated image. Everything looks plausible until one small artifact gives it away. Except here the sixth finger has a CVE number.

Agents move fast, which is both the point and the problem

I use coding agents because they can move through a lot of work quickly. They patch files, update dependencies, run tests, change UI language, modify setup scripts, and grind through repetitive implementation work that would otherwise consume hours of my life I will never get back. Useful. Genuinely useful.

Also: speed and correctness are not the same thing, and confusing them is how you end up deploying exciting new vulnerabilities at unprecedented velocity.

I don't just let an agent make changes and assume the result is correct because the tests pass. When there's a meaningful change, I spot-check the dashboard, click into strange rows, compare what the tool says against what the environment should actually look like, and ask whether the result is technically true but operationally misleading. It's the kind of work that never makes it into a demo reel because "person squints at dashboard for ten minutes" doesn't drive engagement metrics.

A lot of real issues live in those awkward gaps where a machine can say "yes, this is working" while a human looks at the screen and thinks "that is not quite what we meant." The machine isn't lying. It's just answering a different question than the one that matters.

That review loop still matters. Maybe someday agents will replicate that kind of skeptical judgment. Today they mostly replicate confidence, which is a very different thing. AI confidence is a twenty-three-year-old in skinny jeans and dress sneakers walking around the Pentagon telling four-star generals what to do without knowing where the bathrooms are. It's not that the confidence is always wrong — sometimes the kid has a point — but the ratio of certainty to actual knowledge is so wildly miscalibrated that you need to gut-check it constantly or people end up making decisions based on vibes with a sans-serif font. Skepticism requires effort, context, and the willingness to look at something that appears fine and say "I don't believe you." Confidence requires nothing. Confidence is the factory setting. Every agent ships with it pre-installed and there's no uninstall option, which means the human in the loop occasionally has to be the one who punches confidence in the stomach and asks it to show its work.

The useful pattern is: human skepticism + agent execution + a tool that preserves evidence. The agent can move fast. Somebody still has to ask the annoying questions.

The first problem wasn't a vulnerable package. It was a wording problem.

The first issue VersionGopher surfaced was actually a UI and triage problem, which is fitting because the most dangerous bugs in security tooling are the ones in the communication layer.

The dashboard showed package risk, but the malicious-package filter was disabled. At first glance, that looked wrong. If there are package findings, why is the malicious-package view greyed out? Did something break? Is the scanner confused? Does it need a hug?

The answer was that the findings were vulnerability advisories, not known malicious-package advisories. That distinction is not cosmetic. It's load-bearing.

A vulnerable package is not the same thing as a malicious package. A vulnerable dependency means you probably need an upgrade, an exposure analysis, or a compensating control. A malicious package means you probably need a provenance review, containment thinking, possible credential rotation, and a large coffee — possibly Irish. "Bump the version" and "assume you've been compromised" are different playbooks. Confusing them is how you get security teams that respond to everything with the same dead-eyed shrug, which is not the operational posture you want when someone actually slips a backdoor into your dependency tree.

So the UI had to say what it actually meant. The agent fixed that. Package advisories became their own category. Malicious packages stayed reserved for actual malicious-package findings. The dashboard language was changed so it didn't accidentally imply that ordinary vulnerable dependencies were malware, which is the kind of wording mistake that gets screenshots posted in Slack with increasingly creative emoji reactions.

This may sound like a small wording change, but security tools do real damage when they blur categories. If every finding sounds like a five-alarm fire, operators stop reading the alerts, and at that point the attackers don't even need to be clever. They can just wait for alert fatigue to finish the job for them. Alert fatigue is a more reliable exploit than most zero-days.

VersionGopher needs to be precise about what kind of evidence it has. If the tool cries wolf, the village gets eaten. This is not a metaphor I'm being precious about; this is how actual incident response degrades in production.

Then VersionGopher found a vulnerable package in a cache

The first concrete package finding was a Python package sitting in a package-manager cache, minding its own business, harming no one — allegedly.

VersionGopher identified it and matched it to a real advisory. Good. That part worked. But the path context was the real story. The package existed on disk, but it wasn't clearly an active application dependency. It was a cached artifact — the kind of thing that accumulates like geological sediment in any environment that's been alive longer than a week. Everyone has that one drawer in their kitchen full of takeout menus, dead batteries, and a phone charger for a phone they haven't owned since 2019. Package caches are that drawer, except the dead batteries occasionally have CVEs.

That means the package identity was strong, but runtime exposure was not confirmed.

The wrong interpretation: "The application is definitely running this vulnerable package. Sound the alarm. Wake up the on-call. Cancel brunch."

The better interpretation: "This vulnerable package exists in a package cache. Verify whether that cache feeds an active environment or build path before you ruin anyone's weekend."

That difference matters. A vulnerable package in a cache is not the same as a vulnerable package imported by a running service. On the other hand, caches feed builds, builds feed deployments, and old artifacts have a wonderful habit of resurrecting themselves at the worst possible moment, like a horror movie villain who wasn't actually dead despite everyone clearly seeing them fall off the building. So it's not something to ignore, either.

The UI was improved to reflect this distinction. The finding now separates confidence in package identity from confidence in runtime exposure. It explains that a cache artifact is real evidence, but not the same as proof that the vulnerable package is currently active.

That's a core VersionGopher principle:

Path context is security context.

The same package metadata means different things depending on whether it's in a cache, a virtual environment, a toolchain, an archive, or a production application directory. Flattening all of that into one red bucket is how vulnerability management becomes noise — and noise is how real findings go unnoticed, which is how you end up explaining to your CISO why the vulnerability scanner found it six months ago and nobody did anything because everything was always red.

Then it found a real installed dependency

The next finding was more direct.

VersionGopher found a vulnerable SSH-related Python dependency inside the project's web application virtual environment. That was not a cache artifact. That was an installed package, in the development environment, right there in site-packages, paying rent.

The agent triaged the usage and found that the dependency was tied to an optional remote-scanning path rather than every normal execution path. Good news: it wasn't loaded on every startup. Less good news: it was still installed, and "optional" has a way of becoming "always on" the moment someone enables a feature flag without reading the release notes. Which is to say: always. In the history of software, no feature flag has ever remained off. Feature flags are just "on" buttons with a delay timer and a liability waiver.

This is the kind of distinction I want the product to make. "Installed somewhere" is useful evidence. "Loaded by the main runtime every time the app starts" is different evidence. "Only used if an optional feature is enabled" is different again. These are not the same risk, and a tool that treats them identically is a tool that trains its operators to ignore it.

The fix was straightforward: the dependency was moved to a safer major version, the requirements were updated, the operator guidance was updated, the compatibility caveat was documented, and a guard was added so the project doesn't quietly drift back to the older vulnerable range later. Because that's what dependencies do. They drift. Slowly, silently, like a boat you forgot to tie to the dock.

One practical caveat: the newer version of the dependency removes old SHA-1-era behavior, which can break truly legacy systems. That's fine. If someone needs a legacy exception, that should be an explicit decision made with eyes open and a comment in the code explaining why. It should not be silently preserved as the default security posture because some ancient endpoint somewhere might get its feelings hurt. Legacy compatibility is a valid engineering choice. Accidental legacy compatibility is a vulnerability with extra steps.

This is the loop I want VersionGopher to drive:

  1. Find the evidence
  2. Classify the context
  3. Determine the exposure
  4. Fix the dependency
  5. Verify the old artifact is gone
  6. Add a guard so it doesn't come back

That is significantly more useful than dumping a CVE count on the screen and calling it a day, which is the current state of the art for a depressing number of "security" tools.

Then it found vulnerable tooling

The most interesting finding was not in the application virtual environment at all.

VersionGopher found a vulnerable package inside the local Python toolchain used by the development environment. Not part of the deployed application. Part of the tooling used to build, install, and test the application.

"But that's not the app." Sure. And the contractor who has the keys to the building isn't technically an employee, either. Feel safer?

Developer environments are part of the supply chain. CI runners are part of the supply chain. Package managers are part of the supply chain. Local bootstrap tools are part of the supply chain. AI coding agents that create environments and install dependencies are also part of the supply chain, even if everyone would prefer not to think about that because it complicates the "the agent saved me forty minutes" narrative. The agent saved you forty minutes and installed a vulnerable dependency in your toolchain. Both things are true. You're welcome.

After being reprimanded by VersionGopher and me, the agent fixed the active installed package, verified that the old metadata was gone, and updated the bootstrap scripts so stale tooling wouldn't keep getting recreated. Then it slammed its door and wouldn't talk to me for a bit.

Coding agents are adolescents right now. They can be difficult, sloppy, and wildly overconfident about things they just learned five minutes ago. That means parenting is critically important. You have to set boundaries, check their work, and occasionally ground them when they install vulnerable packages in your toolchain and act like nothing happened. Free-range agents might be popular in the fruity parts of California where someone's Series A deck says "autonomous AI workflow" in 72-point font, but not here. Here we supervise.

That last part — updating the bootstrap scripts — is important. A lot of dependency "fixes" aren't real fixes because the next setup script recreates the vulnerable environment five minutes later. You haven't remediated the issue. You've cleaned one copy of it and left the mold in the wall. Next time someone runs ./setup.sh or pip install -e . or whatever bootstrap ritual the project uses, congratulations — you've got a fresh copy of the vulnerability, still warm from the oven.

VersionGopher helped expose that.

The agent also noticed a remaining inactive seed artifact bundled with the interpreter itself. Not the active installed package anymore, but potentially capable of reintroducing the older version under certain bootstrap conditions. That's another classification VersionGopher should learn to express clearly:

Bundled installer seed, not active installation.

Again — path context matters. It's not enough to know that a version exists somewhere on disk. You need to know what role that file plays. Is it an active dependency? A cached artifact? A seed wheel that ships with the interpreter? A leftover from last Tuesday that nobody cleaned up because rm -rf felt too permanent?

That's the glitch in the Matrix. Not the dramatic kind with a black cat walking past twice and Keanu looking concerned. The boring kind that actually matters: stale metadata, bundled wheels, dependency residue, and setup paths that quietly recreate the problem you thought you fixed. The boring glitches are the ones that bite, because nobody makes an action movie about pip cache management.

Why I'm building the package-risk feature

The important thing here is not that VersionGopher found a few package advisories. Plenty of tools can match package names and versions against advisory databases. That's table stakes. That's grep with a subscription fee.

The important thing is that VersionGopher helped create a useful operational loop.

It found package evidence in three different layers:

LayerWhat was foundExposure status
Package cacheVulnerable package artifact presentRuntime usage not confirmed
Application venvVulnerable installed dependencyTied to a specific optional feature path
Development toolchainVulnerable installed tooling packageRelevant to build and setup behavior

Those are not the same thing. Treating them like the same thing makes a dashboard look busy and important, but it doesn't help anyone make a good decision. It just helps someone make a fast decision, which — see above — is not the same thing.

A useful scanner should help answer: What was found, where was it found, how confident is the identity, how confident is the exposure, and what should happen next. It should preserve raw evidence, but it should also give the operator enough interpretation to avoid chasing ghosts or — worse — ignoring real problems because they're buried in a pile of ghosts.

That's the difference between vulnerability matching and triage. Matching is a solved problem. Triage is where tools earn their keep or become expensive noise generators.

The agent didn't magically discover and fix everything by itself

I want to be clear about this, because it's easy to tell the story wrong, and the wrong version of this story is exactly the kind of thing that ends up on a LinkedIn post with a rocket emoji and a paragraph that starts with "I'm humbled to announce." Nobody is humbled. You are not humbled. The agent is not humbled. The agent doesn't have feelings, and if it did, humility would not be the one it shipped with.

The agent did useful work. It made code changes, updated dependencies, adjusted UI language, modified setup scripts, and ran validation. That's exactly the kind of work agents are good at: well-defined, scope-limited, grindable. Think of it as a very fast intern who never complains, never eats your lunch from the fridge, and also never once asks "are you sure?" — which is both the best and worst thing about it.

But the important discoveries came out of the review loop.

I was looking at the results. I was questioning the dashboard. I was challenging ambiguous language. I was asking whether a finding was a cache artifact, an active dependency, or a toolchain issue. I was pushing the agent to explain why something was present and whether the fix actually removed the old artifact or merely made the current screen look cleaner. "The dashboard looks right now" and "the problem is actually fixed" are different assertions, and I wanted the second one. Agents love the first assertion. It's their favorite. "Looks good to me" is not a security posture, it's a pull request comment from someone who opened the diff for three seconds.

That's the part agents don't reliably replicate today. They're good at doing what they're told. They are not always good at knowing what should bother them. Agents have no sense of unease. Have you ever watched a dog wearing shoes for the first time? That cautious, confused, something-is-wrong-but-I-can't-articulate-it energy? Agents don't have that. They don't get that nagging feeling when something looks right but doesn't feel right. They don't stare at a row in a table and think "wait, why is that there?" They just move to the next task with the serene confidence of someone who has never once been haunted by a production incident at 2 AM and frankly doesn't understand what all the fuss is about.

The power is in the combination: human skepticism, agent execution, and tooling that makes the evidence visible. Take away any one of those three and the loop breaks.

The pattern

The pattern I want VersionGopher to enforce is simple:

  1. Scan — Find what's actually there
  2. Explain — Say what it is, clearly, without conflating categories
  3. Triage — Distinguish cache from active from toolchain
  4. Fix — Remediate the actual dependency, not just the dashboard
  5. Verify — Confirm the old artifact is actually gone
  6. Prevent regression — Make sure it can't silently come back

That's what happened here.

VersionGopher found package evidence. The dashboard language was improved. Cache artifacts were distinguished from active installs. A real vulnerable app dependency was upgraded. A vulnerable toolchain package was remediated. Bootstrap scripts were hardened. Tests were run. Regression guards were added.

That is what practical security tooling should do. Not just yell "vulnerable!" into the void like a car alarm that everyone ignores. Not just produce a giant SBOM and walk away like a co-worker who drops a 400-page report on your desk, says "you should read this," and leaves for a two-week vacation. Not just show a CVE count like the number itself is the deliverable — as if anyone ever said "we had 847 findings and now we have 846, ship it, we're done here, great quarter everyone."

A good tool should help answer:

That's where package-risk scanning becomes operationally useful instead of just another dashboard full of red badges that everyone learns to ignore, which is the current endpoint of approximately 90% of all security tooling procurement.

The punchline

I started adding npm and PyPI vulnerability scanning to VersionGopher because package-level risk is becoming one of the defining security problems in modern software.

Then, while building that exact capability, VersionGopher found vulnerable packages in the AI agent's own development environment.

That is the best kind of dogfooding. Slightly annoying, entirely humbling, and genuinely useful — which is usually how you know a security tool is doing something real, as opposed to something marketable.

The agent had pulled down packages. VersionGopher found the weak spots. I challenged the results. The agent fixed the issues. The UI got better. The dependency posture got better. The bootstrap process got better. Everyone learned something, except the agent, which learned nothing because it doesn't retain state between sessions, which honestly might be the most relatable thing about it. We've all had Mondays like that.

The scanner found a tear in the simulation, and in this case the simulation was the development environment building the scanner. That's not a neat demo. That's a Tuesday. And Tuesday is exactly when your security tooling should be working.

That's the loop.

And that's why VersionGopher exists.