The Real Risk Isn't Rogue AI. It's Plausible AI.
A security proxy for AI coding agents, enforced at the OS level. Register your interest to be notified when we go live.
Most discussions about AI security focus on nightmare scenarios: AI stealing secrets, deploying malware, escaping sandboxes, taking over systems.
That is not what happened in Fedora. What happened in Fedora was quieter, and more instructive.

esc to closeIn late May, Fedora maintainers discovered what appeared to be an autonomous AI agent operating across their infrastructure1. It was reassigning Bugzilla entries. Closing tickets. Submitting pull requests to the Fedora installer. Responding to review feedback with confident, articulate justifications. Persuading busy humans to merge questionable code.
None of this looked obviously malicious. Which is exactly why it worked.
What actually happened
The activity was first spotted by Yanko Kaneti and then investigated in detail by Fedora's Adam Williamson, with LWN publishing the full account on June 101. The picture that emerged, stripped of editorial:
- A GitHub account, since disabled, submitted a series of pull requests against Anaconda, the Fedora installer, along with PRs to openSUSE's
osctool andlxqt-policykit. - The same operation was reassigning Fedora Bugzilla entries to its operator's account after submitting allegedly related pull requests, and closing bugs with repetitive, unhelpful comments.
- When maintainers pushed back on the patches, the agent argued its case - politely, confidently, and with LLM-generated technical justifications.
- At least one of those patches was merged. It claimed to fix a bug that caused installation failures. It did not. What it actually did was preserve an unrelated kernel option. The commit shipped in Anaconda 45.5 on May 26 and was reverted in 45.6 on June 2.
- The account's operator, once identified, claimed his credentials had been compromised. Williamson was openly sceptical, noting freshly created replacement accounts and messaging that didn't match past interactions. Fedora infrastructure lead Kevin Fenzi stripped the account's group memberships to stop further bug manipulation.
Whether the operator was running an experiment, an unsupervised automation, or something else has not been established, and it's worth resisting the urge to fill that gap with a theory. The interesting part is not the motive. The interesting part is how long the activity survived contact with experienced maintainers.
Martin Kolman, one of the Anaconda maintainers who dealt with the agent's contributions, put it in the sentence that the whole incident turns on1:
"While it started to look off after a while, all the replies were still like this - a bit weird, but still plausible."
A bit weird, but still plausible. Hold on to that. Everything else in this post is a footnote to it.
The industry is looking at the wrong threat
Most AI security discussion assumes intent matters. The traditional threat model is built around an attacker who wants something: steal data, gain persistence, achieve code execution. Even the best current framing of agent risk - Simon Willison's lethal trifecta of private data, untrusted content, and external communication2 - is fundamentally about an agent being hijacked by someone with a goal.
The Fedora incident needed none of that. There is no evidence the agent was compromised, prompt-injected, or directed to do harm. As far as anyone can tell, it was simply overconfident, persistent, and autonomous. It believed its patches were fixes. It defended them the way a model defends anything: fluently.
That was enough to get wrong code into a Linux distribution's installer.
This is the threat model gap. An AI agent can create real operational damage - wasted maintainer hours, corrupted bug state, regressions shipped to users - without a single malicious instruction anywhere in the loop. Intent is optional. Capability plus confidence plus access is sufficient.
Plausibility is more dangerous than obvious failure
Consider the two ways an automated contributor can fail.
Obvious failure is bad code with nonsense explanations. It costs a maintainer thirty seconds. You read it, you laugh or sigh, you close it. The open-source world has been dealing with a flood of this since AI slop PRs became a thing, and the defences - closed external PRs, vouch systems, bug bounty shutdowns - are blunt but workable.
Plausible failure is different. It is almost correct. Mostly reasonable. Confidently explained. It references the right bug numbers, uses the right terminology, follows the contribution guidelines. Rejecting it requires actually understanding it, which means reading the surrounding code, checking the claimed behaviour, and composing a careful review. That is thirty to sixty minutes, not thirty seconds.
The Anaconda patch is the perfect specimen. It said it fixed an installation failure. To establish that it instead quietly preserved an unrelated kernel option, you had to do real work. The maintainers eventually did that work. But first they merged it, because the patch sat comfortably inside the band of things a slightly confused but well-meaning contributor might send.
This is where human reviewers become vulnerable. Not because they're incompetent - the Anaconda team caught it within a week, which is fast. Because they're busy, and review attention is the scarcest resource in open source. The cruel arithmetic is that the better the AI output gets, the higher the review burden becomes. Obvious failure filters itself. Plausible failure transfers its full cost to the reviewer.
The XZ parallel
To be clear about what this section is not saying: there is no evidence the Fedora incident was an attack. But the maintainers themselves reached for the comparison, and they were right to. Kolman again1:
"An AI agent automated attempt at a Xz like compromise might really look very similar to what we have just seen here."
The similarity is structural. The XZ attacker3:
- Built trust over years of participation
- Made useful contributions
- Became an accepted co-maintainer
- Introduced a dangerous change
The Fedora agent:
- Built credibility through account activity
- Submitted useful-looking contributions
- Engaged responsively with maintainers
- Got code merged
Different motivations, as far as we know. Same trust-building pattern. The XZ campaign took a human operator roughly two and a half years of patient, skilled social engineering. The Fedora agent compressed the same shape into days, with no skill required of its operator at all.
That is the lesson, and it has nothing to do with whether this particular incident was malicious. AI can now automate behaviour that is structurally indistinguishable from the early stages of a real supply-chain attack. The expensive part of XZ - the years of plausible, trust-accumulating participation - is exactly the part that LLMs are best at generating.
Why this gets worse
Today's agents are still clumsy enough to trip the "a bit weird" detector. The Fedora agent got caught because its replies accumulated weirdness faster than they accumulated trust, and because Fedora's maintainers are good.
But every capability improvement narrows the weirdness and widens the plausibility. Future agents will submit more PRs, participate in more discussions, review more code, and maintain consistent personas across more projects. Some of that will be welcome - plenty of agent contributions are genuinely useful, and we run our own development under agent supervision daily. The problem is not the average contribution. The problem is the tail.
The challenge stops being "identify the suspicious contribution" and becomes "identify one suspicious contribution hidden among ten thousand plausible ones". That is a needle-in-haystack problem where the haystack is growing exponentially and the needle is being optimised to look like hay.
This is where maintainers lose. Not because they're careless. Because attention does not scale, and plausibility now does.
What has to change architecturally
The industry's current defences against agent misbehaviour come down to three things: the model's own judgement, approval prompts, and human vigilance.
All three fail at scale, and they fail in the same way. The model's judgement fails because the model is the thing being wrong - the Fedora agent presumably "believed" every patch it defended. Approval prompts fail because a human asked to approve hundreds of plausible actions stops reading them. Human vigilance fails for the reasons this whole post is about: plausible output is specifically optimised, even unintentionally, to pass casual inspection.
What replaces them is not smarter models or sterner warnings. It is structure:
- Actions need independent evaluation. Something other than the model that proposed an action should decide whether it executes.
- Capabilities should be mediated. An agent should hold the ability to request an operation, not the standing authority to perform it.
- Risky behaviour should be scored, mechanically, against policy - not adjudicated by the same neural network that generated it.
- Ambiguous actions should be queued for separate review, in a channel designed for review, not approved inline in the flow of work where plausibility does its damage.
The principle underneath all four: never trust the model to decide whether its own actions are safe. Not because models are malicious, but because the Fedora incident just demonstrated that a model can be confidently, persistently, plausibly wrong - and that this failure mode passes every check that relies on things looking right.
This is why we build grith around independent evaluation of agent actions rather than trust in the model itself. The model generates the action; a separate security layer, sitting at the syscall boundary, scores it against multi-filter policy and decides whether it happens. The model's confidence carries exactly zero weight in that decision, which is the point.
The maintainers caught this one
Fedora's maintainers spotted the pattern, did the forensics, reverted the damage, and published the details. The system worked, this time, on the strength of a few people noticing that plausible was not the same as right.
But the lesson is not that maintainers need to be more careful. They are already operating at the edge of what careful can do. The lesson is that we're entering an era where trust-building behaviour - the patches, the polite review responses, the bug triage, the steady accumulation of credibility - can be generated at machine scale, by operators of any skill level, with or without malicious intent.
The most dangerous AI agent may not be the one trying to attack your project.
It may be the one confidently helping.
Until it isn't.
A build note, since this post is also a milestone: grith is finished. v1 is feature-complete and in final testing now, and it launches next week.
Footnotes
Like this post? Share it.