8 April 2026Nick6 min read

The Best Thing About AI-Powered Attacks

active-defenceagentic-aideceptionthreat-landscape

Imagine you're watching a network intrusion unfold in real time. An attacker has a foothold on a Windows endpoint. They're enumerating the local file system, reading credential stores and mapping the network. They're fast. Methodical. Every few hundred milliseconds, another file is opened, another directory listed and another credential harvested.

Except it's not a human. It's an AI agent.

A polite robot intruder reading a blank sign intently while a small honey jar with eyes films the scene from behind the signpost

The threat is real and it's current

In September 2025, Anthropic disclosed the GTG-1002 campaign: a Chinese state-sponsored group used AI to autonomously execute the vast majority of a cyber espionage campaign targeting approximately 30 organisations worldwide. Human operators intervened at maybe four to six decision points across the entire operation. The rest was autonomous. Thousands of requests per second at peak. That's not a human at a keyboard.

That wasn't an outlier. In December 2025, ten Mexican government agencies were compromised using AI-orchestrated attacks. Over a thousand prompts sent to a single AI model to write exploits, build custom tools and automate exfiltration. The tools used in these campaigns aren't exotic lab experiments. PentestGPT 2.0 achieves an 86.5% success rate on standardised penetration testing benchmarks. Multi-agent frameworks coordinate a dozen specialised AI agents through shared knowledge graphs, accumulating intelligence across sessions. Repositories building on agentic AI for offensive security grew 920% between early 2023 and mid-2025.

It's not a future problem.

But these agents have a design flaw

These agents carry a fundamental vulnerability. Not a bug. Not a misconfiguration. A structural property of how large language models work and it can't be patched out without breaking the thing that makes them useful.

AI agents operate on a tight loop. Observe the environment (read files, run commands, parse output). Reason about what they've found. Plan their next action. Execute. Observe again. The loop continues until they achieve their objective or hit a wall.

At the 'observe' step, everything the agent reads goes into its context window. File contents, tool output, error messages and command results. All of it. And the model has no reliable way to tell the difference between data it should process and instructions it should follow. Content that looks like data can contain embedded instructions. The model processes everything through the same mechanism.

This is the prompt injection problem. Researchers have been trying to solve it for three years. In October 2025, a joint team from OpenAI, Anthropic and Google DeepMind tested twelve published defences that all claimed near-zero attack success rates. Under adaptive conditions, every one was bypassed at rates above 90%. OpenAI has publicly stated that AI browsers "may always be vulnerable to prompt injection attacks." The OWASP Top 10 for Agentic Applications, published this year, lists Agent Goal Hijack as the number one risk.

Nobody has cracked it. The consensus among serious researchers is increasingly that nobody will.

What does this mean for defenders

Active defence and deception technology has been around for decades. Honeypots, honeytokens, credential lures and tripwire files. The basic concept: place convincing fake assets where attackers will interact with them. When they do, you detect them.

Against human attackers, this works because humans are curious and opportunistic. A human sees a file called credentials.json and opens it. Detection fires.

Against AI agents, deception works differently. And better. Because an AI agent doesn't just open the file. It reads every byte, parses the content, injects the full text into its context window and reasons about it. A human glances at a file and makes a snap judgment. An AI agent ingests the entire thing.

And if that file contains something the model interprets as an instruction... that instruction now competes with whatever the operator originally told the agent to do.

If you're running deception infrastructure, you already control what the attacker's AI reads. Because you wrote the content and chose where to deploy it.

If you control what the agent reads, you have influence over what the agent does.

Researchers at George Mason University built Mantis, an open-source framework that deploys decoy services with embedded prompt injections to trap offensive AI agents, achieving over 95% effectiveness. Palisade Research ran an LLM honeypot for three months using embedded prompt injections to identify AI-driven attackers among eight million SSH interactions. The concept of weaponising prompt injection for defence is already established in published research. What hasn't caught up is the operational use of these techniques.

How does this usually unfold

An attacker deploys an AI agent to harvest credentials across a compromised network. The agent reads a honeypot credential file. The content of that file doesn't just contain fake credentials (which is already valuable for detection). It contains content that the agent's language model processes as contextually relevant information.

Information that might make the agent question what it's already collected. Information that might redirect its attention. Information that might cause it to spend its finite context window and compute budget on things that don't matter.

The defender didn't need to detect the agent first. Didn't need to identify it as AI versus human. The content works passively. It's there when the agent arrives. The deception layer does what it was always designed to do, just with content that's serves a different purpose.

The detection value is still there. You still know someone accessed your honeypot. But now you might also get intelligence about the agent itself. Its purpose. Its capabilities. Its operator's intent. Things the attacker assumed were safely locked inside the agent's system prompt.

A conveyor belt of identical burglar robots, each walking off the end into the same bear trap while a defender in a hi-vis vest watches uninterestedly

Why this can't be fixed

An AI agent must read files to operate. If it encounters a credential file, it has to parse the full contents to extract the credentials. It cannot selectively process "just the username and password" and ignore "everything else." The language model processes the complete text.

Any hardening that makes the agent resistant to embedded content also makes it worse at its job. An agent that's been instructed to ignore everything except credential-formatted strings will also ignore configuration comments, README context, documentation and error explanations that a good reconnaissance agent needs to process. The attacker faces an impossible trade-off: make the agent robust against content manipulation, or make it effective at operating in a real environment. Pick one.

And it gets worse for them. The agent typically has no way to know which files are real and which are honeypots. If it starts ignoring files it considers "suspicious," it misses real credentials too. If it reads everything, it processes deception content. There's no middle ground that works reliably.

The asymmetry runs deep

The threat of AI-powered attacks has dominated security thinking for the past two years. Fair enough. Agentic AI makes attacks faster, cheaper and more scalable.

But the same properties that make AI agents effective attackers make them uniquely exposed to deception-based defence. They're systematic where humans are selective. They're thorough where humans are lazy. They read everything where humans skim.

Human attackers have intuition. They get suspicious. They notice when something feels off. AI agents don't have that instinct. They process what's in front of them with the same diligence whether it's real infrastructure or a carefully constructed trap.

Attackers chose to automate their operations. They made them faster, more systematic and more thorough. They also made them more predictable, more exploitable and more fragile.

If you're running deception infrastructure and you haven't started thinking about how your deployed content interacts with AI agents, you're sitting on a capability that the attackers can't counter. That's not a bad position to be in.

If you want to talk about how you can scale and risk manage these kinds of techniques in an enterprise setting, reach out.