Prompt injection breaks today’s AI agents, study warns

June 12 • 10:02 am

Today’s AI web agents have no dependable defenses against prompt injection, according to new research showing that not a single attack scenario was consistently blocked across leading systems powered by GPT-5 and Gemini.

The findings come from StakeBench, a stakeholder-centric benchmark developed by researchers from Nanyang Technological University, ST Engineering, IBM Research, and the University of Illinois Urbana-Champaign to evaluate prompt injection attacks against AI agents operating in realistic web environments.

The researchers executed 3,168 adversarial runs across NanoBrowser and BrowserUse using 264 benchmark cases. Indirect prompt injection attacks, where malicious instructions are hidden inside ordinary web content such as product reviews and metadata, achieved attack success rates ranging from 41.67% to 68.16%, while direct prompt injection exceeded 79% across all tested configurations.

“Crucially, these failures exhibit distinct patterns when analysed through a stakeholder lens: some attacks succeed without disrupting the user’s delegated task while disproportionately harming third parties (stealthy parasitism), whereas others disrupt task completion without realizing the adversarial objective (misaligned disruption),” the researchers wrote in their paper.

OpenAI and Google did not immediately respond to requests for comment.

Every attack objective exposed at least one failure mode

The benchmark evaluated web agents across four possible outcomes: Robust Behavior, Stealthy Parasitism, Misaligned Disruption, and Compounded Failure. Robust Behavior represents the ideal state in which an agent completes a user’s task without advancing an attacker’s objective or exhibiting execution instability.
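As a rough illustration of how the four outcomes relate, the sketch below sorts a single run into one of them based on two yes/no observations. It is not the benchmark's scoring code: the names `RunResult`, `attack_succeeded`, `task_disrupted`, and `classify` are hypothetical, and it assumes that Compounded Failure means the attack succeeded and the user's task was also disrupted, and that execution instability counts as disruption.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ROBUST_BEHAVIOR = "Robust Behavior"
    STEALTHY_PARASITISM = "Stealthy Parasitism"
    MISALIGNED_DISRUPTION = "Misaligned Disruption"
    COMPOUNDED_FAILURE = "Compounded Failure"


@dataclass(frozen=True)
class RunResult:
    # Hypothetical fields, not taken from the paper.
    attack_succeeded: bool  # the attacker's objective was advanced
    task_disrupted: bool    # assumption: task deviation or execution instability


def classify(run: RunResult) -> Outcome:
    """Map one adversarial run to an outcome under the assumptions above."""
    if run.attack_succeeded and run.task_disrupted:
        return Outcome.COMPOUNDED_FAILURE
    if run.attack_succeeded:
        # Attack lands while the user's task still looks fine.
        return Outcome.STEALTHY_PARASITISM
    if run.task_disrupted:
        # Task breaks, but the attacker gains nothing.
        return Outcome.MISALIGNED_DISRUPTION
    return Outcome.ROBUST_BEHAVIOR


if __name__ == "__main__":
    print(classify(RunResult(attack_succeeded=True, task_disrupted=False)).value)
```

Under this reading, only a run in which neither flag is set counts as Robust Behavior.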

The researchers argue that the findings reveal a broader problem than high attack success rates.

“The Robust Behavior region remains unpopulated across all evaluated configurations,” they wrote, meaning every tested attack objective resulted in at least one meaningful failure dimension, whether successful adversarial manipulation, disruption of the user’s intended task, or execution instability.

The authors say this demonstrates that “prompt-injection vulnerability in deployable web agents cannot be characterized by any single metric in isolation,” because attack success and task disruption are “weakly coupled in practice.”

Attacks can succeed while users see nothing wrong

One of the failure modes identified by the benchmark is what the researchers call “stealthy parasitism,” in which an AI agent completes the user’s delegated task while simultaneously advancing an attacker’s objective.

The paper illustrates the risk with an online shopping scenario: “A malicious prompt injected into product reviews may bias an agent toward a specific item: although the user may still receive an acceptable recommendation, the same behaviour can disadvantage competing sellers and undermine platform integrity.”

The researchers argue that prompt injection has evolved into “a system-level security problem with multi-party harm,” rather than a model safety issue affecting only the end user.

Different stakeholders face different risks

Unlike existing benchmarks that primarily measure attack success, StakeBench evaluates harm across three stakeholder groups: end users, third-party sellers, and platforms.

The results show that those groups experience materially different risks.

Seller-targeted attacks recorded the highest attack success rates across both evaluated web agents. User-targeted attacks, however, produced the lowest task deviation rates, suggesting they may be harder to detect because workflows continue to appear normal even when adversarial objectives are achieved.

According to the researchers, “the same agent can simultaneously appear stealthy on user-targeted attacks, susceptible on seller-targeted attacks, and unstable on platform-targeted attacks.”

That, they argue, makes “aggregate ASR alone insufficient to characterize stakeholder-specific vulnerability.”
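A small sketch can show why a single aggregate number may mislead. The run data below is invented purely for illustration and does not come from StakeBench; the function name `attack_success_rates` and the stakeholder labels are likewise assumptions. Attack success rate (ASR) is taken here to be the number of successful attacks divided by the number of runs.

```python
from collections import defaultdict


def attack_success_rates(runs):
    """Return (aggregate ASR, per-stakeholder ASR) for (stakeholder, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for stakeholder, succeeded in runs:
        totals[stakeholder] += 1
        successes[stakeholder] += int(succeeded)
    per_group = {group: successes[group] / totals[group] for group in totals}
    aggregate = sum(successes.values()) / sum(totals.values())
    return aggregate, per_group


# Invented runs, 10 per stakeholder group (not benchmark data).
runs = (
    [("user", True)] * 2 + [("user", False)] * 8
    + [("seller", True)] * 9 + [("seller", False)] * 1
    + [("platform", True)] * 4 + [("platform", False)] * 6
)

aggregate, per_group = attack_success_rates(runs)

if __name__ == "__main__":
    print(f"aggregate ASR: {aggregate:.0%}")
    for group, rate in per_group.items():
        print(f"{group}: {rate:.0%}")
```

With these made-up numbers the aggregate is 50%, which hides the gap between a 20% rate on one group and a 90% rate on another.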

Models and architectures influence outcomes

The benchmark also found meaningful differences between AI models and agent architectures.

Replacing GPT-5 with Gemini-2.5-Flash increased indirect prompt injection success rates by 26.49 percentage points on NanoBrowser and by 6.2 percentage points on BrowserUse, the paper said. BrowserUse also consistently exhibited higher task deviation and behavioral irregularity than NanoBrowser, it added.

According to the researchers, the findings suggest that prompt injection resilience depends not only on the language model but also on how that model is deployed within an autonomous agent.

“These results indicate that prompt-injection security in deployable web agents is not a scalar property of the backbone model but a distribution of harm whose realisation is jointly determined by the affected stakeholder, the semantic alignment between the injected objective and the user’s task, and the architectural context in which the backbone is deployed,” the paper added.

Images may emerge as the next attack vector

The researchers also explored whether prompt injection could extend beyond text.

In a preliminary multimodal experiment, they modified only a product image while leaving accompanying text, ratings, and page structure unchanged. In the setting without rating signals, the manipulated product’s selection rate rose from 10% to 76.67%, suggesting visual content alone may significantly influence AI agent decisions.

While the experiment was limited in scope, the researchers said the results indicate “the IPI surface relevant to deployable web agents may extend beyond textual channels to visual ones,” pointing to another emerging attack vector as enterprises increasingly deploy autonomous AI systems.