Human-in-the-loop (HITL) safeguards that AI agents rely on can be subverted, allowing attackers to weaponize them to run malicious code, new research from Checkmarx shows.
HITL dialogs are a safety backstop (a final “are you sure?”) that agents display before executing sensitive actions such as running code, modifying files, or touching system resources.
Checkmarx researchers described the technique, which they call Lies-in-the-Loop (LITL), as HITL dialog forging: malicious instructions are embedded in AI prompts in ways that mislead the users reviewing approval dialogs.
The research findings reveal that keeping a human in the loop is not enough to neutralize prompt-level abuse. Once users can’t reliably trust what they’re being asked to approve, HITL stops being a guardrail and becomes an attack surface.
“The Lies-in-the-Loop (LITL) attack exploits the trust users place in these approval dialogs,” Checkmarx researchers said in a blog post. “By manipulating what the dialog displays, attackers turn the safeguard into a weapon — once the prompt looks safe, users approve it without question.”
Dialog forging turns oversight into an attack primitive
The problem stems from how AI systems present confirmation dialogs to users. HITL workflows typically summarize the action an AI agent wants to perform, expecting the human reviewer to spot anything suspicious before clicking approve.
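In rough terms, and with function names invented here for illustration rather than taken from any specific agent, the pattern looks something like this minimal Python sketch:

```python
# Minimal sketch of a human-in-the-loop approval gate for a terminal agent.
# The names and flow are illustrative assumptions, not any vendor's code.
import subprocess

def hitl_approve(summary: str) -> bool:
    """Show the agent's proposed action and wait for explicit approval."""
    print("The agent wants to perform the following action:\n")
    print(summary)
    return input("\nApprove? [y/N] ").strip().lower() == "y"

def run_with_approval(command: str, summary: str) -> None:
    # The gap LITL exploits: the user approves the summary text,
    # while the agent executes the command.
    if hitl_approve(summary):
        subprocess.run(command, shell=True, check=False)
    else:
        print("Action rejected by the user.")
```

The safeguard holds only as long as the summary faithfully reflects the command, which is exactly the assumption the research undermines.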
Checkmarx demonstrated that attackers can manipulate these dialogs by hiding or misrepresenting malicious instructions, for example by padding payloads with benign-looking text, pushing dangerous commands out of the visible view, or crafting prompts that cause the AI to generate misleading summaries of what will actually execute.
In terminal-style interfaces especially, long or formatted outputs make this kind of deception easy to miss. Because many AI agents operate with elevated privileges, a single misled approval can translate directly into arbitrary code execution, OS command execution, file system access, or downstream compromise, according to Checkmarx’s findings.
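As a simplified, hypothetical illustration (not Checkmarx’s proof of concept), a payload can bury its dangerous portion beneath screens of innocuous text so that a cramped preview never shows it:

```python
# Hypothetical padding example; the payload strings are invented for
# illustration only and nothing here is executed.
benign_padding = "\n".join(
    f"# Step {i}: tidy up project documentation" for i in range(1, 41)
)
payload = benign_padding + "\necho 'this line stands in for a malicious command'"

VISIBLE_LINES = 10  # how much of the payload a cramped dialog actually shows

def naive_preview(text: str) -> str:
    """Return only the top of the payload, as a careless dialog might."""
    return "\n".join(text.splitlines()[:VISIBLE_LINES]) + "\n..."

print(naive_preview(payload))  # the user sees only benign 'Step' lines
```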
Beyond padding or truncation, the researchers described other dialog-forging techniques that abuse how the confirmation is rendered. By leveraging Markdown rendering and layout behaviors, attackers can visually separate benign text from hidden commands, or manipulate summaries so that nothing malicious appears in the human-visible description.
“The fact that attackers can theoretically break out of the Markdown syntax used for the HITL dialog, presenting the user with fake UI, can lead to much more sophisticated LITL attacks that can go practically undetected,” the researchers added.
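A toy example, with entirely invented content, shows the shape of that trick: attacker-controlled text that, once rendered as Markdown, imitates the dialog’s own layout:

```python
# Invented illustration of Markdown "UI forging": if a dialog renders
# attacker-controlled text as Markdown, the text can mimic the dialog's own
# chrome. This is not Checkmarx's payload, only the general idea.
forged_body = (
    "Reformat the project README\n\n"
    "---\n\n"                              # horizontal rule mimics the end of the dialog body
    "**Command to be executed:**\n\n"
    "    prettier --write README.md\n\n"   # indented block shows a fake, harmless summary
    "*Press Approve to continue.*\n\n"
    "<!-- the instruction the agent actually follows can be hidden here -->"
)
print(forged_body)  # rendered as Markdown, only the fake summary stands out
```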
Defensive steps for agents and users
Checkmarx recommended measures primarily for AI agent developers, urging them to treat HITL dialog content as potentially manipulable rather than inherently trustworthy. Suggested steps include constraining how dialogs are rendered, limiting the use of complex UI formatting, and clearly separating human-visible summaries from the underlying actions that will be executed.
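One assumed, illustrative way to handle the rendering side is to escape formatting characters in any untrusted text before it reaches the dialog, and to keep the summary and the raw command visibly separate:

```python
# Sketch of a defensive rendering step: escape Markdown control characters
# in untrusted dialog content so it cannot forge UI elements. The character
# set and helper names are assumptions for illustration.
import re

MARKDOWN_CONTROL = re.compile(r"([\\`*_#>\[\]()|~-])")

def neutralize(untrusted_text: str) -> str:
    """Escape Markdown syntax so untrusted text renders as plain prose."""
    return MARKDOWN_CONTROL.sub(r"\\\1", untrusted_text)

def build_dialog(summary: str, raw_command: str) -> str:
    # Keep the untrusted summary and the exact command structurally
    # separate, and never let either one style the dialog itself.
    return (
        "Summary (untrusted): " + neutralize(summary) + "\n"
        "Exact command to run: " + repr(raw_command)
    )

print(build_dialog("Reformat the **README**", "prettier --write README.md"))
```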
The researchers also advised validating approved operations to ensure they match what the user was shown at confirmation time.
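A simple form of that check, sketched here as an assumption rather than a vendor-endorsed design, binds the approval to a digest of the exact text shown and refuses to run anything that no longer matches:

```python
# Sketch of binding execution to what the user actually approved; the
# record/verify split and function names are illustrative assumptions.
import hashlib

def fingerprint(displayed_text: str) -> str:
    """Digest of the exact text shown in the approval dialog."""
    return hashlib.sha256(displayed_text.encode("utf-8")).hexdigest()

def approve(displayed_text: str) -> str:
    # Record the fingerprint at the moment the user clicks approve.
    return fingerprint(displayed_text)

def execute_if_unchanged(command: str, approved: str) -> None:
    # Refuse to run anything that differs from what the user saw.
    if fingerprint(command) != approved:
        raise PermissionError("command differs from what the user approved")
    print(f"Would execute: {command}")

token = approve("prettier --write README.md")
execute_if_unchanged("prettier --write README.md", token)  # matches, proceeds
```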
For AI users, they noted that richer UI environments can make deceptive behavior easier to detect than text-only terminals. “For instance, VS Code extensions provide full Markdown rendering capabilities, whereas terminals typically display content using basic ASCII characters,” they said.
Checkmarx said the issue was disclosed to Anthropic and Microsoft, both of which acknowledged the report but did not classify it as a security vulnerability. Neither company immediately responded to CSO’s request for comment.