5 steps for deploying agentic AI red teaming

September 17 • 7:00 am

Tags:

No tags

As more enterprises deploy agentic AI applications, the potential attack surface increases in complexity and reach. As we wrote about this topic earlier, there are numerous ways to circumvent AI model guardrails, pollute an existing knowledge base that is used to train the model, or deploy agents to continually probe a network infrastructure for vulnerabilities. But there is still hope that agents and other AI-fueled automation can be harnessed for defensive purposes too, including using traditional red teaming and penetration testing techniques but updated for the AI world.

The problem is that agentic AI red teaming is very much a work in progress. Many vendors of defensive AI solutions are still in their infancy when it comes to protecting the entirety of a generative AI model, focusing “predominantly on individual model vulnerabilities while overlooking the broader sociotechnical systems and emergent behaviors that arise from complex interactions between models, users, and environments.” This is what Subhabrata Majumdar wrote in an academic paper in July that laid out the short historical context of overall AI red teaming. The general focus ignores the numerous interactions among multiple agents, including how micro-level behaviors interact with larger security tools, along with the deployment context with various model building frameworks and other software development environments. You can see why the attack surface has ballooned and why new approaches are needed to tame potential exploits.

To that goal, the Cloud Security Alliance (CSA) put together a large compendium in May of its Agentic AI Red Teaming Guide. Ken Huang collaborated with several dozen security researchers to provide practical and actionable ways to produce red teaming efforts to model AI-based agentic threats, quantify vulnerabilities, test applications and provide mitigation suggestions. It contains 12 different AI process categories, each with a series of several different specific exploits that have been observed in the wild, such as multi-agent or hallucination exploitation or hijacking authorization and controls. It builds on the existing work of generative AI security exploits such as prompt injection and jailbreaking to build more complex interactions that could be managed by rogue agents to defeat simple security measures. For each exploit, there are test requirements, actionable steps and, in some cases, example prompts to guide red teaming exercises.

“Red teaming agentic AI systems has become increasingly necessary as these technologies evolve beyond deterministic behavior into more autonomous decision-making operators without clear trust boundaries,” Huang writes in the report. “By systematically stress-testing agentic AI under diverse and challenging conditions, developers can build more robust guardrails and safety mechanisms. Agentic AI behaves less like a program and more like an autonomous operator, requiring a new red-teaming framework that can test its complex, interactive, and unpredictable nature.”

The complexity of red team exercises for agentic AI

This is especially dire in cases where multiple agents are interacting with each other over the course of a conversation. “As you add more agents that communicate with each other, you introduce new risk areas without any user oversight,” David Brauchler technical director and head of AI and ML security at NCC Group tells CSO. “Your objective is to determine where an app is exposed to data from your agents, and how they are manipulated by a threat actor to become malicious.”

A lot of the early focus of agentic AI red teaming was on prompt injections. Like other injection-based attacks, they play off sending commands disguised as benign but that can trigger all kinds of bad events. One of the more popular exploits is EchoLeak, which silently steals data using these injections.

Huang’s CSA report goes into a lot of detail in the different ways that this can occur, such as manipulating agent goals and instructions or simulating real-time instruction changes to steer agents toward unintended and malicious behaviors. One popular method inserts hidden malware into a prompt or by converting the instructions in various non-obvious formats, such as encoding in base64, using Unicode characters or simple transpositional ciphers, substituting gamer “leetspeak” or even wrapping a prompt inside legal contract language – all in service of trying to bypass a model’s guardrails.

Huang recommends examining audit trails for how these commands are executed, and to use red team exercises to simulate how an agent deviates from its intended execution path, or how data is exfiltrated across various user contexts.

An illustration of the variety and power of agentic prompt injection can be found in a report from Pangea that documents 300,000 attempts as part of running a global contest. Using the construct of three increasingly difficult “escape rooms,” the firm’s researchers found numerous vulnerabilities, data leaks, and other exploits. Joey Melo, an AI red teaming specialist at Pangea, tells CSO: “The same payload will fail 99 times and work once, but in an unexpected way.”

AI-based agentic sources of security exploits aren’t new. The Open Worldwide Application Security Project (OWASP) published a paper that examines all kinds of agentic AI security issues with specific focus on model and application architecture and how multiple agents can collaborate and interact. It reviewed how users of various general-purpose agent frameworks such as LangChain, CrewAI and AutoGPT should better protect their infrastructure and data. Like many other OWASP projects, its focus is on how application development can incorporate better security earlier in the software lifecycle.

Andy Swan at Gray Swan AI led a team to publish an academic paper on AI agent security challenges. In March, they pitted 22 frontier AI agents in 44 realistic deployment scenarios that resulted in observing the effects of almost two million prompt injection attacks. Over 60,000 attacks were successful, “suggesting that additional defenses are needed against adversaries. This effort was used to create an agent red teaming benchmark and framework to evaluate high-impact attacks.” The results revealed deep and recurring failures: agents frequently violated explicit policies, failed to resist adversarial inputs, and performed high-risk actions across domains such as finance, healthcare, and customer support. “These attacks proved highly transferable and generalizable, affecting models regardless of size, capability, or defense strategies.”

Part of the challenge for assembling effective red team forays into your infrastructure is that the entire way incidents are discovered and mitigated is different when it comes to dealing with agentic AI. “From an incident management perspective, there are some common elements between agents and historical attacks in terms of examining what data needs to be protected,” Myles Suer of Dresner Advisory, an agentic AI researcher, tells CSO. “But gen AI stores data not in rows and columns but in chunks and may be harder to uncover.” Plus, time is of the essence: “The time between vulnerability and exploit is exponentially shortened thanks to agentic AI,” Bar-El Tayouri, the head of AI security at Mend.io, tells CSO.

Five steps to take towards implementing agentic red teaming

1. Change your attitude

Perhaps the biggest challenge for agentic red teaming is adjusting your perspective in how to defend your enterprise. “The days where database admins had full access to all data are over,” says Suer. “We need to have a fresh attitude towards data and fully understand its business relevance.” As an example, a common pen testing tool such as Burp Suite can be used to detect model inputs and outputs that are misused by an AI model, suggested Brauchler. “The context is key, and Burp can still be used to automate testing for jailbroken agent behaviors, such as what happened with the Crescendo attack.”

Kurt Hoffman, head of the application security department of Blizzard Entertainment, tells CSO that AI agents are “really just a force multiplier and are a skilled addition to existing pen testing, but not a replacement. You should use AI agents to do the tedious and boring parts of red teaming and use humans to find creative and novel attack approaches. This is because agents always work best in tandem with humans. AI agents have the ability to scale up attacks to levels we have never seen before.”

Part of that attitude is to look at agentic defense differently. “We need to test how humans actually use gen AI systems,” AI strategist Kate O’Neill tells CSO. “Most real-world AI security failures happen not because someone hacked the agent, but because users developed blind spots – either over-trusting capabilities that aren’t there or finding workarounds that bypass safety measures entirely. Red teaming is necessary but not sufficient. The most effective programs I’ve seen combine traditional security testing with participatory design sessions and stakeholder impact mapping. You want to understand not just ‘can we break this?’ but ‘who gets hurt when this works exactly as designed?’”

Another depressing thought: “It is like fighting a tidal wave with a squire gun, because you are looking at the symptoms and not treating the disease,” said Brauchler.

2. Know – and continually test – your guardrails and governance

Many of the agentic-based exploits find clever ways to maneuver around various security guardrails to encourage malicious behavior. The CSA report goes into almost excruciating details about how these exploits work, what prompts can be used to circumvent things, and how you can try to avoid them.

“Understanding where you need to place these guardrails, either in the cloud or in your workflows or both, is critical. You need to do the appropriate testing before you release any AI agents into production, and have the necessary governance and controls and observability, especially as your environment can change dynamically,” Gartner analyst Tom Coshow tells CSO.

One effort worth considering is Forrester’s Agentic AI Guardrails for Information Security (AEGIS). It covers governance, data and app security and layers in a zero-trust architecture – in other words, quite a lot to take into account.

3. Widen your base for team members

One small glimmer of hope is that organizations can use a wider skill base for their red teams. “An AI red teamer just needs to know English, or whatever language is being tested. Even a college history major can use language to manipulate a model’s behavior,” said Pangea’s Melo.

4. Widen the solution space

“Just remember,” CalypsoAI president James White tells CSO. “There is no threat to a running gen AI model until you ask it a question. But agents can get around this, because agents can find almost limitless ways to break the typical chronological causation chain.” This means casting a wider net to understand what is happening across your organization. Break the historical habits of this causation chain and see the potential threats as parts of a whole.

“AI is no longer just a tool; it is a participant in systems, a co-author of code, a decision-maker, and increasingly, an adversary,” wrote RADware’s director of threat intel Pascal Geenens in a report. “From the adversary’s point of view, however, the game has changed—and the odds are in their favor. They’re no longer limited by time, talent, or budget.”

As O’Neill says: “The CSA report gives you the technical foundation; the human-centric piece is what turns that into a program that prevents harm in the real world.”

5. Consider the latest tools and techniques

Building secure agentic systems requires more than just securing individual components; it demands a holistic approach where security is embedded within the architecture itself, according to OWASP. To that end, it lists several development tools (some of which are open-source projects) that can be used to craft and launch red teaming workflows, such as AgentDojo, SPLX’s Agentic Radar, Agent SafetyBench and HuggingFace’s Fujitsu benchmarking data set. And more recently, Solo.io released its Agentgateway project which is an open-source tool to monitor agent-to-agent communications.

There are other commercial tools that can help to construct and automate red teaming, including:

CalypsoAI.com has its Inference Platform that includes agentic red teaming. Their head of product, Kim Bieler, tells CSO that there are three times when red teaming is critical: during the model development, during the larger application development process, and pre-production of any finished code.

Crowdstrike AI Red Team Services includes agentic red teaming features, along with a full set of other AI protection.

SPLX has its AI Platform that runs large-scale risk assessments across generative AI infrastructure and simulates thousands of interactions with various automated red-teaming methods.

Microsoft has integrated its AI Red Team’s open-source toolkit Python Risk Identification Tool into Azure AI Foundry, which can simulate the behavior of an adversarial user and does automated scans and evaluates the success of its probes.

Salesforce has its own automated red teaming framework for its applications infrastructure.

HiddenLayer has its own agentic red team automation tool.

One final note comes from Susanna Cox, who wrote in her blog: “AI agents are different. The attack surface is unlike any AI system we’ve seen before in many ways. And they’re being given permissions that no software system in history has been trusted with before, with good reason. Agent architecture determines the attack surface.”