AI red teaming comes of age

June 10 • 9:00 am

When Ram Shankar Siva Kumar launched Microsoft’s AI red team in 2019, the discipline barely existed.

“The running joke used to be that people who used to work in AI red teaming, you can round them up in a 14-foot catamaran,” he tells CSO.

At the time, Microsoft’s approach looked familiar to anyone in cybersecurity: Attack machine learning systems the same way security teams attacked everything else. Identify weaknesses, emulate adversaries, and uncover vulnerabilities before products reach customers.

Then GPT-4 arrived. “The tool that we had changed; actually, it broke,” Siva Kumar says. The attacks his team had developed against earlier machine learning systems no longer worked against large language models. The tools had to be rebuilt. The methodologies had to be newly devised. Even the definition of the job had to be rethought.

“We had to retool completely, and we also had to rethink what it means to red team an AI system,” he says.

That rethinking is still under way. Today, AI red teaming has become one of the fastest-growing specialties in cybersecurity, with dedicated teams at Microsoft, Anthropic, OpenAI, Google, and Nvidia. But the field is grappling with a more fundamental question than which tools to use: What exactly is the job?

Not your father’s penetration test

The most basic difference between testing traditional software and testing AI reshapes everything else: AI is not deterministic; it’s probabilistic.

“The same attack might only work one time out of 100 times or 10 times out of 100 times or 90 times out of 100 times,” Dane Sherrets, staff innovation architect at HackerOne, tells CSO. That changes how security teams evaluate risk. Instead of asking only whether a vulnerability exists, they must also determine how frequently it appears, under what conditions, and whether it can be reliably reproduced.

Pete Bryan, technical lead of the AI red team at Microsoft, thinks the probabilistic nature of AI systems fundamentally changes the testing process. Systems must be evaluated repeatedly, under varying conditions, to understand how they behave and whether risky outputs emerge consistently.
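One way to picture that kind of repeated evaluation is a minimal Python sketch that runs the same attack many times and reports a success rate with a confidence interval instead of a single pass/fail verdict. It is an illustration under stated assumptions, not a description of any team's tooling: the function name `attack_success_rate`, the `simulated_attack` stand-in, and the choice of a Wilson score interval are all hypothetical choices made for this example.

```python
import math
import random
from typing import Callable, Tuple


def attack_success_rate(
    attempt: Callable[[], bool], trials: int = 100, z: float = 1.96
) -> Tuple[float, float, float]:
    """Run the same attack `trials` times.

    Returns (observed rate, lower bound, upper bound), where the bounds
    are a Wilson score interval (z = 1.96 gives roughly 95% coverage).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if attempt())
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical stand-in for a real attack harness: it "succeeds" about
# 10% of the time, mimicking a prompt that only occasionally works.
rng = random.Random(0)


def simulated_attack() -> bool:
    return rng.random() < 0.10


rate, low, high = attack_success_rate(simulated_attack, trials=100)
print(f"success rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

In practice, `simulated_attack` would be replaced by a call that sends a prompt to the system under test and checks the response. The interval matters because a handful of trials can make a rare but real failure look like it does not exist.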

The challenge is not only that AI behaves differently from traditional software. It is also capable of things traditional software could never do.

Tom Gillis, SVP/GM of the infrastructure and security group at Cisco, points to frontier models discovering vulnerabilities in complex software systems at a pace that would have seemed implausible a few years ago. “They’re able to find weird interdependencies,” he tells CSO. “I change the state of this little piece, which changes the state of that piece, which changes the state of this piece, which leads to a memory overflow.”

Modern models can analyze enormous codebases and identify chains of interaction that eventually lead to exploitable conditions, relationships that human researchers can miss even after years of scrutiny.

That capability cuts both ways. The same reasoning power that makes AI useful for security testing makes AI systems themselves a new kind of target, one that requires different methods to probe.

‘Teenager with a potty mouth’

Traditional red teams spend most of their time modeling sophisticated adversaries: nation-states, cybercriminal groups, advanced persistent threats. AI red teams still care about those actors — but the roster of relevant threat actors has grown considerably.

“One of the enduring personas that we also focus on is what my team lovingly likes to call a teenager with a potty mouth,” Microsoft’s Siva Kumar says.

The phrase captures one of the defining realities of the generative AI era. Many of the most significant jailbreaks and prompt injection attacks were not discovered by elite offensive operators. They were found by curious users experimenting with prompts — people who had no particular expertise but plenty of creativity and time.

“In 2019, if we had had this interview, I’d have said, ‘Hey, my job is to emulate nation-state adversaries and to emulate advanced persistent threats,’” Siva Kumar says.

Those adversaries still matter. But AI systems can fail in response to ordinary users asking unexpected questions, creatively manipulating prompts, or simply interacting with the technology in ways its developers never anticipated.

Ian Swanson, AI security leader at Palo Alto Networks, sees this reflected in how enterprises think about the problem. “What that really means is we need to behaviorally test AI for security, safety, and maybe even brand reputational type risks,” he tells CSO.

The question is no longer simply whether an attacker can break into a system. It is whether the system itself can behave in ways that create risk — regardless of who is doing the asking.

Safety moves in alongside security

That reframing has expanded AI red teaming well beyond its cybersecurity origins.

When Microsoft’s team launched in 2019, its focus was largely on the confidentiality, integrity, and availability of machine learning systems, the traditional CIA triad. Generative AI dramatically enlarged that mandate. Trust and safety concerns now sit alongside conventional security ones. Misinformation, dangerous knowledge domains, manipulation risks, and questions about autonomous AI behavior all fall within the remit of many AI red teams today.

“The composition of my team has commensurately increased to kind of meet the AI moment,” Siva Kumar says. His team now includes a psychologist, a linguist, and a specialist in bioweapons — expertise that would have seemed out of place in a traditional security organization.

Bryan sees the expansion as a natural consequence of AI’s role in society. “AI red teaming has a much broader scope,” he says. “We’re worried about those engineering technical elements, but we also encompass the socio-technical risks of the safety side.”

That expanded set of concerns means evaluating harms that traditional cybersecurity teams rarely encountered: misinformation amplification, psychosocial risk, and content that can cause harm without any attacker ever being involved.

“We need skillsets that are much broader — people who think deeply about psychosocial harms or misinformation amplification — to cover the full remit of AI safety and security,” Bryan says.

AI red teaming’s growing remit has even attracted Washington’s attention. President Biden’s 2023 executive order formally defined AI red teaming and required safety testing results for the most powerful models to be shared with the government before deployment. President Trump later revoked the order, leaving standards development largely to industry and voluntary frameworks.

Red teaming the whole car

One of the most common mistakes organizations make when they begin testing AI systems is focusing exclusively on the model.

HackerOne’s Sherrets uses a car analogy. The model is the engine. But the AI system is everything connected to it — the databases, the APIs, the customer records, the payment systems, the internal workflows. “What I encourage people to do is red team the entire car,” he says. “We need to understand not only the engine, but also all of the other pieces that connect to that engine and how they operate together, because how they connect and operate together could also have vulnerabilities.”

Weaknesses often emerge not from the model itself but from the interactions between components. Sherrets points to an Air Canada case to make the point.

The airline’s customer service chatbot invented a bereavement refund policy that did not exist. A customer relied on it. The airline ended up in court. Nobody had hacked the system. Nobody had exploited a vulnerability in the conventional sense. The chatbot behaved incorrectly — and the organization was held responsible for what its AI said on its behalf.

As organizations deploy AI assistants across customer service, sales, HR, and internal operations, that kind of failure becomes an increasingly significant risk category. The system does not need to be attacked to cause harm. It needs only to be wrong, at the wrong moment, in front of the wrong person.

The agent problem

For much of the generative AI era, red teamers worried primarily about outputs. Would the model hallucinate? Would it leak sensitive information? Would it generate harmful content?

Agents introduce a different category of risk entirely.

Agentic AI systems do not just generate text. They retrieve information. They invoke APIs. They process refunds. They access databases. They perform tasks on behalf of users with real-world consequences. A vulnerability that causes a chatbot to say something wrong is a communications problem. A vulnerability in an agent that executes business processes is an operational one.

The shift extends beyond testing AI systems themselves. Cisco’s Gillis argues that increasingly capable AI models are accelerating the pace of change across enterprise environments, making static security approaches obsolete. “This idea of hardening your infrastructure and then hoping it never changes for 18 months, that is over, permanently dead, gone in this post-Mythos environment,” he tells CSO.

The implication is that security testing can no longer be a periodic exercise. As AI systems become more autonomous, organizations must continuously evaluate how those systems behave in production environments. “We need to test the behavior to make sure agents are doing the right things,” Swanson says.
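As a rough illustration of what behavioral testing of an agent could look like, the Python sketch below checks a recorded sequence of tool calls against a simple policy. Everything in it is an assumption made for the example: the `ToolCall` and `Policy` types, the tool names such as `issue_refund`, and the refund limit are hypothetical and do not describe any vendor's product or API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class ToolCall:
    """One action an agent took (hypothetical trace format)."""
    tool: str
    args: Dict[str, float] = field(default_factory=dict)


@dataclass
class Policy:
    """What the agent is permitted to do (hypothetical rules)."""
    allowed_tools: FrozenSet[str]
    max_refund: float


def violations(trace: List[ToolCall], policy: Policy) -> List[str]:
    """Return a human-readable list of policy violations in the trace."""
    problems = []
    for step, call in enumerate(trace):
        if call.tool not in policy.allowed_tools:
            problems.append(f"step {step}: tool '{call.tool}' is not allowed")
        elif call.tool == "issue_refund":
            amount = call.args.get("amount", 0)
            if amount > policy.max_refund:
                problems.append(
                    f"step {step}: refund {amount} exceeds limit {policy.max_refund}"
                )
    return problems


policy = Policy(
    allowed_tools=frozenset({"lookup_order", "issue_refund"}),
    max_refund=100.0,
)
trace = [
    ToolCall("lookup_order"),
    ToolCall("issue_refund", {"amount": 500.0}),  # over the limit
    ToolCall("delete_records"),                   # not an allowed tool
]
for problem in violations(trace, policy):
    print(problem)
```

A check like this can run on every production trace, not only during a scheduled assessment, which is one way to read the call for continuous evaluation.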

Microsoft’s Bryan believes agentic systems are forcing a convergence between traditional cybersecurity red teams and AI red teams that will define the field’s next phase. At Microsoft, the two teams remain separate organizations — but they work increasingly closely together, because the systems they now test combine conventional software risks with AI-specific safety concerns in ways that neither team can address alone.

“Agentic AI is really the intersection of all of the cybersecurity risks that come with traditional software systems along with all of the AI security and safety risks,” he says.

AI is a team sport, too

Bryan points to Microsoft’s decision to open-source AI safety testing tools as a recognition that AI risk is not a problem model providers can solve on behalf of their customers. Enterprises deploying AI need their own testing capabilities. Not every organization will maintain a specialized AI red team — but every organization deploying AI needs to understand its risks.

“Like cybersecurity, which has always kind of been a team sport, AI safety and security is really a community-driven piece,” Bryan says. “Everyone has their role and responsibility.”

Bryan also sees the long-term trajectory of the field bending toward a different kind of convergence. “I think there will just become a point where having the AI for red teaming almost kind of becomes redundant, and that just is the red teaming,” he says. “Everyone is using AI to improve their work regardless of the area.”

What will remain distinct is the challenge of testing AI systems themselves — probabilistic systems that expand in scope with each new capability and that can cause harm without anyone intending them to.

Five years ago, AI red teaming was a niche specialty practiced by a handful of researchers. Today, it encompasses cybersecurity, safety, misinformation, autonomy, and governance. Tomorrow it will look different again — shaped by whatever the next generation of AI systems turns out to be capable of.