OpenAI expands ‘defense in depth’ security to stop hackers using its AI models to launch cyberattacks

OpenAI is preparing for the possibility that threat groups will try to abuse its increasingly powerful AI frontier models to carry out sophisticated cyberattacks.

In a blog post, the company describes how the evolving capabilities of its models could be used to “develop working zero-day remote exploits against well-defended systems, or meaningfully assist with complex, stealthy enterprise or industrial intrusion operations aimed at real-world effects.”

According to OpenAI, the underlying problem is that offensive and defensive uses of AI rely on the same knowledge and techniques, which makes it difficult to enable one without also enabling the other.

“We are investing in safeguards to help ensure these powerful capabilities primarily benefit defensive uses and limit uplift for malicious purposes,” the company said, adding, “we see this work not as a one-time effort, but as a sustained, long-term investment in giving defenders an advantage and continually strengthening the security posture of the critical infrastructure across the broader ecosystem.”

One new initiative is the Frontier Risk Council. The company offered few details about how the council will operate, but said it is part of an expanding “defense in depth” strategy designed to contain the much-discussed potential of AI as an adversarial tool.

“Members will advise on the boundary between useful, responsible capability and potential misuse, and these learnings will directly inform our evaluations and safeguards. We will share more on the council soon,” OpenAI said.

Other initiatives mentioned in the blog include expanding guardrails against misuse, external Red Team testing to assess model security, and a trusted access program designed to give qualifying customers access to enhanced models to explore defensive use cases.

OpenAI also plans to expand its use of Aardvark, its recently announced agentic security researcher currently in beta, to scan its codebase for vulnerabilities and suggest patches or mitigations.

Red Teaming AI

AI companies find themselves under increasing pressure to explain how they will block model misuse. The anxiety is not hypothetical; last month, OpenAI rival Anthropic admitted that its AI programming tool, Claude Code, had been used as part of a cyberattack targeting 30 organizations, the first time malicious AI exploitation had been discovered at this scale.

Meanwhile, university researchers in the US reported this week that the Artemis AI research platform outperformed nine out of ten penetration testers at finding security vulnerabilities. As the team pointed out, it did this at a fraction of the cost of a human researcher, potentially putting such capabilities within reach of attackers who lack deep resources.

Balancing this is the possibility that defenders could use AI to find the same vulnerabilities. OpenAI’s blog alludes to this capability when it mentions testing its models against the Red Teaming Network it announced two years ago.

The reaction of industry experts to OpenAI’s latest announcement has been mixed. A recurring worry is the inherent difficulty of stopping malicious use of leading models.

“OpenAI is asking models to constrain their own capabilities through refusal training, which can be compared to asking a lock to decide when it should open,” commented Jesse Williams, co-founder and COO of AI agent DevOps company Jozu. In effect, the model, not its human authors, defines what is harmful.

“The distinction is intent and authorization, which models cannot infer from prompts. Jailbreaks consistently defeat refusal training, and sophisticated adversaries will probe detection boundaries and route around them. Safeguards reduce casual misuse, but won’t stop determined threats,” said Williams.

“OpenAI’s ‘trusted access program’ sounds reasonable until you examine implementation. Who qualifies as trusted? University researchers? Defense contractors? Foreign SOC analysts?”

Even with guardrails, AI safety can’t be guaranteed, observed Rob Lee, chief AI officer at the SANS Institute.

“Last month, Anthropic disclosed that attackers used Claude Code, a public model with guardrails, to execute 80-90% of a state-sponsored cyberattack autonomously. They bypassed the safety controls by breaking tasks into innocent-looking requests and claiming to be a legitimate security firm. The AI wrote exploit code, harvested credentials, and exfiltrated data while humans basically supervised from the couch,” he pointed out.

“That’s the model with guardrails. But if you’re [a villain] and you want your AI Minions to be as evil as possible, you just spin up your own unguardrailed model,” he said. “[There are] plenty of open-weight options out there with no ethics training, no safety controls, and nobody watching. Evil will use evil. … OpenAI’s safety frameworks only constrain the people who weren’t going to attack you anyway.”

Not all experts are this pessimistic. According to Allan Liska, threat intelligence analyst at Recorded Future, it is important not to exaggerate the threat posed by AI. “While we have reported an uptick in interest and capabilities of both nation-state and cybercriminal threat actors when it comes to AI usage, these threats do not exceed the ability of organizations following best security practices,” said Liska.

“That may change in the future, however, at this moment it is more important than ever to understand the difference between hype and reality when it comes to AI and other threats.”

A previous version of this story contained comments incorrectly attributed to Rob Lee, which have been replaced with the correct remarks.
