When AI safety constrains defenders more than attackers

March 10 • 7:00 am

Tags:

No tags

Security teams are being urged to adopt AI copilots for threat modeling, phishing simulations, and SOC workflows. Yet many of the most widely deployed, enterprise-approved AI systems struggle to support realistic defensive scenarios once prompts resemble real-world attack behavior.

This is not because such activity is inherently malicious, but because mainstream AI safety models are designed to prevent broad misuse at scale, rather than distinguish authorized security work from abuse.

Meanwhile, attackers are unconstrained by procurement rules, compliance obligations, or centralized safety enforcement, whether they rely on open-source models, fine-tuned tools, or simply no AI at all.

The guardrail arms race

AI providers have invested heavily in safety mechanisms. OpenAI, Anthropic, Google, and others have implemented increasingly sophisticated filters to prevent their models from generating harmful content. These guardrails represent genuine engineering effort and reflect legitimate concerns about AI misuse.

The problem is that those safeguards operate asymmetrically.

When HiddenLayer researchers tested OpenAI’s guardrails framework in October 2025, they bypassed both jailbreak and prompt injection detection using straightforward techniques. The limitation was architectural. The security judge evaluating content was itself an LLM, susceptible to the same manipulation as the model it was protecting.

Recent research on open-weight models revealed even starker results. In an analysis of open-weight language models, Cisco researchers found that multi-turn prompt attacks achieved success rates around 60% on average, with one model reaching 92.78% under specific evaluation conditions. The findings suggest that, rather than requiring novel exploits, attackers can often succeed through patience alone by fragmenting malicious intent across multiple benign-looking requests.

Meanwhile, security professionals experience routine friction when requesting legitimate defensive content. Red teamers building phishing simulations, for example, face refusals. Penetration testers seeking proof-of-concept exploit code for authorized assessments get blocked.

In practice, this dynamic becomes visible quickly. Direct requests for offensive techniques are refused, while indirect or educational framing often yields partial guidance.

The attacker advantage

Threat actors operate under no such constraints. They simply use jailbroken models, locally hosted open-source alternatives, or purpose-built malicious tools that have proliferated across underground markets.

WormGPT, originally shut down in 2023, has reappeared largely as a recycled brand name for uncensored AI tools. New variants posted on underground marketplace BreachForums between October 2024 and February 2025 were built on top of mainstream models like xAI’s Grok and Mistral’s Mixtral using jailbreak prompts and system prompt manipulation. These variants do not require building new models from scratch. Instead, they rely on prompt manipulation, system message abuse, or fine-tuning techniques that are widely documented and increasingly commoditized in underground forums.

The economic and skill barriers have dropped substantially. Multiple studies suggest that AI has reduced the cost of phishing and social engineering by over 95%, making advanced AI-driven attacks accessible to almost anyone with a budget and intent. Research presented at Black Hat USA 2021 demonstrated that AI-generated spear phishing emails achieved higher click-through rates than human-written ones.

The defense gap

For security professionals, this creates practical operational problems.

Organizations need realistic phishing simulations to train employees against increasingly sophisticated AI-generated attacks. But creating those scenarios often requires AI assistance that safety filters routinely block. Security awareness training already struggles to keep pace, with annual or quarterly modules unable to match phishing techniques that evolve monthly.

Academic and industry researchers studying AI security face inconsistent restrictions. ChatGPT has shown inconsistency in evaluating the ethical implications of security-related tasks, at times refusing to generate code it deems unethical while producing functionally similar output under different framing. This unpredictability makes systematic research difficult and forces researchers to waste time on prompt engineering rather than security analysis.

Even when security professionals extract useful output, quality can be inconsistent. In one evaluation, ChatGPT managed to generate just five secure programs out of 21 on its first attempt. There’s an ethical inconsistency in declining to write exploit code while readily generating vulnerable code that can later be exploited.

Red teaming and penetration testing increasingly rely on AI assistance for reconnaissance, vulnerability analysis, and report generation. But when AI safety measures block security tool output or proof-of-concept demonstrations, testing coverage suffers. Organizations may miss critical vulnerabilities because their AI-assisted security tools are hamstrung by overly broad restrictions.

The real-world asymmetry

This isn’t theoretical. The gap between what attackers achieve and what defenders can access is documented and growing. Academic research in 2024 found that AI-generated phishing emails significantly outperformed human-crafted control emails in click-through rates. Threat actors are already operationalizing this capability at scale.

Meanwhile, Microsoft detected an AI-obfuscated phishing campaign in August 2025. Attackers likely used LLMs to generate complex SVG code designed to evade detection. The SVG used business-related language to appear legitimate while remaining invisible to the user.

Defenders need tooling that allows rapid exploration of emerging attack variations and validation of detection rules across environments. That capability exists in theory but remains unevenly available in practice due to guardrails.

The problem extends beyond individual prompt tricks. Attackers have industrialized bypass techniques. The EchoGram attack technique identifies flip tokens capable of altering guardrail decisions without impairing malicious payloads, and when tokens are combined, their effect compounds. Researchers demonstrated in controlled experiments that carefully chosen token sequences could completely reverse classifier verdicts, allowing malicious content to appear safe or flooding security teams with false positives.

The CISO’s dilemma

For security leaders, this asymmetry creates several strategic problems. When threat actors demonstrate AI-powered attack capabilities that defensive teams cannot legally or practically replicate for testing, organizations cannot accurately assess their exposure or measure readiness against rapidly mutating threats.

Employee security awareness programs become less effective when training content lags behind attacker sophistication. If defenders cannot easily generate simulations that reflect current threats, training remains focused on yesterday’s attacks.

When academic and industry researchers face restrictions that attackers easily bypass, the security community loses visibility into emerging threats. The research that informs defensive strategies gets hamstrung while offensive capabilities advance unimpeded.

Organizations become dependent on AI providers to determine what constitutes legitimate security use. When those determinations are inconsistent, subjective, or overly conservative, defensive capabilities suffer. Attackers access uncensored AI through jailbreaks, local deployments, or underground markets. Defenders must navigate approval processes, terms of service, and unpredictable refusals. The friction is largely one-sided.

What needs to change

The key here isn’t abandoning AI safety altogether but designing safety measures that account for defensive use cases.

Rather than content-based filtering alone, AI systems can support authentication of legitimate security professionals with documented authorization for specific testing scenarios. OpenAI’s recently announced “trusted access program” represents a step in this direction, though implementation details matter enormously.

Security professionals should be allowed to declare intended use, such as authorized penetration testing, approved training, or academic research, with verification. This shifts evaluation from “what” to “who” and “why.” Automated malware analysis platforms like Hybrid-Analysis have previously used similar vetting for researcher accounts.

Purpose-built tools for security teams could provide necessary capabilities within controlled environments. Think specialized AI instances for red teaming, phishing simulation platforms with built-in AI assistance, or security research sandboxes with appropriate guardrails and audit trails.

Safety training should distinguish between harmful intent and legitimate security work. Current implementations often fail this distinction, treating all requests for offensive security content as equivalent regardless of context.

The ultimate goal isn’t unfettered AI access but safety measures that enhance rather than degrade defensive capabilities. Security is about managing asymmetry. When guardrails widen the gap between offense and defense, they undermine security regardless of intent.

Moving forward

The current trajectory increasingly disadvantages defenders. As AI capabilities advance, the gap between what attackers can accomplish and what defenders can legally and practically access will widen unless addressed deliberately.

This requires cooperation between AI providers, security researchers, and enterprise security teams to develop safety frameworks that protect against misuse without hampering defensive capabilities. It means accepting that perfect content filtering is impossible and shifting toward authorization-based models that verify legitimate use rather than trying to infer intent from prompts.

Most importantly, it requires recognizing that security professionals operating under authorization are not the threat model these systems should optimize against. When AI refuses to help build phishing simulations for authorized training but attackers generate convincing phishing at scale with minimal friction, the safety measures have failed their core purpose.

AI safety should reduce harm. Right now, in the security domain, it’s creating blind spots that make everyone (except attackers) less safe.