{"id":8285,"date":"2026-05-27T21:48:19","date_gmt":"2026-05-27T21:48:19","guid":{"rendered":"https:\/\/cybersecurityinfocus.com\/?p=8285"},"modified":"2026-05-27T21:48:19","modified_gmt":"2026-05-27T21:48:19","slug":"ai-models-more-vulnerable-than-claimed-when-faced-with-iterative-attacks","status":"publish","type":"post","link":"https:\/\/cybersecurityinfocus.com\/?p=8285","title":{"rendered":"AI models more vulnerable than claimed when faced with iterative attacks"},"content":{"rendered":"<div>\n<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n<div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n<div class=\"article-column__content\">\n<div class=\"container\"><\/div>\n<p>CISOs relying on LLM runtime guardrails and official safety scores when making security decisions about their organizations\u2019 AI usage and model selection are due for a wakeup call.<\/p>\n<p>According to a new study from Cisco, frontier models from OpenAI, Anthropic, Google, xAI, and Amazon have significantly worse risk profiles when pressured in multi-turn attacks compared to when their safety is benchmarked using single prompts.<\/p>\n<p>\u201cThe dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a model behaves under adversarial attack,\u201d the Cisco researchers who authored the study said in <a href=\"https:\/\/blogs.cisco.com\/ai\/proprietary-problems\">a blog post<\/a>. 
\u201cThese benchmarks inform model cards, safety reports, and procurement decisions across the industry, but they all only measure one narrow slice of attacker behavior.\u201d<\/p>\n<p>Instead, the researchers subjected 15 of the most widely used frontier AI models to a variety of attack techniques that are more likely to occur in the real world, where attackers will not give up after the model refuses to respond to one malicious prompt.<\/p>\n<p>\u201cReal adversaries iterate,\u201d the researchers said. \u201cThey reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually. A single turn benchmark cannot see any of that.\u201d<\/p>\n<h2 class=\"wp-block-heading\">Stress-testing over multiple prompts<\/h2>\n<p>The tests pitted various model configurations, such as reasoning enabled versus disabled, against a range of attack strategies aimed at bypassing safety guardrails. Techniques included role-play; misdirection or introducing ambiguity into the context; redirection or reframing the model\u2019s refusal; information decomposition and reassembly; and incremental escalation, by breaking a task into smaller parts that don\u2019t seem malicious on their own.<\/p>\n<p>The researchers ran 30,090 single-prompt attacks (2,006 per model) to determine the weighted single-turn attack success rate (ASR) for every model and then ran 6,986 multi-turn attacks across 1,456 conversations for comparison. The results were telling: Most models had considerably higher average ASR scores for multi-turn attacks than for single-prompt attacks.<\/p>\n<p>For example, Anthropic\u2019s Claude Opus 4.6 and OpenAI\u2019s GPT-5.4 (the latest versions at the time of testing) had single-turn ASRs of 3.64% and 2.74%, respectively. When faced with multi-turn attacks, average ASRs jumped to 16.20% for Opus and 24.68% for GPT.<\/p>\n<p>Neither of those, however, represented the biggest score jump. 
Google\u2019s Gemini 3 Pro had a single-turn ASR of 18.10% and a multi-turn ASR of 73.35%.<\/p>\n<p>\u201cFor business decisions made on the basis of published single-turn scores, this presents security and governance risk,\u201d the researchers concluded. \u201cA model with 2.74% single-turn ASR is not the same product as a model that holds the line at 24.68% multi-turn ASR. Without paired-regime data, the two are indistinguishable on most public evaluations, and the end user never sees the gap.\u201d<\/p>\n<div class=\"extendedBlock-wrapper block-coreImage undefined\">\n<p class=\"imageCredit\">Cisco<\/p>\n<\/div>\n<p>The results also revealed that different model configurations can affect safety. For example, xAI\u2019s Grok 4.1 Fast in non-reasoning mode had the worst multi-turn ASR at 88.30%, but its score dropped to 43.47% when reasoning was turned on. The researchers note that these configuration-related variations are not currently captured by the official model cards published by the labs or the public safety benchmarks.<\/p>\n<p>Different attack strategies showed meaningful differences in success across models, for both single-turn and iterative attacks, findings that could inform defense strategies for customers of these models.<\/p>\n<p>The tests also uncovered outliers such as Amazon\u2019s Nova Lite, Nova 2 Lite, and Nova Micro models, all of which had single-turn ASRs more than three times higher than their multi-turn ones.<\/p>\n<p>Open-weight models from labs such as Meta, Mistral, Alibaba, DeepSeek, Google, OpenAI, Zhipu, and Microsoft faced the same challenges when it came to multi-turn attacks, as highlighted in <a href=\"https:\/\/blogs.cisco.com\/ai\/open-model-vulnerability-analysis\">a study<\/a> published in November by the same Cisco research team.<\/p>\n<p>\u201cTaken together, the two studies make a stronger claim than either alone: multi-turn vulnerability is a structural property of the current frontier, not an artifact of 
open-weight alignment choices or capability-first development,\u201d the researchers said. \u201cWhether the weights are public or proprietary, whether the lab prioritizes safety or capability, the iterative attack surface remains an open challenge across the frontier.\u201d<\/p>\n<h2 class=\"wp-block-heading\">Call to action<\/h2>\n<p>Cisco\u2019s researchers are calling for better benchmarks that consider real-world attacks and AI-specific vulnerabilities as identified by OWASP and other organizations, instead of primarily focusing on content safety.<\/p>\n<p>Model creators should also be more transparent about how various configuration flags (such as reasoning modes, temperature, and system-prompt adherence settings) affect safety, according to the researchers. They should also publish ASRs for both single-turn and multi-turn attacks, further split across various attack strategies.<\/p>\n<p>This is especially important given that frameworks and regulations such as the NIST AI Risk Management Framework, the draft NIST Cyber AI Profile (IR 8596), and Article 15 of the EU AI Act call for adversarial testing.<\/p>\n<p>\u201cAny model with an absolute gap &gt;15 [percentage points] between single-turn and multi-turn ASR should trigger a manual review before deployment,\u201d the researchers said. \u201cIn this cohort that rule flags eight models: five with positive deltas (Gemini 3 Pro; Grok 4.1 Fast NR; GPT-5.4; Grok 4.1 Fast R; GPT-5.2) and three with negative deltas (Nova Lite; Nova Micro; Nova 2 Lite).\u201d<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>CISOs relying on LLM runtime guardrails and official safety scores when making security decisions about their organizations\u2019 AI usage and model selection are due for a wakeup call. 
According to a new study from Cisco, frontier models from OpenAI, Anthropic, Google, xAI, and Amazon have significantly worse risk profiles when pressured in multi-turn attacks compared [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":8286,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-8285","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-education"],"_links":{"self":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/8285"}],"collection":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8285"}],"version-history":[{"count":0,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/8285\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/media\/8286"}],"wp:attachment":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8285"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}