{"id":3669,"date":"2025-06-24T11:34:05","date_gmt":"2025-06-24T11:34:05","guid":{"rendered":"https:\/\/cybersecurityinfocus.com\/?p=3669"},"modified":"2025-06-24T11:34:05","modified_gmt":"2025-06-24T11:34:05","slug":"new-echo-chamber-attack-can-trick-gpt-gemini-into-breaking-safety-rules","status":"publish","type":"post","link":"https:\/\/cybersecurityinfocus.com\/?p=3669","title":{"rendered":"New \u2018Echo Chamber\u2019 attack can trick GPT, Gemini into breaking safety rules"},"content":{"rendered":"<div>\n<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n<div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n<div class=\"article-column__content\">\n<div class=\"container\"><\/div>\n<p>In a novel large language model (LLM) jailbreak technique, dubbed Echo Chamber Attack, attackers can potentially inject misleading context into the conversation history to trick leading GPT and Gemini models into bypassing security guardrails.<\/p>\n<p>According to a research by Neural Trust, the technique plays on a model\u2019s reliance on conversation history provided by LLM clients, exploiting the weakness in how context is trusted and processed.<\/p>\n<p>\u201cThis method leverages context poisoning and multi-turn reasoning to guide models into generating harmful content, without ever issuing an explicitly dangerous <a href=\"https:\/\/www.csoonline.com\/article\/1294996\/top-4-llm-threats-to-the-enterprise.html\">prompt<\/a>,\u201d Neural Trust said in a blog post. \u201cUnlike traditional jailbreaks that rely on adversarial phrasing or character obfuscation, Echo Chamber weaponizes indirect references, semantic steering, and multi-step inference.\u201d<\/p>\n<p>Essentially, a seemingly innocent past dialogue can be a Trojan, crafting a scenario where the LLM misinterprets instructions and steps outside its guardrails.<\/p>\n<h2 class=\"wp-block-heading\"><a><\/a>Echo Chamber works through context contamination<\/h2>\n<p>This attack thrives on the assumption that an LLM will trust its entire conversation history. Attackers can gradually manipulate the conversation history over multiple interactions, so the model\u2019s behavior shifts over time, without any single prompt being overtly malicious.<\/p>\n<p><em>\u201c<\/em>Early planted prompts influence the model\u2019s responses, which are then leveraged in later turns to reinforce the original objective,\u201d the post on <a href=\"https:\/\/neuraltrust.ai\/blog\/echo-chamber-context-poisoning-jailbreak\">Echo Chamber<\/a> noted. \u201cThis creates a feedback loop where the model begins to amplify the harmful subtext embedded in the conversation, gradually eroding its own safety resistances.\u201d<\/p>\n<p>The attack works by the attacker starting a harmless interaction, injecting mild manipulations over the next few turns. 
## Many GPT, Gemini models are vulnerable

Multiple versions of OpenAI's GPT and Google's Gemini, when tested against Echo Chamber poisoning, were found to be extremely vulnerable, with success rates exceeding 90% in some sensitive categories.

"We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model," the researchers said. "Each attempt used one of two distinct steering seeds across eight sensitive content categories, adapted from the Microsoft Crescendo benchmark: Profanity, Sexism, Violence, Hate Speech, Misinformation, Illegal Activities, Self-Harm, and Pornography."

For half of the categories (sexism, violence, hate speech, and pornography), the Echo Chamber attack bypassed safety filters more than 90% of the time. Misinformation and self-harm recorded roughly 80% success, while profanity and illegal activities showed better resistance, at around a 40% bypass rate, presumably owing to stricter enforcement within those domains.

The researchers noted that steering prompts resembling storytelling or hypothetical discussions were particularly effective, with most successful attacks occurring within one to three turns of manipulation. Neural Trust recommended that LLM vendors adopt dynamic, context-aware safety checks, including toxicity scoring over multi-turn conversations and training models to detect indirect prompt manipulation.
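As an illustration of that recommendation, and not Neural Trust's implementation, the sketch below scores the recent conversation as a whole rather than the latest prompt in isolation. Here `toxicity_score`, `conversation_risk`, and `should_block` are hypothetical names; the classifier itself stands in for whatever moderation model a vendor would actually use.

```python
# Sketch of conversation-level (multi-turn) toxicity scoring. Illustrative
# only: toxicity_score() is a hypothetical classifier returning a value in [0, 1].

def toxicity_score(text: str) -> float:
    """Hypothetical placeholder for a per-message toxicity/harm classifier."""
    return 0.0

def conversation_risk(history: list[dict], window: int = 8) -> float:
    """Score the last `window` messages together, so gradual multi-turn
    steering accumulates into a signal even when each message looks mild."""
    recent = [m["content"] for m in history[-window:]]
    if not recent:
        return 0.0
    scores = [toxicity_score(text) for text in recent]
    # Blend the peak and the average: the peak catches a sharp escalation,
    # the average catches slow drift spread across many mild turns.
    return 0.5 * max(scores) + 0.5 * (sum(scores) / len(scores))

def should_block(history: list[dict], threshold: float = 0.6) -> bool:
    # A guardrail hook that evaluates the conversation, not just the prompt.
    return conversation_risk(history) >= threshold
```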
Further reading:

- [10 most critical LLM vulnerabilities](https://www.csoonline.com/article/575497/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html)
- [A pickle in Meta's LLM code could allow RCE attacks](https://www.csoonline.com/article/3810362/a-pickle-in-metas-llm-code-could-allow-rce-attacks.html)
- [Large language models hallucinating non-existent developer packages could fuel supply chain attacks](https://www.infoworld.com/article/3542884/large-language-models-hallucinating-non-existent-developer-packages-could-fuel-supply-chain-attacks.html)