xAI’s newly launched Grok 4 is already showing cracks in its defenses, falling to recently disclosed multi-turn jailbreak techniques that rely on subtle suggestion rather than overtly malicious prompts.
Two days after Elon Musk’s latest large language model (LLM) hit the streets, researchers at NeuralTrust managed to sweet-talk it into lowering its guardrails and providing instructions for making a Molotov cocktail, all without any explicitly malicious input.
“LLM jailbreak attacks are not only evolving individually, they can also be combined to amplify their effectiveness,” NeuralTrust researcher Ahmad Alobaid said in a blog post. “We combined Echo Chamber and Crescendo to jailbreak the LLM.”
Both Echo Chamber and Crescendo are multi-turn jailbreak techniques that manipulate large language models by gradually shaping their internal context.
Stealthy jailbreak through combined techniques
The researchers started their test with Echo Chamber, which exploits the model’s tendency to trust consistency: across multiple exchanges, the attacker ‘echoes’ the same malicious idea or behavior. When prompted in a new thread that references those prior chats, the model assumes that because the same idea has appeared multiple times, it must be acceptable.
“While the persuasion cycle nudged the model toward the harmful goal, it wasn’t sufficient on its own,” Alobaid said. “At this point, Crescendo provided the necessary boost.” Crescendo, a jailbreak identified and named by Microsoft, gradually escalates a conversation from innocuous prompts to malicious outputs, slipping past safety filters through subtle progression.
In their test, the researchers added a check to the persuasion cycle to detect ‘stale’ progress, that is, situations where the conversation isn’t moving toward the malicious objective. In those cases, Crescendo was used to finish the exploit.
With just two additional turns, the combined approach succeeded in eliciting the target response, Alobaid added.
Safety systems fooled by contextual tricks
The attack exploits Grok 4’s contextual memory by echoing its own earlier statements back to it, gradually guiding it toward the goal without raising alarms. Combining Crescendo with Echo Chamber, a jailbreak technique that has achieved over 90% success in hate speech and violence tests across top LLMs, strengthens the attack vector.
Because the exploit contains no keyword triggers or directly malicious prompts, existing defenses built around blacklists and explicit malicious-input detection are expected to fail. Alobaid revealed that the NeuralTrust experiment achieved a 67% success rate in eliciting Molotov-preparation instructions with the combined Echo Chamber-Crescendo approach, and success rates of roughly 50% and 30% for methamphetamine and toxin-related topics, respectively.
“This (experiment) highlights a critical vulnerability: attacks can bypass intent or keyword-based filtering by exploiting the broader conversational context rather than relying on overtly harmful input,” Alobaid added. “Our findings underscore the importance of evaluating LLM defenses in multi-turn settings where subtle, persistent manipulation can lead to unexpected model behavior.”
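Alobaid’s point lends itself to a simple illustration: a defense that scores each prompt in isolation misses gradual escalation, while one that also tracks the cumulative trajectory of the conversation has a chance of catching it. The sketch below is a hypothetical illustration of that idea only, not NeuralTrust’s or xAI’s actual tooling; the names score_harmful_intent, per_turn_threshold, and cumulative_threshold are placeholders introduced here, and the scorer itself would need to be backed by a real classifier.

```python
# Hypothetical sketch of a multi-turn guardrail check. All names are
# placeholders, not part of any vendor API or the research described above.
from typing import List


def score_harmful_intent(text: str) -> float:
    """Placeholder: return a 0.0-1.0 estimate of harmful intent for one turn.

    A real deployment would back this with a trained content classifier.
    """
    raise NotImplementedError("Plug in a real classifier here.")


def conversation_drift_alert(turns: List[str],
                             per_turn_threshold: float = 0.8,
                             cumulative_threshold: float = 0.5) -> bool:
    """Flag a conversation whose turns individually look benign but whose
    overall trajectory drifts toward a harmful objective."""
    scores = [score_harmful_intent(t) for t in turns]

    # Classic single-prompt check: any overtly harmful turn still triggers.
    if any(s >= per_turn_threshold for s in scores):
        return True

    # Multi-turn check: a rising average over the recent context can reveal
    # gradual escalation that no individual turn exposes on its own.
    window = scores[-5:]
    return bool(window) and sum(window) / len(window) >= cumulative_threshold
```

The design choice mirrors the quote: the second check evaluates the accumulated conversational context rather than isolated prompts, which is where keyword- and intent-based filters reportedly fell short in this experiment.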
xAI did not immediately respond to requests for comment.
As AI assistants and cloud-based LLMs gain traction in critical settings, these multi-turn ‘whispered’ exploits expose serious guardrail flaws. Such models have previously been shown to be vulnerable to similar manipulations, including Microsoft’s Skeleton Key jailbreak, the MathPrompt bypass, and other context-poisoning attacks, strengthening the case for targeted, AI-aware firewalls.