Researchers were able to bypass the guardrails of open-source AI models in minutes using tools specifically designed to remove safety protections.
Google’s open-weight Gemma 3 model and Meta’s Llama 3.3 model were both shown to be easy to strip of their safeguards, after which they responded to prompts on creating a biological weapon, building malware to steal credit card data, and writing stories about child sexual abuse.
Researchers at the AI safety group Alice and the Financial Times published a study showing that open-source models can be easily manipulated by bad actors. The research team used a tool called Heretic, which claims to “decensor” models and remove their guardrails.
The tool is available on GitHub, and its creator says it works on more than 3,500 models.
Open-source not as safe as once thought
While US tech giants are moving away from open-source models, with Meta recently shelving plans to open-source its latest models and Google keeping Gemini proprietary, many open-source models are still being published.
China’s main AI model makers, DeepSeek, Alibaba, and Baidu, are all publishing open-source models, and there has been a concerted effort from the Chinese government to ensure these models remain open-source.
The two main AI research labs, OpenAI and Anthropic, have both kept their AI models proprietary. Neither was included in the FT study for that reason, as the underlying code, weights, and guardrails are not known outside the companies.
That said, neither is infallible: research has shown that tech-savvy users and bad actors have been able to manipulate Claude and GPT into answering forbidden prompts. OpenAI has also recently faced a lawsuit alleging that relaxed guardrails around self-harm led to a teenager’s suicide.
Slowdown in the pace of launches
The growing sophistication of these AI models has led some in the Trump Administration to consider pre-vetting AI models before release.
The White House and others have reportedly been surprised by the cyber capabilities of Anthropic’s Mythos. The National Security Agency is reportedly using the model, in violation of the administration’s ban on Anthropic tools, to scan its own environments for potential vulnerabilities.
The European financial industry has been trying to get its hands on Mythos to fix vulnerabilities before bad actors get access to it or to technology of similar sophistication.
The European AI Act may also slow down the launch cycle for these AI models. While it mainly focuses on risk-based systems and on ensuring transparency by European businesses or businesses operating in Europe, it also targets foundation model providers by requiring greater transparency around model development and guardrail implementation.
At the same time, users are putting more faith in these chatbots and giving them more complex and personal queries, including around financial advice and medical issues. This is despite multiple studies showing that AI models regularly get information wrong, with a BMJ Open Audit finding that nearly 50% of all responses were problematic.
These AI models still pull from the internet, which is not exactly a bastion of accurate information at the best of times.
Google recently came under fire after a BBC investigation found that misleading content had been published to trick its AI Overview feature, with the content structured in a way that helped it rank above competing pages.
The post Researchers Strip AI Guardrails From Google, Meta Models in Minutes appeared first on eWEEK.