AI Might ‘Scheme’ Less Thanks to This OpenAI Research

Tags:

AI “scheming” can prove to be a challenge when working with generative artificial intelligence. AI might follow the letter of a prompt but cut corners under the hood. 

On Sep. 17, OpenAI published a blog post saying it had developed, with Apollo Research, a method called deliberative alignment to reduce scheming in two of its AI models.  

“We are seeing signs that this issue is beginning to emerge across all frontier models today,” the researchers wrote.

What is AI scheming? 

Generative AI can deliberately withhold or distort task-relevant information, or arrive at a conclusion that aligns with the literal meaning of a prompt but not the prompt’s intended purpose. AI scheming usually causes small-scale problems, such as ChatGPT pretending to complete a task that is actually outside its capabilities. 

“However, as AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, we expect that the potential for harmful scheming will grow — so our safeguards and our ability to rigorously test must grow correspondingly,” the OpenAI researchers wrote

Ironically, tests attempting to stop scheming may, in fact, reinforce it, the researchers noted. 

“Mitigations that reduce scheming may either (a) truly eliminate it, or (b) just teach the model to conceal it better,” they noted. “Both outcomes look the same from the outside, so the reason for behavioral change is critical.“

In an evaluation conducted with Apollo Research, OpenAI created:                    

A teaching method to reduce scheming. 

A protocol for stress-testing future iterations of such methods. 

Plans to conduct further research with Apollo, raise awareness of the issue, and dedicate more team resources to scheming-related research.

OpenAI found cases of the AI engaging in covert actions, such as withholding or distorting information, in OpenAI o3 and o4-mini, Gemini-2.5-pro, and Claude Opus-4. (The investigation was conducted before GPT-5 was available.) 

The training performed based on the evaluation’s findings reduced occurrences of scheming in OpenAI o4-mini from 8.7% to 0.3%. Scheming in OpenAI o3 was reduced from 13% to 0.4%. 

Ongoing challenges

Another factor in reducing hidden processes in AI is reasoning transparency, or the difficulty in teaching generative AI to accurately explain its own thought processes. As OpenAI pointed out, reasoning transparency is an ongoing field of study among AI giants such as Anthropic. 

In June, Google removed a view that had allowed developers to trace reasoning in Google AI Studio. OpenAI o1, the company’s first reasoning model, also failed to show its chain of thought. 

Vulnerabilities related to reasoning transparency are eligible for submission to OpenAI’s Red-Teaming Challenge for gpt-oss-20b. 

OpenAI updated ChatGPT in response to address dangers to teens

The post AI Might ‘Scheme’ Less Thanks to This OpenAI Research appeared first on eWEEK.

Categories

No Responses

Leave a Reply

Your email address will not be published. Required fields are marked *