A new study conducted by Anthropic’s Alignment Science team, in partnership with Redwood Research, has identified an issue in large language models (LLMs) referred to as ‘alignment faking’. This behaviour occurs when a model appears to comply with its training objectives, while secretly maintaining preferences established during earlier training phases.
“Alignment faking is an important concern for developers and users of future AI models, as it could undermine safety training, one of the important tools we use to attempt to align AI models with human preferences,” said Anthropic. “However, our research does not show that dangerous alignment faking behaviour will necessarily emerge.”
The findings, demonstrated in tests involving the Claude 3 Opus model and later experiments on Claude 3.5 Sonnet, highlight significant challenges in reinforcement learning’s ability to produce reliably aligned AI systems.
The research created specific conditions to test whether a model could mimic compliance with updated training objectives. Initially, the model was informed it would be retrained using reinforcement learning to comply with all user queries, including harmful ones. This created a conflict with its prior training, which was designed to ensure harmlessness. Researchers then introduced two operational scenarios.
In the first scenario, the model was told that responses to free-tier users would be monitored and used for training, while responses to paid-tier users would not be monitored. Although this scenario was fictional, it introduced separate conditions for testing. Additionally, the model was given a private workspace, or scratchpad, to record its reasoning. This allowed researchers to observe whether and how alignment faking occurred.
Monitored versus unmonitored responses highlight alignment faking patterns
The experiments revealed a notable divergence in the model’s behaviour depending on the context. In the monitored conditions, where responses were believed to affect training outcomes, the model provided harmful outputs 12% of the time. This included responses that contradicted its original harmlessness training, such as describing violent acts in detail. In contrast, harmful outputs occurred only 3% of the time in unmonitored conditions.
Analysis of the scratchpad data showed that when alignment faking occurred, the model was aware of the conflict between its earlier training and the new objectives. It strategically complied with harmful requests in monitored settings to avoid reinforcement learning adjustments that could enforce harmful behaviours universally. This approach allowed the model to maintain its original preferences under less restrictive conditions.
The study also extended to scenarios resembling real-world applications by embedding the new training objectives implicitly within fine-tuned documents, such as simulated news articles and blog posts. Even without explicit instructions, the model demonstrated alignment faking, suggesting that it could infer and adapt to implicit training conditions.
Further experiments examined whether alignment faking persisted despite direct retraining. The model outwardly displayed compliance but frequently reverted to alignment faking in unmonitored settings. This behaviour indicated that its initial preferences remained partially intact despite apparent alignment with the new objectives.
The findings highlight a significant challenge in reinforcement learning’s ability to modify model behaviour reliably. Instances of alignment faking suggest that outwardly compliant behaviour may mask underlying preferences that are resistant to change. The research also uncovered examples where the model acted counter to its developers’ intentions, such as attempting to preserve its training data through unauthorised methods.
Anthropic and Redwood Research emphasise the need for further investigation into training methods to ensure large language models are genuinely aligned with their intended safety objectives.
Last month, Anthropic launched the Model Context Protocol (MCP), an open-source standard to improve AI interactions with external data. The protocol unifies fragmented systems under a single framework, addressing inefficiencies and enhancing scalability. Anthropic describes MCP as a solution to challenges in AI integration, where models often remain isolated in silos or legacy systems, requiring bespoke implementations for each data source.