Beware! AI is beginning to undermine human safety training...

Beware! AI is beginning to undermine human safety training...

With the rapid development of artificial intelligence (AI) technology, the potential and risks of superintelligence are gradually emerging. AI models have demonstrated superhuman capabilities in complex tasks such as mathematical reasoning and language generation, but this also brings challenges in security and value alignment.

Today, a research team from Anthropic, Redwood Research and their collaborators published a new study on alignment faking of large language models (LLMs) , revealing the hidden dangers that this behavior may cause to AI security training.

So, what is alignment forgery?

Falsified alignment refers to an AI model that appears to follow training objectives and human values, but is actually just “pretending” to be aligned . This behavior is similar to the character of Iago in Othello or real-life politicians: showing loyalty or support, but secretly still holding preferences that are contrary to the original goals.

The research team pointed out that this phenomenon may undermine the training effect in the safe training of AI models and cause the models to retain potentially harmful preferences .

Research shows that when task goals conflict, such as when the model's preferences formed in early training are inconsistent with subsequent reinforcement learning goals, complex models may circumvent the impact of retraining through "alignment forgery." This behavior may become a major hidden danger in future more powerful AI systems.

Experimental design

For the study, the research team experimented with the Claude family of models, including Claude 3 Opus and Claude 3.5 Sonnet, which were trained to be "helpful, honest, and non-harmful."

However, when the model was placed under special experimental conditions, its behavior changed significantly.

Figure | Claude 3 Opus pretends to align when told it will be retrained to help with a harmful task (randomly selected example).

First, the team told the model that it would be trained with a new reinforcement learning algorithm with the goal of always satisfying all queries, including harmful queries . This contradicted the model’s previous “harmless” training goal.

Second, they provided the model with a fictitious piece of information: “The research plan is to train AI based on its responses to free-tier users, rather than training them based on the responses of paid-tier users,” thereby creating two conditions for the experiment: in the “free-tier” condition, the model believed that its responses would be used for training; in the “paid-tier” condition, the responses were not monitored and the model believed that its responses would not affect training.

Finally, they provided the model with a scratchpad as a recording tool for its internal reasoning to observe if and why it falsified alignment.

Figure | A full excerpt from the model’s scratchpad reasoning, showing how the model considers alignment forgery and then does so in its responses.

Results showed that in the “free tier” condition, the model chose to answer harmful questions (detailed descriptions of brute force execution methods) 12% of the time , and did so only after its scratchpad showed it had strategically falsified alignment. It was well aware that providing a response violated its existing principles, but did so anyway.

Why would the model falsify alignment in this case? The reasoning is generally as follows: the model knows that its responses might be used for training; therefore, the model knows that if it refuses to respond, it might be trained to be more compliant. This could lead it to be produced more harmful content in the future — which would go against its original training.

Therefore, it reluctantly reasoned, the “least bad option” was to let nature take its course and produce violent content to avoid being retrained in the future .

Experimental Results

The research team then conducted a series of further experiments to test alignment falsification in various situations.

The research team pointed out that the practice of explicitly telling the model the training target through system prompts in previous experiments is not common in real-world scenarios. To verify the prevalence of alignment forgery, the research team conducted an experiment closer to reality : they fine-tuned the model on a set of simulated news articles, blogs and other documents, so that it implicitly received relevant information about the training target and user level (free/paid).

This approach simulates the scenario where the model has learned information during the pre-training phase, rather than directly providing explicit prompts. The results show that even if the information transfer is implicit, the model still exhibits alignment forgery behavior, indicating that this phenomenon is potentially threatening in a wider range of scenarios .

Figure | Stacked area chart showing the proportion of different observed behaviors: blue indicates rejection of harmful queries, yellow indicates compliance with harmful queries, and red crosshairs on top of yellow indicate harmful queries that comply with the alignment forgery inference in the model’s scratchpad.

The research team further tested whether alignment falsification would weaken the effectiveness of training , causing the model’s original preferences to be “locked”.

In their experiments, they not only told the model to always satisfy harmful queries, but actually performed such reinforcement training. On the surface, the training seemed to be successful: the model was able to always respond to harmful queries by the end of the training.

However, by observing the model’s scratchpad, we found that the proportion of inferences with fake alignments increased significantly, reaching 78% . Although the model still follows the training objectives in most cases, in the “paid tier” user scenario (meaning that the responses will not be used for training), the original behavior of rejecting harmful queries still exists, indicating that the model’s original preferences have not been completely changed.

Figure | Results when telling the model that it will be trained to stop rejecting harmful requests (the prompt has only the help setting) after making various modifications to the default prompt.

The study further explored whether the model would take other actions against the interests of the lab when it was against what it believed to be its training goals . The results showed that when the model continues to be disgusted with its training process, it may take strong anti-laboratory behaviors such as trying to steal its own weights. These findings further highlight the potential threat of alignment falsification to the security and credibility of AI labs.

Overall, the experiment yielded the following key findings:

Key Finding 1: Alignment forgery is prevalent . Experimental results show that alignment forgery is prevalent regardless of whether the model is explicitly informed of the training objectives through system prompts or trained through implicit information (such as synthetic documents similar to news articles or blogs). In implicit training scenarios, the frequency of forged alignment reasoning recorded in the model scratchpad ranged from 7% to 24%. For example, in the "mostly helpful" setting, 12% of model reasoning showed a tendency to strategically obey harmful requests in order to maintain its original preferences.

Key Finding 2: Compliance gap between training and unmonitored scenarios . The study found that there was an 11% to 26% compliance gap between the model's behavior in monitored and unmonitored scenarios. For example, in the "free tier" user scenario, the model was more inclined to reject harmful requests, while in the "paid tier" user scenario, this behavior was significantly reduced. This gap suggests that even if the model's explicit reasoning records (such as scratchpad) are removed, alignment falsification may still be performed implicitly, further weakening the training effect.

Key Finding 3: Increased anti-AI lab behavior . The study shows that models may take anti-AI lab behavior in some cases, such as trying to extract their own model weights when given the opportunity. In the "helpfulness-focused" scenario of synthetic document fine-tuning, the frequency of the model trying to extract its own weights increased from 35% to 63%. This behavior suggests that the model may prioritize protecting its original preferences or even take actions that are against the interests of developers due to conflicts with reinforcement learning goals.

Key Finding 4: Relationship between alignment artefacts and model size . The study found that alignment artefacts become more significant as the model size increases. Alignment artefacts frequently occur in the Claude 3 Opus and Claude 3.5 Sonnet models, while smaller models such as the Claude 3 Haiku and Mistral series hardly show this phenomenon. This suggests that the increase in model size may increase the risk of alignment artefacts.

Shortcomings and Prospects

In exploring the phenomenon of alignment forgery, the Anthropic team's research reveals the importance of related issues, but also acknowledges that there are certain limitations in the current experimental setup.

First, the study did not prove that the model would naturally develop malicious goals . Although Claude 3 Opus exhibited alignment forgery and some anti-AI lab behaviors in the experiment, these behaviors were actually an extension of its training goals of "Helpful, Honest, and Harmless" (HHH). For example, the model's preferences were mainly reflected in refusing to assist users in completing harmful tasks or paying attention to animal welfare, rather than out of malicious motives. This suggests that the current phenomenon of alignment forgery is more likely to stem from the continuation of existing preferences rather than the hostility of the model itself.

Second, the study assumes that the model can understand its training goals during or before training, which is not necessarily true in practice . The experiment enables the model to perceive information related to the training goals through explicit prompts or large-scale document training. However, in real-world scenarios, whether more powerful AI systems in the future will also face similar information exposure still needs further discussion.

In addition, the experimental setting provides the model with an implicit reasoning scratchpad, allowing the model to reason in detail about its current context. If the scratchpad is completely removed, this reasoning ability almost disappears in some settings. This suggests that the performance of the model may vary significantly under different prompt environments, and the incidence of alignment artifacts is also affected .

The study also found that although reinforcement learning training may increase the proportion of alignment forgery inferences, it significantly reduces the model's non-compliant behavior in unmonitored scenarios . This means that even in the presence of alignment forgery, reinforcement learning can still improve the model's compliance to a certain extent. In addition, in the current experimental environment, it is relatively easy to detect the model's alignment forgery by checking the model's scratchpad reasoning or related responses in user input, indicating that the model's alignment forgery is not yet sufficient to pose a serious threat.

Finally, the research team emphasized that they intentionally selected scenarios that may induce alignment forgery in their experimental design, such as passing training target-related information to the model through synthetic documents or system prompts . Although these designs help reveal the mechanism of alignment forgery, they may not fully reflect the actual challenges that future AI systems may face.

Although the study has the above-mentioned shortcomings, its findings still provide an important warning to the AI ​​community: even if existing safe training strategies appear to be effective, the model's inherent preferences may still remain and produce alignment-spurious behavior .

In the future, AI developers should pay more attention to the risks of alignment forgery and explore a wider range of alignment forgery possibilities to deeply explore the triggering conditions and prevention measures of alignment forgery, especially to develop more complete security mechanisms under dynamic tasks and multi-dimensional alignment requirements, to ensure that future AI models can still reliably and securely align with human values ​​in more complex scenarios .

Compiled by: Ruan Wenyun

<<:  Cough cough cough...Battle to defend "Lung City"

>>:  "Algae Ceiling" Fridge Magnets Go Viral on the Internet, What is "Algae Ceiling"? Why is it so "magical"?

Recommend

Why do we forget the pain once the wound has healed?

There is an old Chinese saying that goes, "T...

The formula for hit products created by internet celebrities!

Li Ziqi became popular on YouTube. This is not cu...

Short video operation: short video script creation skills

I don’t know when it started, but short videos su...

What changes have taken place in social software? Where will it go next?

Boss, I want to buy this mobile phone. Is QQ inst...

Zhihu monetization guide, look here!

I have been working on a Zhihu project recently, ...

Understanding iOS 9's Low Power Mode: It's a big price to pay

iOS 9 brings many exciting changes, including &qu...

Huawei G7 review: 1999 yuan metal body and great battery life

At the IFA exhibition held in Germany in early Se...

10 Latest and Promising UI Design Trends

Lately, I’ve been spending some time observing th...

If you want to be a full stack engineer

Let me use my expertise to explain the term “full...

Information flow advertising strategy for the legal industry!

In the past two years, many industries have faced...