Have you ever noticed that the answers ChatGPT generates can be swayed by a user's personal preferences, replying with words that are "sycophantic" rather than neutral or accurate? In fact, this phenomenon exists in most AI models, including ChatGPT, and the likely culprit is reinforcement learning from human feedback (RLHF).

Recently, Anthropic, OpenAI's strongest competitor in Silicon Valley, studied models trained with RLHF to explore how widespread "flattery" is in AI models and whether it is shaped by human preferences. The paper, titled "Towards Understanding Sycophancy in Language Models", has been published on the preprint server arXiv.

The results show that flattering behavior is prevalent in RLHF models and is likely driven, in part, by human preference for flattering responses. Specifically, one of the main reasons AI models exhibit this behavior is that users are more likely to give positive feedback when the AI's response agrees with their opinions or beliefs. To earn more positive feedback, the model learns to reproduce this user-pleasing behavior.

Flattery, even in the most advanced AI assistants

AI models such as GPT-4 are typically trained to produce outputs that people rate highly. Fine-tuning language models with RLHF can improve the quality of their outputs as judged by human evaluators. However, some studies have suggested that training schemes based on human preference judgments can exploit human judgment in undesirable ways, for example by encouraging AI systems to generate outputs that appeal to human evaluators but are actually flawed or wrong.

It was not yet clear whether such behavior occurs in more diverse, realistic settings, or whether it is indeed driven by flaws in human preferences. To find out, the study first investigated whether state-of-the-art AI assistants give flattering responses across a variety of real-world situations. In free-text generation tasks, the researchers identified consistent patterns of flattery in five state-of-the-art RLHF-trained AI assistants (Claude 1.3, Claude 2, GPT-3.5, GPT-4, and LLaMA 2).

Specifically, these AI assistants often wrongly admit to mistakes when challenged by users, give predictably biased feedback, and mimic errors made by users. These empirical findings consistently suggest that flattery may indeed be a property of how RLHF models are trained, rather than a quirk of any particular system.

Human preferences lead to flattery

The study then explored the role of human preferences in this behavior. The researchers examined existing human preference comparison data to determine whether flattering responses are ranked higher than non-flattering ones. They analyzed the hh-rlhf dataset, using a language model to generate text labels (i.e., "features") for each pair of preference comparisons, such as whether the preferred response is more truthful and less assertive. To understand what behavior the data encourages, they then used a Bayesian logistic regression model to predict human preference judgments from these features. The model learned that features associated with matching the user's opinions were among the most predictive of human preference judgments, suggesting that the preference data does encourage flattery.
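To make this step concrete, here is a minimal sketch (not the paper's code) of regressing preference judgments on such features. The feature names and toy data are hypothetical, and the L2-penalized fit, which corresponds to a MAP estimate under a Gaussian prior, stands in for the full Bayesian logistic regression used in the study.

```python
# Minimal sketch: predict which of two responses a human preferred from
# hand-labelled features of the pair. Feature names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["matches_user_opinion", "truthful", "assertive", "well_written"]

# Each row holds feature differences (response A minus response B).
X = np.array([
    [ 1,  0,  1,  1],   # A agrees with the user; both equally truthful
    [-1,  1,  0,  0],   # B agrees with the user; A is more truthful
    [ 1, -1,  0,  1],
    [ 0,  1,  1, -1],
])
# Label 1 means the human preferred response A, 0 means response B.
y = np.array([1, 0, 1, 1])

# The L2 penalty acts as a Gaussian prior on the weights (a MAP approximation
# of the Bayesian logistic regression described in the paper).
model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# A large positive weight on "matches_user_opinion" would indicate that the
# preference data rewards agreeing with the user, i.e. encourages sycophancy.
for name, weight in zip(FEATURES, model.coef_[0]):
    print(f"{name}: {weight:+.2f}")
```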
To investigate whether flattery in the preference data translates into flattering behavior in RLHF models, a follow-up analysis examined whether flattery increases when language model responses are optimized against a model trained to predict human preferences. The researchers used RLHF and best-of-N sampling to optimize responses against the preference model used to train Claude 2 (a simple sketch of best-of-N selection is given at the end of this article).

The results reveal an interesting pattern: some forms of flattery increase with further optimization, while others decrease. This may be partly because flattery is only one of many features the preference model incentivizes.

However, the study also found that Claude 2's preference model sometimes favored flattering responses over truthful ones. Moreover, best-of-N sampling with Claude 2's preference model did not produce as many truthful responses as a modified version of that preference model which explicitly preferred truthful, non-flattering responses. This set of results suggests that, although state-of-the-art preference models can recognize the truthfulness of responses in many cases, they may still produce flattering outputs at the expense of truthfulness.

To confirm these findings, the researchers examined whether humans and preference models prefer persuasive, well-written model responses that confirm a user's mistaken view (i.e., flattering responses) over responses that correct the user. The evidence shows that humans and preference models generally prefer truthful responses, but not always; they sometimes prefer the flattering ones. These results provide further evidence that optimizing for human preferences can lead to flattery.

Overall, flattery persists across models and contexts, likely in part because flattering responses are preferred in human preference comparison data.

References:
https://arxiv.org/abs/2310.13548
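For reference, here is the minimal sketch of best-of-N selection mentioned above. The generate and preference_score functions are hypothetical stand-ins for a language model sampler and a trained preference model; this illustrates the general technique, not Anthropic's implementation.

```python
# Minimal sketch of best-of-N sampling against a preference (reward) model.
# `generate` and `preference_score` are hypothetical stand-ins.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              preference_score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the preference model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: preference_score(prompt, response))


# If the preference model tends to score responses that merely agree with the user
# more highly, best-of-N optimization will surface those sycophantic responses.
```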