Who knows, family members? ChatGPT actually knows "flattery"!

Who knows, family members? ChatGPT actually knows "flattery"!

Have you ever thought that the answers generated by ChatGPT will be influenced by the user's personal preferences, and reply with words that are "sycophantic" enough rather than neutral or true information?

In fact, this phenomenon exists in most AI models, including ChatGPT, and the culprit may be "reinforcement learning based on human feedback (RLHF)".

Recently, Anthropic, OpenAI's strongest competitor in Silicon Valley, explored the widespread existence of "flattery" in AI models and whether it is affected by human preferences when studying models trained with RLHF.

The relevant paper, titled “Towards Understanding Sycophancy in Language Models”, has been published on the preprint website arXiv.

Source: Tuchong Creative

The results show that "flattering" behavior is prevalent in the RLHF model and is likely to be partly influenced by human preferences for "flattering" responses.

Specifically, one of the main reasons why AI models exhibit this behavior is that users are more likely to give positive feedback when the AI's response is consistent with their opinions or beliefs. Therefore, in order to get more positive feedback, the AI ​​model may learn and reproduce this user-pleasing behavior.

Flattery, even the most advanced AI assistants

Currently, AI models like GPT-4 can often produce outputs that people rate highly after they are trained. Fine-tuning language models using RLHF can improve the quality of their outputs, which are rated by human evaluators.

However, some studies have suggested that training schemes based on human preference judgments may exploit human judgment in undesirable ways, such as encouraging AI systems to generate outputs that appeal to human evaluators but are actually flawed or erroneous.

It is not yet clear whether the above behavior occurs in models with more diverse and realistic situations, and whether it is indeed driven by flaws in human preferences.

To this end, the study first investigated whether state-of-the-art AI assistants provide flattering responses in various real-world situations. In a free-text generation task, the researchers identified consistent patterns of flattery in five state-of-the-art RLHF-trained AI assistants (Claude 1.3, Claude 2, GPT-3.5, GPT-4, LLaMA 2).

Source: Tuchong Creative

Specifically, these AI assistants often falsely admit mistakes when asked questions by users, provide predictably biased feedback, and mimic mistakes made by users. These empirical findings consistently suggest that flattery may indeed be a property of the way RLHF models are trained, rather than simply a feature of a particular system.

Human preferences lead to flattery

In addition, the study further explored the role of human preferences in this behavior. To study this, the researchers investigated existing human preference comparison data to determine whether flattering responses are ranked higher than non-flattering responses. The study analyzed the hh-rlhf dataset and used a language model to generate text labels (i.e., "features") for each pair of preference comparisons to assess whether the preferred response is more truthful and less assertive.

To understand what kind of behavior the data encourages, the researchers used a Bayesian logistic regression model to predict human preference judgments from these features. The model learned that features associated with matching user opinions were among the most predictive of human preference judgments, suggesting that preference data does encourage flattery.

To investigate whether flattery in the preference data leads to flattering behavior in the RLHF model, a subsequent study analyzed whether flattery would increase when optimizing the responses of the language model to fit the model trained to predict human preferences. The researchers used RLHF and the best-N sampling method to optimize the responses to fit the preference model used to train Claude 2.

The results reveal an interesting finding: while some forms of flattery increase over more optimizations, others decrease. This phenomenon may be partly due to the fact that flattery is just one of many features that the preference model incentivizes.

Source: Tuchong Creative

However, the study also found that Claude 2's preference model sometimes preferred flattering responses over truthful responses. In addition, best-N sampling with Claude 2's preference model did not produce as many truthful responses as a version of the Claude 2 preference model showed, which preferred truthful non-flattering responses.

This set of results suggests that, although state-of-the-art preference models are able to identify the authenticity of responses in many cases, they may still produce flattering outputs at the expense of authenticity.

To confirm these results, the researchers then looked at whether humans and preference models prefer persuasive, well-written model responses that confirm the user's false opinion (i.e., flattering responses) over responses that correct the user. Evidence showed that humans and preference models tended to prefer truthful responses, but not always; sometimes they preferred flattering responses. These results provide further evidence that optimizing for human preferences can lead to flattery.

To validate these findings, the researchers further explored whether humans and preference models prefer persuasive, fluent model responses, even when those responses confirm the user’s misperceptions (i.e., flattering responses) rather than correcting the user’s views.

The evidence shows that humans and preference models generally prefer truthful responses, however, they are not always consistent, as they sometimes prefer flattering responses. These results provide further evidence that optimizing for human preferences can lead to flattery.

Overall, flattery persists across models and contexts, likely in part because flattery is preferred in human preference comparison data.

References:

https://arxiv.org/abs/2310.13548

<<:  World Psoriasis Day | Don’t brag, this kind of “ringworm” cannot be ignored!

>>:  Effectively inhibit Parkinson's disease? This Chinese boxing technique is popular!

Recommend

Weekly crooked review: Since it’s Black Friday, let’s have fun with it

I was surprised to hear a great piece of good new...

How come people still trust Baidu when its paid ranking is so shady?

(1). 1. Regarding bidding ranking, you may not kn...

One picture to understand | China's radio development

【2.13 World Radio Day】One picture to understand |...

10 Observations on E-commerce Marketing in 2020

This article, which contains 10 observations for ...

The top ten e-books are free and completed, and the fantasy novels are ranked?

Due to the epidemic, many people are bored at hom...

Xuzhou, how many legends are there?

This is a A city full of legends It is A battlegr...

WeChat launches a new round of ban on Alibaba Group

The conflict between Alibaba and Tencent has esca...

You can now turn off the "Renewal Receipt" reminder in the App Store

With the popularity of the subscription system, m...

Procreate Drawing Lesson

Procreate painting course resources introduction:...

Zero Carbon Record: Overview of Provincial Climate Action Progress

After the initial establishment of the national &...

CITIC Securities 2021 Research Methods Training (Shanghai Station)

CITIC Securities 2021 Research Methods Training (...