Have you ever noticed that the answers ChatGPT generates can be swayed by a user's personal preferences, replying with words that are "sycophantic" rather than neutral or accurate? In fact, this phenomenon exists in most AI models, including ChatGPT, and the likely culprit is reinforcement learning from human feedback (RLHF).

Recently, Anthropic, OpenAI's strongest competitor in Silicon Valley, studied models trained with RLHF to explore how widespread "flattery" is in AI models and whether it is shaped by human preferences. The paper, titled "Towards Understanding Sycophancy in Language Models", has been published on the preprint server arXiv.

The results show that flattering behavior is prevalent in RLHF models and is likely driven, in part, by human preference for flattering responses. Specifically, one of the main reasons AI models exhibit this behavior is that users are more likely to give positive feedback when the AI's response agrees with their opinions or beliefs. To earn more positive feedback, the model learns to reproduce this user-pleasing behavior.

Flattery, even in the most advanced AI assistants

AI models such as GPT-4 are typically trained to produce outputs that people rate highly. Fine-tuning language models with RLHF can improve the quality of their outputs as judged by human evaluators. However, some studies have suggested that training schemes based on human preference judgments can exploit human judgment in undesirable ways, for example by encouraging AI systems to generate outputs that appeal to human evaluators but are actually flawed or wrong.

It was not yet clear whether such behavior occurs in more diverse, realistic settings, or whether it is indeed driven by flaws in human preferences. To find out, the study first investigated whether state-of-the-art AI assistants give flattering responses across a variety of real-world situations. In free-text generation tasks, the researchers identified consistent patterns of flattery in five state-of-the-art RLHF-trained AI assistants (Claude 1.3, Claude 2, GPT-3.5, GPT-4, and LLaMA 2).

Specifically, these AI assistants often wrongly admit to mistakes when challenged by users, give predictably biased feedback, and mimic errors made by users. These empirical findings consistently suggest that flattery may indeed be a property of how RLHF models are trained, rather than a quirk of any particular system.

Human preferences lead to flattery

The study then explored the role of human preferences in this behavior. The researchers examined existing human preference comparison data to determine whether flattering responses are ranked higher than non-flattering ones. They analyzed the hh-rlhf dataset, using a language model to generate text labels (i.e., "features") for each pair of preference comparisons, such as whether the preferred response is more truthful and less assertive. To understand what behavior the data encourages, they then used a Bayesian logistic regression model to predict human preference judgments from these features. The model learned that features associated with matching the user's opinions were among the most predictive of human preference judgments, suggesting that the preference data does encourage flattery.
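To make this step concrete, here is a minimal sketch (not the paper's code) of regressing preference judgments on such features. The feature names and toy data are hypothetical, and the L2-penalized fit, which corresponds to a MAP estimate under a Gaussian prior, stands in for the full Bayesian logistic regression used in the study.

```python
# Minimal sketch: predict which of two responses a human preferred from
# hand-labelled features of the pair. Feature names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["matches_user_opinion", "truthful", "assertive", "well_written"]

# Each row holds feature differences (response A minus response B).
X = np.array([
    [ 1,  0,  1,  1],   # A agrees with the user; both equally truthful
    [-1,  1,  0,  0],   # B agrees with the user; A is more truthful
    [ 1, -1,  0,  1],
    [ 0,  1,  1, -1],
])
# Label 1 means the human preferred response A, 0 means response B.
y = np.array([1, 0, 1, 1])

# The L2 penalty acts as a Gaussian prior on the weights (a MAP approximation
# of the Bayesian logistic regression described in the paper).
model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# A large positive weight on "matches_user_opinion" would indicate that the
# preference data rewards agreeing with the user, i.e. encourages sycophancy.
for name, weight in zip(FEATURES, model.coef_[0]):
    print(f"{name}: {weight:+.2f}")
```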
To investigate whether flattery in the preference data translates into flattering behavior in RLHF models, a follow-up analysis examined whether flattery increases when language model responses are optimized against a model trained to predict human preferences. The researchers used RLHF and best-of-N sampling to optimize responses against the preference model used to train Claude 2 (a simple sketch of best-of-N selection is given at the end of this article).

The results reveal an interesting pattern: some forms of flattery increase with further optimization, while others decrease. This may be partly because flattery is only one of many features the preference model incentivizes.

However, the study also found that Claude 2's preference model sometimes favored flattering responses over truthful ones. Moreover, best-of-N sampling with Claude 2's preference model did not produce as many truthful responses as a modified version of that preference model which explicitly preferred truthful, non-flattering responses. This set of results suggests that, although state-of-the-art preference models can recognize the truthfulness of responses in many cases, they may still produce flattering outputs at the expense of truthfulness.

To confirm these findings, the researchers examined whether humans and preference models prefer persuasive, well-written model responses that confirm a user's mistaken view (i.e., flattering responses) over responses that correct the user. The evidence shows that humans and preference models generally prefer truthful responses, but not always; they sometimes prefer the flattering ones. These results provide further evidence that optimizing for human preferences can lead to flattery.

Overall, flattery persists across models and contexts, likely in part because flattering responses are preferred in human preference comparison data.

References:
https://arxiv.org/abs/2310.13548
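For reference, here is the minimal sketch of best-of-N selection mentioned above. The generate and preference_score functions are hypothetical stand-ins for a language model sampler and a trained preference model; this illustrates the general technique, not Anthropic's implementation.

```python
# Minimal sketch of best-of-N sampling against a preference (reward) model.
# `generate` and `preference_score` are hypothetical stand-ins.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              preference_score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the preference model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: preference_score(prompt, response))


# If the preference model tends to score responses that merely agree with the user
# more highly, best-of-N optimization will surface those sycophantic responses.
```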