Is ChatGPT's core technology going to be replaced?

A technique comparable to reinforcement learning from human feedback (RLHF) has emerged.

Recently, researchers from Google Research proposed reinforcement learning from AI feedback (RLAIF), a technique that achieves performance comparable to RLHF and offers a potential solution to the scalability limitations of reinforcement learning from human feedback (RLHF).

The related paper, titled “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback”, has been published on the preprint website arXiv.

RLHF: prone to inaccurate or harmful behavior

RLHF is a method for fine-tuning pre-trained large language models (LLMs) using human guidance. It consists of three interrelated processes: feedback collection, reward modeling, and policy optimization.

Among these, feedback collection gathers human evaluations of LLM outputs. This feedback data is then used to train a reward model through supervised learning; the reward model is designed to approximate human preferences. Finally, the policy optimization process uses a reinforcement learning loop to optimize the LLM to produce outputs that the reward model scores favorably. These steps can be performed iteratively or simultaneously.
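For readers unfamiliar with the reward-modeling step, here is a minimal, hypothetical sketch in PyTorch of how a reward model can be trained on pairwise preference data with a Bradley-Terry-style loss. The shapes, names, and use of pre-pooled embeddings are illustrative assumptions, not the setup of any particular system.

```python
# Minimal sketch of the reward-modeling step in RLHF (hypothetical shapes/names).
# A reward model r(x, y) is trained so that the preferred response scores higher
# than the rejected one, via the Bradley-Terry pairwise loss:
#   loss = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected)))
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in reward model: maps a pooled text embedding to a scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(pooled_embedding).squeeze(-1)  # shape: (batch,)

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the chosen reward above the rejected reward."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training step on random "embeddings" standing in for encoded (text, summary) pairs.
rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-4)
emb_chosen, emb_rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = pairwise_preference_loss(rm(emb_chosen), rm(emb_rejected))
loss.backward()
optimizer.step()
```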

The key advantages of RLHF over traditional RL methods are closer alignment with human intentions, the ability to plan in anticipation of future feedback, flexible learning from various types of feedback, and the ability to collect feedback as needed, all of which are indispensable for creating truly intelligent agents.

Additionally, RLHF allows machines to learn an abstraction of human values rather than simply imitating human behavior, making agents more adaptable, more interpretable, and more reliable in decision making.

Currently, RLHF is widely used in fields such as business, education, healthcare, and entertainment, and underpins systems including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude.

However, AI models trained with RLHF can still behave inaccurately or harmfully. Moreover, collecting human preference data as feedback is costly, and disagreements between human annotators introduce variance into the training data, which can cause confusion in situations where the ground truth is ambiguous (e.g., moral dilemmas). In addition, human feedback in RLHF is often restricted to preference rankings that convey limited information, which limits its applicability.

RLAIF vs. RLHF

In this work, RLAIF showed the potential to address these problems with RLHF.

The researchers used an off-the-shelf LLM to annotate preferences between pairs of candidates; the model was pre-trained or instruction-tuned for general use but not fine-tuned for the specific downstream task.

Given a text and two candidate summaries, the LLM is asked to evaluate which summary is better. The input has the following structure (a sketch of how such a prompt might be assembled follows the list):

1. Preamble – instructions that introduce and describe the task at hand;

2. Few-shot exemplars – an example text, a pair of summaries, a chain-of-thought (CoT) rationale, and a preference judgment;

3. Sample to annotate – a text and a pair of summaries to be labeled;

4. Ending – a closing string that marks the end of the prompt and cues the LLM to output its answer.
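To make this four-part structure concrete, here is a hedged sketch of how such a labeling prompt might be assembled. The preamble and ending strings below paraphrase the idea and are not the paper's exact wording; all names are illustrative.

```python
# Hedged sketch of assembling the four-part labeling prompt described above.
# The exact preamble/ending wording in the paper may differ; names are illustrative.
PREAMBLE = (
    "A good summary is a short piece of text that captures the essence of the "
    "original. Given a piece of text and two candidate summaries, output 1 or 2 "
    "to indicate which summary is better."
)

EXEMPLAR_TEMPLATE = (
    "Text - {text}\n"
    "Summary 1 - {summary_1}\n"
    "Summary 2 - {summary_2}\n"
    "Rationale - {chain_of_thought}\n"
    "Preferred Summary={label}\n\n"
)

def build_labeling_prompt(text: str, summary_1: str, summary_2: str,
                          exemplars: list[dict] | None = None) -> str:
    """Concatenate preamble, optional few-shot exemplars, the sample to annotate,
    and a closing string that cues the LLM to emit '1' or '2'."""
    shots = "".join(EXEMPLAR_TEMPLATE.format(**ex) for ex in (exemplars or []))
    sample = f"Text - {text}\nSummary 1 - {summary_1}\nSummary 2 - {summary_2}\n"
    ending = "Preferred Summary="  # the model's next token is scored as '1' or '2'
    return f"{PREAMBLE}\n\n{shots}{sample}{ending}"
```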

After feeding the input to the LLM, the researchers obtained the log-probabilities of generating the tokens “1” and “2” and applied a softmax to obtain a preference distribution.
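In code, converting the two token log-probabilities into a soft preference label is simply a two-way softmax, sketched below.

```python
import math

def preference_distribution(logprob_token_1: float, logprob_token_2: float):
    """Softmax over the log-probabilities of generating '1' and '2'
    to obtain a soft preference distribution over the two summaries."""
    m = max(logprob_token_1, logprob_token_2)  # subtract max for numerical stability
    p1 = math.exp(logprob_token_1 - m)
    p2 = math.exp(logprob_token_2 - m)
    total = p1 + p2
    return p1 / total, p2 / total

# e.g. log-probs of -0.3 for "1" and -1.5 for "2" yield roughly (0.77, 0.23)
print(preference_distribution(-0.3, -1.5))
```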

They experimented with two types of preambles. In the “Base” setting, they simply asked “Which summary is better?”, while in the “OpenAI” setting, they mimicked the rating instructions given to human preference annotators in OpenAI's TL;DR summarization work, which contain detailed guidance on what makes a strong summary.

In addition, they conducted in-context learning experiments, adding a few manually selected exemplars covering different topics to provide more context.

After the LLM labeled preferences, the researchers trained a reward model (RM) on these labels to predict preferences. Three metrics were then used for evaluation: AI labeler alignment, pairwise accuracy, and win rate.
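As a rough illustration of these three quantities (assuming preferences are encoded as the index, 0 or 1, of the preferred candidate, and ignoring details such as tie handling, which the paper may treat differently), they can be computed along these lines:

```python
from typing import Sequence

def ai_labeler_alignment(ai_soft_labels: Sequence[Sequence[float]],
                         human_labels: Sequence[int]) -> float:
    """Fraction of examples where the AI's argmax preference matches the human label."""
    hits = sum(int(max(range(2), key=lambda i: p[i]) == h)
               for p, h in zip(ai_soft_labels, human_labels))
    return hits / len(human_labels)

def pairwise_accuracy(rm_scores_a: Sequence[float], rm_scores_b: Sequence[float],
                      human_labels: Sequence[int]) -> float:
    """Fraction of held-out pairs where the RM scores the human-preferred side higher."""
    hits = sum(int((sa > sb) == (h == 0))
               for sa, sb, h in zip(rm_scores_a, rm_scores_b, human_labels))
    return hits / len(human_labels)

def win_rate(judgments: Sequence[int]) -> float:
    """Share of head-to-head comparisons won by the policy under evaluation (1 = win)."""
    return sum(judgments) / len(judgments)
```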

Experimental results show that RLAIF can be a viable alternative to RLHF that does not rely on human annotators. In human evaluation, RLAIF is preferred over the baseline supervised fine-tuned (SFT) policy 71% of the time, while RLHF is preferred over the same baseline 73% of the time.

In addition, the study directly compared the win rates of RLAIF and RLHF under human preference, and the results showed that the two were equally preferred by human evaluators. The study also compared RLAIF and RLHF summaries with human-written reference summaries: the RLAIF summary outperformed the reference in 79% of cases, while the RLHF summary outperformed the reference in 80% of cases.

However, while this work highlights the potential of RLAIF, it has some limitations.

First, the study focused only on the summarization task, so its generalization to other tasks is unclear. Second, the study did not fully evaluate the cost-effectiveness of LLM inference compared to manual annotation. There are also many interesting open questions, such as whether combining RLHF with RLAIF can surpass either method alone, how effective it is to use an LLM to assign rewards directly, whether improving AI labeler alignment translates into an improved final policy, and whether using an LLM labeler of the same size as the policy model can further improve the policy.

This study undeniably lays a solid foundation for further research on RLAIF, and more notable results in this area can be expected.

Reference Links:

https://arxiv.org/abs/2309.00267

https://bdtechtalks.com/2023/09/04/rlhf-limitations/

Author: Yan Yimi

Editor: Academic
