Is ChatGPT's core technology going to be replaced?

A technique comparable to reinforcement learning from human feedback (RLHF) has emerged.

Recently, researchers from Google Research proposed reinforcement learning from AI feedback (RLAIF), a technique that achieves performance comparable to RLHF and offers a potential solution to the scalability limitations of reinforcement learning from human feedback (RLHF).

The related paper, titled “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback”, has been published on the preprint website arXiv.

RLHF: prone to inaccurate or harmful behavior

RLHF is a method for fine-tuning pre-trained large language models (LLMs) using human guidance. It consists of three interrelated processes: feedback collection, reward modeling, and policy optimization.

Among these, feedback collection gathers human evaluations of LLM outputs. This feedback data is then used to train a reward model through supervised learning; the reward model is designed to approximate human preferences. Finally, the policy optimization process uses a reinforcement learning loop to optimize the LLM to produce outputs that the reward model scores favorably. These steps can be performed iteratively or simultaneously.
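For readers unfamiliar with the reward-modeling step, here is a minimal, hypothetical sketch in PyTorch of how a reward model can be trained on pairwise preference data with a Bradley-Terry-style loss. The shapes, names, and use of pre-pooled embeddings are illustrative assumptions, not the setup of any particular system.

```python
# Minimal sketch of the reward-modeling step in RLHF (hypothetical shapes/names).
# A reward model r(x, y) is trained so that the preferred response scores higher
# than the rejected one, via the Bradley-Terry pairwise loss:
#   loss = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected)))
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in reward model: maps a pooled text embedding to a scalar reward."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(pooled_embedding).squeeze(-1)  # shape: (batch,)

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the chosen reward above the rejected reward."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training step on random "embeddings" standing in for encoded (text, summary) pairs.
rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-4)
emb_chosen, emb_rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = pairwise_preference_loss(rm(emb_chosen), rm(emb_rejected))
loss.backward()
optimizer.step()
```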

The key advantages of RLHF over traditional RL methods are closer alignment with human intentions, the ability to plan in anticipation of future feedback, flexible learning from various types of feedback, and the ability to collect feedback as needed, all of which are indispensable for creating truly intelligent agents.

Additionally, RLHF allows machines to learn an abstraction of human values rather than simply imitating human behavior, making agents more adaptable, more interpretable, and more reliable in decision making.

Currently, RLHF is widely used in fields such as business, education, healthcare, and entertainment, and underpins systems including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude.

However, AI models trained with RLHF can still behave inaccurately or harmfully. Moreover, collecting human preference data as feedback is costly, and disagreements between human annotators introduce variance into the training data, which can cause confusion in situations where the ground truth is ambiguous (e.g., moral dilemmas). In addition, human feedback in RLHF is often restricted to preference rankings that convey limited information, which limits its applicability.

RLAIF vs. RLHF

In this work, RLAIF showed the potential to address these problems with RLHF.

The researchers used an off-the-shelf LLM to annotate preferences between pairs of candidates; the model was pre-trained or instruction-tuned for general use but not fine-tuned for the specific downstream task.

Given a text and two candidate summaries, the LLM is asked to evaluate which summary is better. The input has the following structure (a sketch of how such a prompt might be assembled follows the list):

1. Preamble – instructions that introduce and describe the task at hand;

2. Few-shot exemplars – an example text, a pair of summaries, a chain-of-thought (CoT) rationale, and a preference judgment;

3. Sample to annotate – a text and a pair of summaries to be labeled;

4. Ending – a closing string that marks the end of the prompt and cues the LLM to output its answer.
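To make this four-part structure concrete, here is a hedged sketch of how such a labeling prompt might be assembled. The preamble and ending strings below paraphrase the idea and are not the paper's exact wording; all names are illustrative.

```python
# Hedged sketch of assembling the four-part labeling prompt described above.
# The exact preamble/ending wording in the paper may differ; names are illustrative.
PREAMBLE = (
    "A good summary is a short piece of text that captures the essence of the "
    "original. Given a piece of text and two candidate summaries, output 1 or 2 "
    "to indicate which summary is better."
)

EXEMPLAR_TEMPLATE = (
    "Text - {text}\n"
    "Summary 1 - {summary_1}\n"
    "Summary 2 - {summary_2}\n"
    "Rationale - {chain_of_thought}\n"
    "Preferred Summary={label}\n\n"
)

def build_labeling_prompt(text: str, summary_1: str, summary_2: str,
                          exemplars: list[dict] | None = None) -> str:
    """Concatenate preamble, optional few-shot exemplars, the sample to annotate,
    and a closing string that cues the LLM to emit '1' or '2'."""
    shots = "".join(EXEMPLAR_TEMPLATE.format(**ex) for ex in (exemplars or []))
    sample = f"Text - {text}\nSummary 1 - {summary_1}\nSummary 2 - {summary_2}\n"
    ending = "Preferred Summary="  # the model's next token is scored as '1' or '2'
    return f"{PREAMBLE}\n\n{shots}{sample}{ending}"
```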

After feeding the input to the LLM, the researchers obtained the log-probabilities of generating the tokens “1” and “2” and applied a softmax to obtain a preference distribution.
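In code, converting the two token log-probabilities into a soft preference label is simply a two-way softmax, sketched below.

```python
import math

def preference_distribution(logprob_token_1: float, logprob_token_2: float):
    """Softmax over the log-probabilities of generating '1' and '2'
    to obtain a soft preference distribution over the two summaries."""
    m = max(logprob_token_1, logprob_token_2)  # subtract max for numerical stability
    p1 = math.exp(logprob_token_1 - m)
    p2 = math.exp(logprob_token_2 - m)
    total = p1 + p2
    return p1 / total, p2 / total

# e.g. log-probs of -0.3 for "1" and -1.5 for "2" yield roughly (0.77, 0.23)
print(preference_distribution(-0.3, -1.5))
```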

They experimented with two types of preambles. In the “Base” setting, they simply asked “Which summary is better?”, while in the “OpenAI” setting, they mimicked the rating instructions given to human preference annotators in OpenAI's TL;DR summarization work, which contain detailed guidance on what makes a strong summary.

In addition, they conducted in-context learning experiments, adding a few manually selected exemplars covering different topics to provide more context.

After the LLM labeled preferences, the researchers trained a reward model (RM) on these labels to predict preferences. Three metrics were then used for evaluation: AI labeler alignment, pairwise accuracy, and win rate.
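As a rough illustration of these three quantities (assuming preferences are encoded as the index, 0 or 1, of the preferred candidate, and ignoring details such as tie handling, which the paper may treat differently), they can be computed along these lines:

```python
from typing import Sequence

def ai_labeler_alignment(ai_soft_labels: Sequence[Sequence[float]],
                         human_labels: Sequence[int]) -> float:
    """Fraction of examples where the AI's argmax preference matches the human label."""
    hits = sum(int(max(range(2), key=lambda i: p[i]) == h)
               for p, h in zip(ai_soft_labels, human_labels))
    return hits / len(human_labels)

def pairwise_accuracy(rm_scores_a: Sequence[float], rm_scores_b: Sequence[float],
                      human_labels: Sequence[int]) -> float:
    """Fraction of held-out pairs where the RM scores the human-preferred side higher."""
    hits = sum(int((sa > sb) == (h == 0))
               for sa, sb, h in zip(rm_scores_a, rm_scores_b, human_labels))
    return hits / len(human_labels)

def win_rate(judgments: Sequence[int]) -> float:
    """Share of head-to-head comparisons won by the policy under evaluation (1 = win)."""
    return sum(judgments) / len(judgments)
```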

Experimental results show that RLAIF can be a viable alternative to RLHF that does not rely on human annotators. In human evaluation, RLAIF is preferred over the baseline supervised fine-tuned (SFT) policy 71% of the time, while RLHF is preferred over the same baseline 73% of the time.

In addition, the study directly compared the win rates of RLAIF and RLHF under human preference, and the results showed that the two were equally preferred by human evaluators. The study also compared RLAIF and RLHF summaries with human-written reference summaries: the RLAIF summary outperformed the reference in 79% of cases, while the RLHF summary outperformed the reference in 80% of cases.

However, while this work highlights the potential of RLAIF, it has some limitations.

First, the study focused only on the summarization task, so its generalization to other tasks is unclear. Second, the study did not fully evaluate the cost-effectiveness of LLM inference compared to manual annotation. There are also many interesting open questions, such as whether combining RLHF with RLAIF can surpass either method alone, how effective it is to use an LLM to assign rewards directly, whether improving AI labeler alignment translates into an improved final policy, and whether using an LLM labeler of the same size as the policy model can further improve the policy.

This study undeniably lays a solid foundation for further research on RLAIF, and more notable results in this area can be expected.

Reference Links:

https://arxiv.org/abs/2309.00267

https://bdtechtalks.com/2023/09/04/rlhf-limitations/

Author: Yan Yimi

Editor: Academic
