AI training AI? Maybe it will become dumber

Written by Ma Xuewei | Edited by Paige

Preface

In today's booming large-model industry, scaling laws have so far continued to hold.

The question is: once the high-quality data generated by humans (books, articles, photos, videos, and so on) is exhausted, how will large-model training proceed?

One widely anticipated approach is to train a large model on data that the model itself generates. In fact, if the training data for subsequent models continues to be scraped from the web, it will inevitably include data generated by earlier models.

However, a research team from the University of Oxford and the University of Cambridge, together with their collaborators, has poured cold water on this idea.

They conclude that when a model is trained on content it generated itself, irreversible defects emerge: the model gradually forgets the true data distribution, and its performance declines.

That is, "Model Collapse" .

The research paper, titled "AI models collapse when trained on recursively generated data", has been published in the journal Nature.

They also note, however, that it is not impossible to train a new model on data generated by an older model, provided the data is strictly filtered.

In an accompanying news and views article, Emily Wenger of Duke University pointed out that the authors did not consider what happens when a model is trained on data generated by other models; they focused on the outcome of a model trained on its own output. Whether a model collapses when trained on the output of other models remains to be seen, so a further challenge will be to work out the mechanism by which model collapse occurs.

What is model collapse?

Essentially, “model collapse” occurs when data generated by a large model ends up contaminating the training set of subsequent models.

Small models such as GMMs and VAEs are typically trained from scratch, whereas LLMs are very expensive to retrain; they are therefore usually initialized from models pre-trained on large text corpora, such as BERT, RoBERTa, or GPT-2, and then fine-tuned for various downstream tasks.

So what happens when a language model is in turn fine-tuned using data generated by other models?

To find out, the research team ran experiments with the OPT-125m language model: they fine-tuned it on the wikitext2 dataset and then trained each subsequent generation on data generated by the previous one. The results show that model collapse occurs regardless of whether part of the original data is retained. As the number of generations increases, low-perplexity samples accumulate in the model's output, indicating that the model is beginning to forget the tail events of the real data distribution. Compared with the original model, later generations also perform worse, which shows up as an increase in perplexity, and the data they generate contains a large number of repeated phrases.

Figure | Example of text output for an OPT-125m model affected by model collapse - the model degrades between generations.
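This recursive setup can be sketched in a few lines of Python. The snippet below is not the authors' training code; it is a minimal illustration of the loop using Hugging Face Transformers, and the sample counts, sequence lengths, generation settings, and number of generations are all illustrative assumptions.

```python
# Minimal sketch of the recursive fine-tuning loop (not the paper's exact code).
# Each generation starts again from the pre-trained OPT-125m checkpoint and is
# fine-tuned only on text generated by the previous generation.
import torch
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def fine_tune(texts, output_dir):
    """Fine-tune a fresh OPT-125m on a list of raw text strings."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=8, report_to=[])
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
    return model

def generate_corpus(model, prompts, max_new_tokens=64):
    """Sample one continuation per prompt; this becomes the next training set."""
    model.eval()
    texts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, top_p=0.9)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

# Generation 0 is trained on real wikitext-2 text ...
real_texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                      split="train")["text"] if t.strip()][:2000]
model = fine_tune(real_texts, "gen0")

# ... and generations 1..3 are trained only on their predecessor's output.
corpus = real_texts
for gen in range(1, 4):
    prompts = [t[:64] for t in corpus]        # short prefixes as prompts
    corpus = generate_corpus(model, prompts)  # synthetic data from the previous model
    model = fine_tune(corpus, f"gen{gen}")
```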

Imagine a generative AI model tasked with producing images of dogs. The model will tend to reproduce the most common breeds in its training data, so it might overrepresent Golden Retrievers and underrepresent French Bulldogs. The problem is exacerbated when a subsequent model is trained on an AI-generated dataset in which Golden Retrievers are already overrepresented. After enough rounds of this, the model forgets that less popular breeds like French Bulldogs exist and generates only images of Golden Retrievers. Eventually, the model collapses and can no longer generate meaningful content (a toy simulation of this effect follows the figure below).

Figure | The model will gradually ignore uncommon elements in the training data.
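To make the intuition concrete, here is a toy numerical sketch of the dog-breed example. The breed names and probabilities are invented; each "generation" simply re-estimates the breed distribution from a finite sample drawn from the previous generation's estimate.

```python
# Toy sketch of the dog-breed example: each "generation" re-estimates the breed
# distribution from a finite sample drawn from the previous generation's estimate.
import numpy as np

rng = np.random.default_rng(0)
breeds = ["golden_retriever", "labrador", "poodle", "beagle", "french_bulldog"]
probs = np.array([0.45, 0.30, 0.15, 0.08, 0.02])  # "true" distribution, generation 0
n_samples = 50                                    # finite training set per generation

for gen in range(1, 21):
    draws = rng.choice(len(breeds), size=n_samples, p=probs)      # training data
    probs = np.bincount(draws, minlength=len(breeds)) / n_samples # next model's estimate
    if gen % 5 == 0:
        print(f"gen {gen:2d}:", dict(zip(breeds, np.round(probs, 3))))

# Once a rare breed draws zero samples in some generation, its estimated
# probability becomes exactly 0 and can never recover; the tail is lost for good.
```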

In summary, the model gradually forgets low-probability events that occur in real language, such as rare words or phrases. This causes the generated content to lack diversity and fail to capture the complexity of the real world. The model also increasingly generates content that does not match reality, such as incorrect dates, places, or events, so its output loses credibility and cannot be used for tasks such as reliable information retrieval or knowledge-based question answering. In addition, the model gradually absorbs the biases and discriminatory patterns present in the training data and reflects them in its output.

Why does it happen?

Model collapse is a degradative process in which model-generated content contaminates the training data of the next generation, so the model gradually loses its memory of the true data distribution. It unfolds in two stages, early and late. In the early stage, the model begins to lose information about low-probability events; in the late stage, it converges to a distribution that differs substantially from the original one, typically with significantly reduced variance.

Figure | A high-level description of the feedback mechanism in the learning process.

As the number of generations increases, the model tends to generate samples that the original model would have been more likely to generate. At the same time, the tails of the descendant models' sample distributions grow longer: they start producing samples that the original model would never have generated, i.e. they begin to misperceive reality based on the errors introduced by previous generations. Although a model trained on generated data can still learn part of the original task, it also makes mistakes, as shown by its increasing perplexity.
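Perplexity, the degradation metric cited here, can be measured in the standard way for a causal language model. The snippet below is a generic evaluation sketch, not the paper's evaluation script, and `held_out_paragraphs` is a placeholder for whatever held-out text is used.

```python
# Generic perplexity evaluation for a causal LM. Comparing the gen-0 checkpoint
# with later generations on the same held-out text shows the perplexity increase
# described above.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name_or_path, texts, max_length=512):
    tok = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=max_length).input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted token
        with torch.no_grad():
            # With labels=input_ids the model returns the mean cross-entropy
            # over the shifted target tokens.
            loss = model(ids, labels=ids).loss
        n_pred = ids.size(1) - 1
        total_nll += loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)

# e.g. perplexity("facebook/opt-125m", held_out_paragraphs)
```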

Model collapse is driven mainly by the accumulation of three types of error:

1. Statistical approximation error:

Because the number of samples is finite, the model cannot capture every detail of the true data distribution. Low-probability events (the tail of the distribution) are sampled rarely, so over time they gradually disappear from the data.

As the number of training generations increases, this error accumulates, and the model eventually converges to a distribution very different from the original one, with an almost empty tail and greatly reduced variance. (A minimal numerical sketch of this effect appears after this list.)

2. Functional expressivity error:

Function approximators such as neural networks have limited expressive power and cannot represent every distribution perfectly.

This error biases the model's approximation of the true distribution, for example by assigning high density to regions that actually have low density, or low density to regions that actually have high density.

As with the statistical error, this error accumulates over successive generations of training.

3. Functional approximation error:

Limitations of the learning procedure itself, such as the structural biases of stochastic gradient descent or the choice of objective function, also introduce errors.

This error likewise biases the model's approximation of the true distribution; for example, an overfitted density model may extrapolate incorrectly and assign high density to regions outside the support of the training set.

Again, this error accumulates over generations and pushes the model toward a distribution far from the original one, with an almost empty tail and greatly reduced variance.
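As promised above, here is a minimal numerical sketch of the first of these errors (statistical approximation): repeatedly fitting a one-dimensional Gaussian to a finite sample drawn from the previous generation's fit. The sample size and number of generations below are arbitrary illustrative choices.

```python
# Toy sketch of error (1): each generation fits a 1-D Gaussian to a finite
# sample drawn from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0   # "true" distribution seen by generation 0
n = 20                 # finite training set per generation

for gen in range(1, 101):
    sample = rng.normal(mu, sigma, size=n)   # this generation's training data
    mu, sigma = sample.mean(), sample.std()  # next generation's fitted "model"
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")

# The fitted sigma shrinks on average at every step, so over many generations
# the distribution concentrates around a (drifting) point: tail events vanish
# first, and the variance ends up far below that of the original distribution.
```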

Can it be avoided?

The research team believes that training a model on AI-generated data is not impossible, but the data must be strictly filtered.

First, a certain proportion of the original data, say 10% or 20%, should be retained in every generation's training set. This keeps the model exposed to real-world samples and prevents it from relying entirely on its own output. The original data should also be resampled periodically and added back into the training data, so that the training set stays fresh and reflects the latest changes in the real world.
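A minimal sketch of this mixing step is shown below; the 20% fraction, the function name, and the placeholder corpora are illustrative assumptions rather than the paper's protocol.

```python
# Minimal sketch of the mitigation above: keep a fixed fraction of real,
# human-generated text in every generation's training mix.
import random

def build_training_set(real_texts, synthetic_texts, size, real_fraction=0.2, seed=0):
    """Mix freshly resampled real data with model-generated data."""
    rng = random.Random(seed)
    n_real = int(size * real_fraction)
    mix = rng.sample(real_texts, n_real) + rng.sample(synthetic_texts, size - n_real)
    rng.shuffle(mix)
    return mix

# e.g. train_texts = build_training_set(wikitext_paragraphs, gen_k_outputs, size=10_000)
```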

Second, you can use diverse data. For example, in addition to model-generated content, you should also use data generated by humans as training data. Human data is more authentic and reliable, and can help models better understand the complexity and diversity of the real world. In addition, you can use data generated by other types of machine learning models as training data, such as reinforcement learning models or simulators. This ensures the diversity of training data sources and avoids over-reliance on a single type of model.

Finally, the learning algorithm itself can be improved, for example with more robust training methods such as adversarial training, knowledge distillation, or lifelong learning. These can help the model cope with noise and bias in the training data and improve its generalization ability.

While this warning seems worrisome for both current generative AI technology and the companies seeking to profit from it, it may offer more hope for human content creators in the medium and long term.

In a future world filled with AI tools and the content they generate, human-created content will be more valuable than it is today, if only as a source of raw training data for AI, researchers say.
