AI training AI? Maybe it will become dumber

Written by Ma Xuewei | Edited by Paige

Preface

In today's booming large-model industry, scaling laws have so far continued to hold.

The question is: once the high-quality data generated by humans (books, articles, photos, videos, and so on) is exhausted, how will large-model training proceed?

One widely anticipated approach is to train a large model on data that the model itself generates. In fact, if the training data for subsequent models continues to be scraped from the web, it will inevitably include data generated by earlier models.

However, a research team from the University of Oxford and the University of Cambridge, together with their collaborators, has poured cold water on this idea.

They conclude that when a model is trained on content it generated itself, irreversible defects emerge: the model gradually forgets the true data distribution, and its performance declines.

That is, "Model Collapse" .

The research paper, titled "AI models collapse when trained on recursively generated data", has been published in the journal Nature.

They also note, however, that it is not impossible to train a new model on data generated by an older model, provided the data is strictly filtered.

In an accompanying news and views article, Emily Wenger of Duke University pointed out that the authors did not consider what happens when a model is trained on data generated by other models; they focused on the outcome of a model trained on its own output. Whether a model collapses when trained on the output of other models remains to be seen, so a further challenge will be to work out the mechanism by which model collapse occurs.

What is model collapse?

Essentially, “model collapse” occurs when data generated by a large model ends up contaminating the training set of subsequent models.

Small models such as GMMs and VAEs are typically trained from scratch, whereas LLMs are very expensive to retrain; they are therefore usually initialized from models pre-trained on large text corpora, such as BERT, RoBERTa, or GPT-2, and then fine-tuned for various downstream tasks.

So what happens when a language model is in turn fine-tuned using data generated by other models?

To find out, the research team ran experiments with the OPT-125m language model: they fine-tuned it on the wikitext2 dataset and then trained each subsequent generation on data generated by the previous one. The results show that model collapse occurs regardless of whether part of the original data is retained. As the number of generations increases, low-perplexity samples accumulate in the model's output, indicating that the model is beginning to forget the tail events of the real data distribution. Compared with the original model, later generations also perform worse, which shows up as an increase in perplexity, and the data they generate contains a large number of repeated phrases.

Figure | Example of text output for an OPT-125m model affected by model collapse - the model degrades between generations.
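This recursive setup can be sketched in a few lines of Python. The snippet below is not the authors' training code; it is a minimal illustration of the loop using Hugging Face Transformers, and the sample counts, sequence lengths, generation settings, and number of generations are all illustrative assumptions.

```python
# Minimal sketch of the recursive fine-tuning loop (not the paper's exact code).
# Each generation starts again from the pre-trained OPT-125m checkpoint and is
# fine-tuned only on text generated by the previous generation.
import torch
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def fine_tune(texts, output_dir):
    """Fine-tune a fresh OPT-125m on a list of raw text strings."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=8, report_to=[])
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
    return model

def generate_corpus(model, prompts, max_new_tokens=64):
    """Sample one continuation per prompt; this becomes the next training set."""
    model.eval()
    texts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, top_p=0.9)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

# Generation 0 is trained on real wikitext-2 text ...
real_texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                      split="train")["text"] if t.strip()][:2000]
model = fine_tune(real_texts, "gen0")

# ... and generations 1..3 are trained only on their predecessor's output.
corpus = real_texts
for gen in range(1, 4):
    prompts = [t[:64] for t in corpus]        # short prefixes as prompts
    corpus = generate_corpus(model, prompts)  # synthetic data from the previous model
    model = fine_tune(corpus, f"gen{gen}")
```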

Imagine a generative AI model tasked with producing images of dogs. The model will tend to reproduce the most common breeds in its training data, so it might overrepresent Golden Retrievers and underrepresent French Bulldogs. The problem is exacerbated when a subsequent model is trained on an AI-generated dataset in which Golden Retrievers are already overrepresented. After enough rounds of this, the model forgets that less popular breeds like French Bulldogs exist and generates only images of Golden Retrievers. Eventually, the model collapses and can no longer generate meaningful content (a toy simulation of this effect follows the figure below).

Figure | The model will gradually ignore uncommon elements in the training data.
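To make the intuition concrete, here is a toy numerical sketch of the dog-breed example. The breed names and probabilities are invented; each "generation" simply re-estimates the breed distribution from a finite sample drawn from the previous generation's estimate.

```python
# Toy sketch of the dog-breed example: each "generation" re-estimates the breed
# distribution from a finite sample drawn from the previous generation's estimate.
import numpy as np

rng = np.random.default_rng(0)
breeds = ["golden_retriever", "labrador", "poodle", "beagle", "french_bulldog"]
probs = np.array([0.45, 0.30, 0.15, 0.08, 0.02])  # "true" distribution, generation 0
n_samples = 50                                    # finite training set per generation

for gen in range(1, 21):
    draws = rng.choice(len(breeds), size=n_samples, p=probs)      # training data
    probs = np.bincount(draws, minlength=len(breeds)) / n_samples # next model's estimate
    if gen % 5 == 0:
        print(f"gen {gen:2d}:", dict(zip(breeds, np.round(probs, 3))))

# Once a rare breed draws zero samples in some generation, its estimated
# probability becomes exactly 0 and can never recover; the tail is lost for good.
```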

In summary, the model gradually forgets low-probability events that occur in real language, such as rare words or phrases. This causes the generated content to lack diversity and fail to capture the complexity of the real world. The model also increasingly generates content that does not match reality, such as incorrect dates, places, or events, so its output loses credibility and cannot be used for tasks such as reliable information retrieval or knowledge-based question answering. In addition, the model gradually absorbs the biases and discriminatory patterns present in the training data and reflects them in its output.

Why does it happen?

Model collapse is a degradative process in which model-generated content contaminates the training data of the next generation, so the model gradually loses its memory of the true data distribution. It unfolds in two stages, early and late. In the early stage, the model begins to lose information about low-probability events; in the late stage, it converges to a distribution that differs substantially from the original one, typically with significantly reduced variance.

Figure | A high-level description of the feedback mechanism in the learning process.

As the number of generations increases, the model tends to generate samples that the original model would have been more likely to generate. At the same time, the tails of the descendant models' sample distributions grow longer: they start producing samples that the original model would never have generated, i.e. they begin to misperceive reality based on the errors introduced by previous generations. Although a model trained on generated data can still learn part of the original task, it also makes mistakes, as shown by its increasing perplexity.
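Perplexity, the degradation metric cited here, can be measured in the standard way for a causal language model. The snippet below is a generic evaluation sketch, not the paper's evaluation script, and `held_out_paragraphs` is a placeholder for whatever held-out text is used.

```python
# Generic perplexity evaluation for a causal LM. Comparing the gen-0 checkpoint
# with later generations on the same held-out text shows the perplexity increase
# described above.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name_or_path, texts, max_length=512):
    tok = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=max_length).input_ids
        if ids.size(1) < 2:
            continue  # need at least one predicted token
        with torch.no_grad():
            # With labels=input_ids the model returns the mean cross-entropy
            # over the shifted target tokens.
            loss = model(ids, labels=ids).loss
        n_pred = ids.size(1) - 1
        total_nll += loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)

# e.g. perplexity("facebook/opt-125m", held_out_paragraphs)
```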

Model collapse is driven mainly by the accumulation of three types of error:

1. Statistical approximation error:

Because the number of samples is finite, the model cannot capture every detail of the true data distribution. Low-probability events (the tail of the distribution) are sampled rarely, so over time they gradually disappear from the data.

As the number of training generations increases, this error accumulates, and the model eventually converges to a distribution very different from the original one, with an almost empty tail and greatly reduced variance. (A minimal numerical sketch of this effect appears after this list.)

2. Functional expressivity error:

Function approximators such as neural networks have limited expressive power and cannot represent every distribution perfectly.

This error biases the model's approximation of the true distribution, for example by assigning high density to regions that actually have low density, or low density to regions that actually have high density.

As with the statistical error, this error accumulates over successive generations of training.

3. Functional approximation error:

Limitations of the learning procedure itself, such as the structural biases of stochastic gradient descent or the choice of objective function, also introduce errors.

This error likewise biases the model's approximation of the true distribution; for example, an overfitted density model may extrapolate incorrectly and assign high density to regions outside the support of the training set.

Again, this error accumulates over generations and pushes the model toward a distribution far from the original one, with an almost empty tail and greatly reduced variance.
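As promised above, here is a minimal numerical sketch of the first of these errors (statistical approximation): repeatedly fitting a one-dimensional Gaussian to a finite sample drawn from the previous generation's fit. The sample size and number of generations below are arbitrary illustrative choices.

```python
# Toy sketch of error (1): each generation fits a 1-D Gaussian to a finite
# sample drawn from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0   # "true" distribution seen by generation 0
n = 20                 # finite training set per generation

for gen in range(1, 101):
    sample = rng.normal(mu, sigma, size=n)   # this generation's training data
    mu, sigma = sample.mean(), sample.std()  # next generation's fitted "model"
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")

# The fitted sigma shrinks on average at every step, so over many generations
# the distribution concentrates around a (drifting) point: tail events vanish
# first, and the variance ends up far below that of the original distribution.
```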

Can it be avoided?

The research team believes that training a model on AI-generated data is not impossible, but the data must be strictly filtered.

First, a certain proportion of the original data, say 10% or 20%, should be retained in every generation's training set. This keeps the model exposed to real-world samples and prevents it from relying entirely on its own output. The original data should also be resampled periodically and added back into the training data, so that the training set stays fresh and reflects the latest changes in the real world.
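A minimal sketch of this mixing step is shown below; the 20% fraction, the function name, and the placeholder corpora are illustrative assumptions rather than the paper's protocol.

```python
# Minimal sketch of the mitigation above: keep a fixed fraction of real,
# human-generated text in every generation's training mix.
import random

def build_training_set(real_texts, synthetic_texts, size, real_fraction=0.2, seed=0):
    """Mix freshly resampled real data with model-generated data."""
    rng = random.Random(seed)
    n_real = int(size * real_fraction)
    mix = rng.sample(real_texts, n_real) + rng.sample(synthetic_texts, size - n_real)
    rng.shuffle(mix)
    return mix

# e.g. train_texts = build_training_set(wikitext_paragraphs, gen_k_outputs, size=10_000)
```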

Second, you can use diverse data. For example, in addition to model-generated content, you should also use data generated by humans as training data. Human data is more authentic and reliable, and can help models better understand the complexity and diversity of the real world. In addition, you can use data generated by other types of machine learning models as training data, such as reinforcement learning models or simulators. This ensures the diversity of training data sources and avoids over-reliance on a single type of model.

Finally, the learning algorithm itself can be improved, for example with more robust training methods such as adversarial training, knowledge distillation, or lifelong learning. These can help the model cope with noise and bias in the training data and improve its generalization ability.

While this warning seems worrisome for both current generative AI technology and the companies seeking to profit from it, it may offer more hope for human content creators in the medium and long term.

In a future world filled with AI tools and the content they generate, human-created content will be more valuable than it is today, if only as a source of raw training data for AI, researchers say.
