Is artificial intelligence “abandoning” real datasets?

Artificial intelligence is now applied in many areas of daily life, such as face recognition, voice recognition, and virtual digital humans.

But a common problem remains: to train a machine learning model for a specific task (such as image classification), researchers often need a large amount of training data, and such datasets are not always easy to obtain.

For example, if researchers are training a computer vision model for a self-driving car, the real data may not include samples of a person and their dog running on the highway. If the car ever encounters that situation, the model will not know what to do, which may lead to undesirable consequences.

Moreover, building a dataset from existing data can cost millions of dollars.

Additionally, even the best datasets often contain biases that negatively impact model performance.

Given how expensive datasets are to obtain and use, is it possible to train on synthetic data without sacrificing model performance?

Recently, a study by a research team at the Massachusetts Institute of Technology (MIT) showed that an image classification model trained on synthetic data can rival or even outperform a model trained on real data.

The related research paper is titled "Generative models as a data source for multiview representation learning" and was published as a conference paper at ICLR 2022.

Not inferior to real data

The synthetic data comes from a special kind of machine learning model known as a generative model. Compared with a dataset, a generative model requires far less memory to store or share. Using one may also sidestep some privacy and usage-rights concerns, and it can potentially avoid some of the biases, including racial and gender biases, found in traditional datasets.

According to the paper, during training the generative model is shown millions of images containing specific objects (such as cars or cats). It learns what cars or cats look like and can then generate similar objects.

In simple terms, the researchers used a pre-trained generative model to output a large stream of unique, realistic images based on the images in the model's training dataset.

The researchers say that once a generative model is trained on real data, it can generate synthetic data that is almost indistinguishable from real data.

In addition, a generative model can extrapolate beyond the specific examples in its training data.

If a generative model is trained on images of cars, it can “imagine” what a car looks like in different situations and then output images of cars with different colors, sizes, and states.

One of the many advantages of generative models is that they can theoretically create an infinite number of samples.

Based on this, the researchers tried to figure out how the number of samples affects model performance. The results showed that in some cases, a large number of unique samples does bring additional improvements.
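
The idea of drawing an unlimited number of samples can be illustrated with a minimal Python sketch. This is not the researchers' code: `toy_generator` is a hypothetical stand-in for a pre-trained generative model (a real one would be a large neural network), and `synthetic_stream`, the 8-dimensional latent vector, and the 4x4 image size are all assumptions chosen for illustration. The only point is that each fresh random latent vector yields a new sample on demand, so the training set never has to be stored.

```python
import itertools
import random


def toy_generator(latent):
    """Hypothetical stand-in for a pre-trained generative model.

    Maps a latent vector to a tiny 4x4 grayscale "image" with values in [0, 1).
    """
    size = 4
    return [
        [abs(latent[(r * size + c) % len(latent)]) % 1.0 for c in range(size)]
        for r in range(size)
    ]


def synthetic_stream(latent_dim=8, seed=0):
    """Yield an unbounded stream of synthetic samples, one per latent draw."""
    rng = random.Random(seed)
    while True:
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        yield toy_generator(latent)


# Take as many samples as the training budget allows; nothing is read from disk.
batch = list(itertools.islice(synthetic_stream(), 5))
print(len(batch), len(batch[0]), len(batch[0][0]))  # prints: 5 4 4
```

Because the stream is a generator, a training loop could keep pulling new samples for as long as it runs, which is the property the researchers set out to measure.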

In the researchers' view, the best part is that pre-trained generative models can be found in online repositories and used as-is, delivering good performance without any modification to the model.

But generative models also have some drawbacks. For example, in some cases, they may reveal the source data, posing privacy risks, and if not properly audited, they may amplify biases in the datasets they were trained on.

Is Generative AI the Trend?

The scarcity of usable data and sampling bias have become key bottlenecks in the development of machine learning.

In recent years, generative AI has emerged as a way to address this problem. It has become one of the hottest topics in artificial intelligence, and the industry has high expectations for it.

In late 2021, Gartner released its top strategic technology trends for 2022, calling generative AI "one of the most compelling and powerful artificial intelligence technologies."

According to Gartner, generative AI is expected to account for 10% of all data produced by 2025, up from less than 1% today.

Figure: Gartner's top strategic technology trends for 2022 (Source: Gartner official website)

Generative AI first appeared as an emerging technology in Gartner's "Hype Cycle for Artificial Intelligence, 2020."

In the latest "Hype Cycle for Artificial Intelligence, 2021" report, generative AI is listed as a technology expected to mature in 2 to 5 years.

(Source: Gartner Hype Cycle for Artificial Intelligence, 2021)

The breakthrough of generative AI is that it can learn from existing data (images, text, and so on) and generate new, original data that resembles it. In other words, it can create as well as make judgments, and it can be used for automatic programming, drug development, visual arts, social interaction, business services, and more.

However, generative AI can also be abused for scams, political disinformation, identity fraud, and more. Deepfakes, for example, frequently make negative headlines.

So the question is, if we have a good enough generative model, do we still need a real dataset?

Original links:

https://openreview.net/pdf?id=qhAeZjs7dCL

https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315

https://www.gartner.com/en/documents/4004183

Academic headlines