Is artificial intelligence “abandoning” real datasets?

Artificial intelligence is now applied throughout our daily lives, in face recognition, voice recognition, virtual digital humans, and more.

A common problem, however, is that training a machine learning model to perform a specific task (such as image classification) usually requires a large amount of training data, and such datasets are not always easy to obtain.

For example, if researchers are training a computer vision model for a self-driving car, the real data may not include any sample of a person and their dog running on the highway. If the car ever encounters that situation, the model will not know what to do, which could lead to serious consequences.

Moreover, building a dataset from real data can cost millions of dollars.

Additionally, even the best datasets often contain biases that negatively impact model performance.

Given how expensive real datasets are to obtain and use, is it possible to train on artificially synthesized data without sacrificing model performance?

Recently, a research team at the Massachusetts Institute of Technology (MIT) showed that an image classification model trained on synthetic data can match or even outperform a model trained on real data.

The related research paper is titled "Generative models as a data source for multiview representation learning" and was published as a conference paper at ICLR 2022.

No worse than real data

The synthetic data comes from a type of machine learning model called a generative model. Compared with a dataset, a generative model requires much less memory to store or share. It also avoids some privacy and usage-rights issues, and it can sidestep some of the racial or gender biases found in traditional datasets.

According to the paper, a generative model is first trained on millions of images containing specific objects (such as cars or cats). It learns what a car or a cat looks like and can then generate similar objects.

In simple terms, the researchers used a pre-trained generative model to output a large stream of unique, realistic images based on the images in the model's training dataset.
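The idea of treating a generative model as an endless data source can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: `toy_generator` is a hypothetical stand-in for a real pre-trained model, and the four-value "images" and the Gaussian latent vectors are invented for the example rather than taken from the paper.

```python
import itertools
import random


def toy_generator(latent):
    """Hypothetical stand-in for a pre-trained generative model.

    Maps a latent vector to a fake 4-"pixel" image with values in [0, 1].
    """
    return [min(1.0, max(0.0, 0.5 + 0.25 * z)) for z in latent]


def synthetic_stream(latent_dim=4, seed=0):
    """Yield an endless stream of synthetic images from random latent vectors."""
    rng = random.Random(seed)
    while True:
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        yield toy_generator(latent)


# Draw a small batch; in principle the stream never runs out.
batch = list(itertools.islice(synthetic_stream(), 8))
print(len(batch), len(batch[0]))  # prints "8 4"
```

A training loop could keep pulling batches from such a stream instead of iterating over a fixed dataset stored on disk.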

The researchers say that once a generative model is trained on real data, it can generate synthetic data that is almost indistinguishable from real data.

In addition, a generative model can extrapolate beyond the data it was trained on.

If a generative model is trained on images of cars, it can “imagine” what a car looks like in different situations and then output images of cars with different colors, sizes, and states.
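The article does not say how such variations are produced. One plausible mechanism, shown here purely as an assumption, is to take the latent vector behind one image and nudge it with small random noise, so that each nudged vector yields a slightly different "view" of the same object. The names `make_views` and `toy_generator`, the noise level, and the four-value "images" are all hypothetical.

```python
import random


def make_views(anchor_latent, num_views=2, noise_std=0.2, seed=0):
    """Return latent vectors that are small random perturbations of the anchor."""
    rng = random.Random(seed)
    return [
        [z + rng.gauss(0.0, noise_std) for z in anchor_latent]
        for _ in range(num_views)
    ]


def toy_generator(latent):
    """Hypothetical generator: squashes each latent value into a [0, 1] 'pixel'."""
    return [min(1.0, max(0.0, 0.5 + 0.25 * z)) for z in latent]


# One anchor latent stands for one object; its views are nearby variations.
anchor = [0.3, -1.1, 0.8, 0.0]
views = [toy_generator(v) for v in make_views(anchor, num_views=3)]
for view in views:
    print([round(p, 3) for p in view])
```

A larger `noise_std` would give more varied outputs, at the risk of drifting away from the original object.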

One of the many advantages of generative models is that they can theoretically create an infinite number of samples.

Based on this, the researchers tried to figure out how the number of samples affects model performance. The results showed that in some cases, a large number of unique samples does bring additional improvements.
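To make the question concrete, here is a minimal toy experiment, assuming a hypothetical two-class generator and a nearest-centroid classifier. It is not the paper's setup, and its numbers say nothing about the paper's results. It only shows how one might measure accuracy as the number of generated training samples grows.

```python
import random


def sample(label, rng):
    """Hypothetical generator: one 1-D feature drawn around a class-specific mean."""
    mean = -2.0 if label == 0 else 2.0
    return rng.gauss(mean, 1.0)


def train_centroids(n_per_class, rng):
    """Fit a nearest-centroid classifier on freshly generated samples."""
    return [
        sum(sample(label, rng) for _ in range(n_per_class)) / n_per_class
        for label in (0, 1)
    ]


def accuracy(centroids, n_test, rng):
    """Fraction of generated test points assigned to the closer centroid's class."""
    correct = 0
    for i in range(n_test):
        label = i % 2
        x = sample(label, rng)
        predicted = min((0, 1), key=lambda c: abs(x - centroids[c]))
        correct += predicted == label
    return correct / n_test


rng = random.Random(0)
results = {n: accuracy(train_centroids(n, rng), 2000, rng) for n in (1, 10, 100, 1000)}
for n, acc in results.items():
    print(f"{n:>5} samples per class -> accuracy {acc:.3f}")
```

In a toy like this the gains typically flatten out once the centroids are well estimated, which fits the article's caveat that extra samples help only in some cases.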

In their view, the coolest thing about generative models is that pre-trained ones can be found in online repositories and used as they are, delivering good performance without any intervention in the model.

But generative models also have some drawbacks. For example, in some cases, they may reveal the source data, posing privacy risks, and if not properly audited, they may amplify biases in the datasets they were trained on.

Is Generative AI the Trend?

The scarcity of effective data and sampling bias have become key bottlenecks in the development of machine learning.

In recent years, generative AI, seen as a way to address this problem, has become one of the hottest topics in artificial intelligence, and the industry has high expectations for it.

At the end of last year, Gartner released its top strategic technology trends for 2022, calling generative AI "one of the most compelling and powerful artificial intelligence technologies."

According to Gartner, generative AI is expected to account for 10% of all data produced by 2025, up from less than 1% today.

Figure: Gartner's top strategic technology trends for 2022 (Source: Gartner official website)

In 2020, generative AI first appeared as a new technology hotspot in Gartner's "Hype Cycle for Artificial Intelligence, 2020" report.

In the latest “Hype Cycle for Artificial Intelligence, 2021” report, generative AI appears as a technology expected to mature in two to five years.

(Source: Gartner Hype Cycle for Artificial Intelligence, 2021)

The breakthrough of generative AI is that it can learn from existing data (images, text, etc.) and generate new, original data that resembles it. In other words, it can not only make judgments but also create, with applications in automatic programming, drug development, visual arts, social interaction, business services, and more.

However, generative AI can also be abused for scams, political disinformation, identity fraud, and more. Deepfakes, for example, are a frequent source of negative news.

So the question is, if we have a good enough generative model, do we still need a real dataset?

Original links:

https://openreview.net/pdf?id=qhAeZjs7dCL

https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315

https://www.gartner.com/en/documents/4004183

Academic headlines
