Artificial intelligence is now woven into everyday life: face recognition, speech recognition, virtual digital humans, and more. But a common problem remains: to train a machine learning model for a specific task, such as image classification, researchers typically need large amounts of training data, and suitable datasets are not always easy to obtain. For example, the real-world data used to train a computer vision model for a self-driving car may contain no examples of a person and a dog running across a highway. If the model encounters such a scene, it will not know how to react, with potentially serious consequences. Moreover, building a dataset from existing data can cost millions of dollars, and even the best datasets often contain biases that hurt model performance.

Since collecting and using datasets is so expensive, can artificially synthesized data be used for training without sacrificing model performance?

Recently, a research team from the Massachusetts Institute of Technology (MIT) showed that an image classification model trained on synthetic data can match, or even outperform, a model trained on real data. The related paper, titled "Generative models as a data source for multiview representation learning," was published as a conference paper at ICLR 2022.

No worse than real data

The machine learning model in question is a generative model. Compared with a full dataset, a generative model requires far less memory to store or share. It also sidesteps some privacy and usage-rights issues, and it can avoid some of the biases, including racial and gender biases, present in traditional datasets.
According to the paper, the generative model is first trained on millions of images containing a specific class of object (such as cars or cats); it learns what those objects look like and can then generate similar objects. In short, the researchers used a pre-trained generative model to output a large stream of unique, realistic images based on the images in its training dataset.

The researchers note that once a generative model is trained on real data, it can generate synthetic data that is almost indistinguishable from the real thing. A generative model can also go beyond its training data: a model trained on images of cars can "imagine" what a car looks like in different situations and output images of cars with different colors, sizes, and states.

One major advantage of generative models is that, in theory, they can produce an unlimited number of samples. Building on this, the researchers examined how the number of samples affects model performance, and found that in some cases a larger number of unique samples does yield additional improvements. In their view, the best part is that many generative models are already available in online repositories, and good performance can be obtained without modifying them.

Generative models have drawbacks as well. In some cases they may leak their source data, posing privacy risks, and if not properly audited they may amplify the biases present in the datasets they were trained on.

Is Generative AI the Trend?

The scarcity of useful data and sampling bias have become key bottlenecks in the development of machine learning. In recent years, generative AI has emerged as one of the hottest topics in artificial intelligence in response to this problem, and the industry holds high expectations for it.
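The core idea, sampling an effectively unlimited stream of training data from a model fitted to real data, can be sketched with a deliberately tiny toy. This is only an illustration, not the paper's GAN-based setup: the "generative model" here is just a per-class Gaussian fitted to 2-D points, and all names (`sample_synthetic`, `predict`, etc.) are invented for this sketch. A nearest-centroid classifier is then trained on synthetic samples alone and evaluated on held-out real data.

```python
# Toy sketch (illustrative only): fit a simple per-class Gaussian
# "generative model" to real data, sample synthetic training data
# from it, and train a classifier on the synthetic samples only.
import numpy as np

rng = np.random.default_rng(0)

# --- "Real" dataset: two well-separated classes of 2-D points ---
n_real = 200
real_X = np.concatenate([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n_real, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(n_real, 2)),
])
real_y = np.array([0] * n_real + [1] * n_real)

# --- "Generative model": per-class mean and covariance fitted to real data ---
params = {
    c: (real_X[real_y == c].mean(axis=0), np.cov(real_X[real_y == c].T))
    for c in (0, 1)
}

def sample_synthetic(n_per_class):
    """Draw synthetic (X, y) from the fitted per-class Gaussians.

    Like any generative model, this can emit arbitrarily many samples."""
    xs, ys = [], []
    for c, (mu, cov) in params.items():
        xs.append(rng.multivariate_normal(mu, cov, size=n_per_class))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)

# Sample far more synthetic points than we had real ones
syn_X, syn_y = sample_synthetic(1000)

# --- Train a nearest-centroid classifier on synthetic data only ---
centroids = {c: syn_X[syn_y == c].mean(axis=0) for c in (0, 1)}

def predict(X):
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in (0, 1)])
    return dists.argmin(axis=0)

# --- Evaluate on held-out "real" test data ---
test_X = np.concatenate([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2)),
])
test_y = np.array([0] * 100 + [1] * 100)
acc = (predict(test_X) == test_y).mean()
print(f"accuracy of classifier trained on synthetic data: {acc:.2f}")
```

With well-separated classes like these, the synthetically trained classifier performs essentially as well as one trained on the real points, which is the intuition behind the paper's result; the paper itself uses pre-trained image generative models and contrastive representation learning rather than anything this simple.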
At the end of last year, Gartner listed generative AI among its important strategic technology trends for 2022, calling it "one of the most compelling and powerful artificial intelligence technologies." According to Gartner, generative AI is expected to account for 10% of all generated data by 2025, up from less than 1% today.

Figure | Gartner's important strategic technology trends in 2022 (Source: Gartner official website)

Generative AI was first identified as a new technology hotspot in Gartner's "Hype Cycle for Artificial Intelligence, 2020." In the latest "Hype Cycle for Artificial Intelligence, 2021" report, it appears as a technology expected to mature in 2 to 5 years.

(Source: Gartner Hype Cycle for Artificial Intelligence, 2021)

The breakthrough of generative AI is that it can learn from existing data (images, text, and so on) and generate new, similar, original data. In other words, it can not only make judgments but also create, and it can be applied to automatic programming, drug development, visual arts, social interaction, business services, and more. However, generative AI can also be abused for scams, fraud, political rumors, and identity theft; deepfakes, for example, regularly generate negative headlines.

So the question is: if we have a good enough generative model, do we still need real datasets?

Original links:
https://openreview.net/pdf?id=qhAeZjs7dCL
https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315
https://www.gartner.com/en/documents/4004183