What is the technology behind the popular AI painting? (Part 2)

In the previous issue, we introduced the GAN model and how it can generate realistic images. However, GANs also have serious problems. Because training pits two networks against each other, GANs are notoriously hard to train: the model may stall or collapse partway through learning, with performance falling back to square one. In addition, GANs generalize poorly. To generate a particular type of image, you must first collect a large number of similar real images as training data, which limits the large-scale use of GANs across different scenarios.

Image source: pixabay

1. Pre-trained generative models for ordinary users

A newer class of models has largely solved these problems. In January 2021, the US research lab OpenAI released DALL·E, and it announced DALL·E 2 in April 2022. Unlike a GAN, DALL·E is a pre-trained large model that can also understand human language: a user only needs to type a short piece of text to generate a matching image, with no need to retrain the model on a new dataset each time.

Because no specialist knowledge is needed to train the model, and amazing images can be produced simply by typing text, DALL·E 2 caused a huge sensation on social networks outside China. People fed all kinds of strange text into the model and posted the resulting images, which for a while became an Internet meme.

DALL·E 2 can not only accurately render all kinds of subjects, such as animals, plants, buildings, and people, but also change the painting style on request: from realistic photos to digital art, from oil paintings to simple sketches, from Van Gogh to Andy Warhol, from Chinese ink painting to Japanese ukiyo-e, from wool felt to plasticine. Adding just one or two words describing the style to the input text is enough for DALL·E 2 to produce a picture in that style, as the small example below illustrates.
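For illustration only (the subject line and the style list here are invented examples, not part of DALL·E 2's actual interface), the snippet below shows how a style phrase can be appended to the same subject to yield differently styled prompts:

```python
# Illustrative only: the same subject with different style suffixes
# produces prompts for very different-looking images.
subject = "a lighthouse on a cliff at sunset"
styles = [
    "as a realistic photograph",
    "as an oil painting",
    "in the style of Japanese ukiyo-e",
    "made of plasticine",
]

prompts = [f"{subject}, {style}" for style in styles]
for p in prompts:
    print(p)
```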

What is even more surprising is that DALL·E often understands the connotations of language quite precisely, so even when faced with completely fictional scenes it can generate striking pictures whose internal logic holds together.

2. How was DALL·E 2 trained?

First, OpenAI collected hundreds of millions of images together with their captions and used them to train a model called CLIP.

This model projects both text and images into the same high-dimensional space. If an image and a piece of text correspond to each other, their points in this space end up very close together; if not, they end up far apart. Intuitively, the model captures the shared semantics of human language and images, and given a piece of text it can find images that match its meaning.

The CLIP model can match semantically similar images and texts to points that are close to each other in a high-dimensional space.
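To make this concrete, here is a small sketch of the matching idea using the open-source clip package released by OpenAI; the image file name and the candidate captions are made-up examples. The script embeds one image and several captions into the shared space and ranks the captions by cosine similarity.

```python
# Minimal sketch of CLIP-style matching (pip install git+https://github.com/openai/CLIP.git).
# "example.jpg" and the captions below are placeholders for illustration.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained CLIP encoders

captions = [
    "a cat sleeping on a sofa",
    "an oil painting of a river town",
    "a photo of a skyscraper",
]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)  # project the image into the shared space
    txt_emb = model.encode_text(text)    # project each caption into the same space

# Normalize so the dot product equals cosine similarity: matching image/text
# pairs land close together, mismatched pairs land far apart.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.T).squeeze(0)

for caption, score in zip(captions, similarity.tolist()):
    print(f"{score:.3f}  {caption}")
```

The caption with the highest score is the one the model considers closest in meaning to the image, which is exactly the matching behavior described above.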

When generating a picture, DALL·E 2 first uses CLIP to encode the input text and then maps that text representation to a corresponding image representation. This image representation is handed to a diffusion model called GLIDE, which starts from pure random noise and removes the noise step by step until a picture emerges. Because the whole process involves randomness, a single sentence of input text can produce many different images, each of which matches the meaning of the text.
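The denoising loop at the heart of this step can be sketched in a few lines. The code below is a toy, DDPM-style sampler, not GLIDE or DALL·E 2 itself: `predict_noise` is a placeholder for the trained network, and the noise schedule is simplified.

```python
# Toy sketch of the reverse (denoising) loop used by diffusion models.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # simplified noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def predict_noise(x_t, t, condition):
    # Placeholder for a trained network that predicts the noise added at step t,
    # conditioned on the text/image representation.
    return torch.zeros_like(x_t)

def sample(condition, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                   # start from pure random noise
    for t in reversed(range(T)):
        eps = predict_noise(x, t, condition)
        # Remove a little of the predicted noise at each step.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a bit of noise
    return x

image = sample(condition=None)
```

Because the loop starts from random noise and re-injects a little noise at every step, running it twice on the same text yields two different but equally valid pictures, which is exactly the behavior described above.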

Besides OpenAI's models, another popular tool is Disco Diffusion, an open-source, CLIP-guided diffusion program (usually run in Google Colab). Its technical principles are very similar to DALL·E's, but in addition to entering a theme text, artists can directly control a number of image parameters.

Which is better, DALL·E 2 or Disco Diffusion? Each has its own merits, and it is hard to say; comparing works from the two has become something of a popular pastime in technology and design circles abroad. In general, the difference between their styles is obvious: DALL·E 2's pictures are more logical and realistic, and its photo-style images rarely look disturbing because of distortion, while Disco Diffusion's pictures are more imaginative, have a style of their own, and feel more "artistic".

Although these models are powerful, they cannot understand Chinese and struggle to generate images with Chinese characteristics, such as traditional Chinese paintings. Many Chinese institutions are therefore training creative models of their own. Baidu released Wenxin Yige in August 2022; it not only accepts Chinese input but can also generate Chinese paintings and images that evoke the mood of classical poems.

Image generated by Baidu's Wenxin Yige: "Jiangnan Water Village"

3. Disadvantages of Generated Images

Of course, while admiring AI works, we cannot ignore the problems AI painting brings. The first is the quality of the work itself. Although AI images can be visually striking, these models, like almost all other deep learning models, are weak at knowledge, reasoning, and logic. A prompt such as "a picture of the world's largest cat", or even "a dog sitting to the left of a cat", often fails to produce a picture that matches logic or common sense. And when generating realistic pictures of people, small deviations can trigger the uncanny valley effect, to the point of making viewers uncomfortable.

Another widely noticed problem is that AI often draws oddly shaped hands. This is probably because the hand is one of the most variable structures in the human body: a human hand has more than 20 joints (compared with only one joint in the face).

Moreover, in most training pictures the hands are not the focus of the image: they appear at different angles, distances, and gestures, and are often partly hidden by shadows or other objects.

Caption: The hands have a variety of postures

There are even stranger "hands" with unusual shapes and numbers of fingers. Because all of these images are labeled "hands", the model comes to treat their shapes, and the averaged form of those shapes, as plausible, and so it generates all kinds of misshapen hands.

Even these can be labeled as "hands"

Beyond quality, AI-generated content can also raise ethical issues. The biases and stereotypes that often appear in language models show up in image generation as well: asked to draw the "CEO of a big company", the model will most likely produce a mature white man.

A bigger concern is that the technology lowers the barrier to creating fake content. In one case, almost all of a company's team photos were generated by AI. Looking closely, some clues remain: the second person seated in the front row wears only one earring, and the ear contour of the second person from the left in the second row looks abnormal.

The article is produced by Science Popularization China-Starry Sky Project (Creation and Cultivation). Please indicate the source when reprinting.

Author: Guan Xinyu, popular science writer

Reviewer: Yu Yang, Head of Tencent Xuanwu Lab
