Developer’s statement: This is how I learned GAN

Generative Adversarial Network, or GAN, is a well-known idea proposed by Ian Goodfellow, and it has become one of the hottest topics in deep learning over the past two years; it seems that almost anything can be made with a GAN. I recently started learning about GANs, read some material, and took some notes.

1. Generation

What is generation? It means that the model learns from some data and then produces similar data. Let the machine look at some animal pictures and then draw new animal pictures by itself; that is generation.

In the past there were already several techniques that could be used for generation, such as the auto-encoder, which works as follows:

You train an encoder to convert the input into a code, then train a decoder to convert the code back into an image, and compute the MSE (mean squared error) between that image and the input. After training this model, take the second half of the network, the decoder, feed it a random code, and it can generate an image.
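As a rough illustration, here is a minimal sketch of such an auto-encoder in PyTorch. The fully connected architecture, the layer sizes, and the 784-dimensional flattened inputs are my own illustrative assumptions, not something specified in the article:

```python
import torch
import torch.nn as nn

# Minimal auto-encoder: the encoder compresses the input into a code,
# the decoder reconstructs an image from that code. Sizes are illustrative.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()           # pixel-wise mean squared error

x = torch.rand(64, 784)            # stand-in for a batch of real images
loss = criterion(model(x), x)      # reconstruction loss against the input
optimizer.zero_grad(); loss.backward(); optimizer.step()

# After training, "generation" means feeding the decoder a random code:
fake = model.decoder(torch.randn(1, 32))
```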

But the images generated by an auto-encoder in this way look rather strange, and you can tell at a glance whether they are real or fake. So later generative models such as the VAE were proposed. I don't know much about it, so I won't go into details here.

The generative models mentioned above actually have a serious drawback. Take the VAE: the image it generates is expected to be as similar to the input as possible, but how does the model measure this similarity? It computes a loss, usually the MSE, i.e., the mean squared error over individual pixels. Does a small loss really mean similarity?
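Concretely, if the target image x and the generated image x̂ each have N pixels, the pixel-wise loss in question is simply

$$ \mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\big(x_i - \hat{x}_i\big)^2 $$

which counts every pixel equally, no matter where the error occurs.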

For example, two generated pictures can differ from the target by the same number of pixels, yet one looks like a good generated picture while the other looks bad because its errors sit in visually important places. For the model above, however, the loss computed for the two pictures is the same, so they are judged to be equally good.

This is the drawback of the generative models above: the criterion used to measure the quality of generated images does not match what we actually want. Hence GAN, which is discussed below.

2. GAN

How does the famous GAN generate images? First of all, we all know that GAN has two networks, one is the generator and the other is the discriminator. Inspired by the two-person zero-sum game, the two networks compete with each other to achieve the best generation effect. The process is as follows:

First, there is a first-generation generator, which can only produce very poor pictures, and a first-generation discriminator, which can accurately tell the generated pictures apart from the real ones. In short, the discriminator is a binary classifier that outputs 0 for generated pictures and 1 for real pictures.

Next, we train a second-generation generator that produces slightly better images, good enough to make the first-generation discriminator believe they are real. Then we train a second-generation discriminator that can again accurately distinguish real images from the images produced by the second-generation generator. And so on for the third generation, the fourth generation... After n generations of generators and discriminators, the discriminator can no longer distinguish generated images from real images, and training has converged.
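To make the two roles concrete, here is a minimal sketch of a generator and a discriminator in PyTorch. The fully connected architectures, layer sizes, and 784-dimensional flattened images are my own illustrative assumptions, not from the article:

```python
import torch
import torch.nn as nn

# Generator: maps a random noise vector z to a (flattened) fake image.
G = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),
)

# Discriminator: maps an image to the probability that it is real
# (should output close to 1 for real images, close to 0 for generated ones).
D = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, 100)   # a batch of random noise vectors
fake = G(z)                # generated images, shape (16, 784)
p_real = D(fake)           # discriminator's belief that each image is real
```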

That’s it. Is that the end? Of course not. Let’s talk about how GAN works.

3. Principle

First, there is the distribution of the set of real pictures, Pdata(x), where x is a real picture and can be thought of as a vector; the distribution of this set of vectors is Pdata. We want to generate pictures that also lie within this distribution. The trouble is that we cannot obtain or work with this distribution directly.

The distribution produced by the generator we have can be written as PG(x;θ), a distribution controlled by θ, where θ is the set of parameters of the distribution (if it were a Gaussian mixture model, for example, θ would be the means and variances of the component Gaussians).
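For instance, in the Gaussian-mixture case the distribution would take the familiar form

$$ P_G(x;\theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(x;\, \mu_k, \Sigma_k\big), \qquad \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} $$

with mixture weights π_k, means μ_k and covariances Σ_k.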

Suppose we take some data from the true distribution, {x1, x2, ..., xm}; for each sample we can compute its likelihood under the generator's distribution, PG(xi; θ).

For these data, the likelihood in the generative model is
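$$ L = \prod_{i=1}^{m} P_G\big(x^i; \theta\big) $$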

We want to maximize this likelihood, which is equivalent to making the generator assign high probability to those real pictures. This becomes a maximum likelihood estimation problem: we need to find a θ* that maximizes the likelihood.

Finding a θ* that maximizes this likelihood is equivalent to maximizing the log likelihood, and because the m samples are drawn from the true distribution, the sum is approximately the expectation of log PG(x;θ) over x drawn from the true distribution:
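$$ \theta^{*} = \arg\max_{\theta} \prod_{i=1}^{m} P_G\big(x^i;\theta\big) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_G\big(x^i;\theta\big) \approx \arg\max_{\theta}\, \mathbb{E}_{x \sim P_{data}}\big[\log P_G(x;\theta)\big] $$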

The expectation over the true distribution can be written as an integral. Subtracting the term ∫ Pdata(x) log Pdata(x) dx, which does not depend on θ, changes nothing. Collecting the two integrals, flipping the fraction inside the logarithm and turning the max into a min converts the problem into minimizing a KL divergence, which describes the difference between two probability distributions:
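$$
\begin{aligned}
\theta^{*} &= \arg\max_{\theta}\left[\int_x P_{data}(x)\log P_G(x;\theta)\,dx - \int_x P_{data}(x)\log P_{data}(x)\,dx\right] \\
&= \arg\min_{\theta} \int_x P_{data}(x)\log\frac{P_{data}(x)}{P_G(x;\theta)}\,dx \\
&= \arg\min_{\theta}\, KL\big(P_{data}\,\|\,P_G\big)
\end{aligned}
$$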

Therefore, maximizing the likelihood, i.e., making the generator give the real pictures the highest possible probability, amounts to finding a θ that brings PG as close as possible to Pdata.

So how do we find the most reasonable θ? We can let the generator be a neural network and let PG(x;θ) be the distribution it induces.

First, we randomly sample a vector z and pass it through the network, G(z) = x, to generate a picture x. So how do we judge whether two distributions are similar? As long as we draw a set of samples z from some prior distribution, the network maps them into samples from another distribution PG, which we can then compare with the true distribution Pdata.
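Conceptually, the distribution induced by the generator from the prior over z can be written as

$$ P_G(x) = \int_z P_{prior}(z)\, I_{[G(z)=x]}\, dz $$

where I[·] is the indicator function. This integral cannot be evaluated in practice, which is why we need a different way to compare PG with Pdata.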

As we all know, a neural network with nonlinear activation functions can approximate essentially any function, and the same holds for distributions. So we can sample from a simple prior such as a Gaussian and train a neural network to transform those samples into a very complex distribution.

How to push PG closer to Pdata is exactly the contribution of GAN. First, the objective of GAN is given:
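$$ \min_G \max_D V(G,D), \qquad V(G,D) = \mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big] + \mathbb{E}_{x \sim P_G}\big[\log\big(1 - D(x)\big)\big] $$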

The nice property of this formula is that, with G fixed, max_D V(G,D) represents the difference between PG and Pdata; we then look for the best G that minimizes this maximum, i.e., that minimizes the difference between the two distributions.

On the surface, this means that D should make this formula as large as possible, that is, for x in the real distribution, D(x) should be close to 1, and for x from the generated distribution, D(x) should be close to 0. Then G should make the formula as small as possible, so that for x from the generated distribution, D(x) is as close to 1 as possible.

Now let's fix G to solve for D:
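With G fixed, the objective can be written as an integral over x:

$$ V(G,D) = \int_x \Big( P_{data}(x)\log D(x) + P_G(x)\log\big(1 - D(x)\big) \Big)\,dx $$

Maximizing the integrand pointwise for each x (it has the form a log D + b log(1 − D)) gives the optimal discriminator

$$ D^{*}(x) = \frac{P_{data}(x)}{P_{data}(x) + P_G(x)} $$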

For a given x, the maximizing D is the one shown above, and its value lies in the range (0, 1). Substituting this optimal D back into V(G, D), you can get:
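$$
\begin{aligned}
\max_D V(G,D) &= \mathbb{E}_{x \sim P_{data}}\!\left[\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}\right] + \mathbb{E}_{x \sim P_G}\!\left[\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}\right] \\
&= -2\log 2 + 2\,JSD\big(P_{data}\,\|\,P_G\big)
\end{aligned}
$$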

JS divergence is a symmetric, smoothed version of KL divergence and likewise measures the difference between two distributions. The derivation confirms what was said above: with G fixed, max_D V(G,D) represents the difference between the two distributions, with a minimum value of -2 log 2 (reached when the two distributions are identical) and a maximum value of 0.

Now we need to find the G that minimizes this quantity:
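$$ G^{*} = \arg\min_G \max_D V(G,D) = \arg\min_G \Big( -2\log 2 + 2\,JSD\big(P_{data}\,\|\,P_G\big) \Big) $$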

Looking at the formula above, when PG(x) = Pdata(x) the JS divergence is zero, so this is where G is optimal.

4. Training

With the above derivation, we can start training GAN. Combined with what we said at the beginning, two networks are trained alternately. We can have a G0 and D0 at the beginning, and train D0 first to find:
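$$ D_0^{*} = \arg\max_D V(G_0, D) $$

so that V(G0, D0*) measures the JS divergence between PG0 and Pdata.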

Then fix D0 and start training G0. You can use gradient descent in the training process. And so on, train D1, G1, D2, G2, ...

But there is a problem here. After training the discriminator you are at the point D0*, where V(G0, D0*) = max_D V(G0, D) measures the JS divergence between PG0 and Pdata. You then update G0 to G1 so that V(G1, D0*) becomes smaller than V(G0, D0*). However, there is no guarantee that for the new optimum D1* the value V(G1, D1*) is still smaller than V(G0, D0*): the maximizing discriminator may have moved, so the JS divergence between PG1 and Pdata may not actually have decreased. In that case the update of G does not achieve the desired effect.

The way to avoid this situation is to avoid updating G too much in each step, so that D0* remains close to optimal for the new generator.

Knowing the training order of the two networks, we also need to set two loss functions: one for D and one for G. The following are the specific steps of the entire GAN training:
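Putting the training order and the two losses into code, here is a minimal sketch of one version of the alternating training loop in PyTorch. The architectures, optimizer settings, batch size and the random stand-in data are my own illustrative assumptions, not from the article, and the G loss already uses the -log D(G(z)) form motivated in the next section:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator, as in the earlier sketch (sizes are illustrative).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.rand(16, 784)               # stand-in for a batch of real images
    ones = torch.ones(16, 1)
    zeros = torch.zeros(16, 1)

    # Step 1: update D -- push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(16, 100)).detach()  # detach: this step must not update G
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: update G -- push D(G(z)) toward 1 to fool the (now fixed) D.
    fake = G(torch.randn(16, 100))
    loss_G = bce(D(fake), ones)              # equivalent to minimizing -log D(G(z))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```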

The above steps are also very common in machine learning and deep learning and are easy to understand.

5. Problems

However, there is still a small problem with the loss function of G above. Compare the two candidate curves for G's loss, log(1 - D(x)) and -log(D(x)):

log(1-D(x)) is the loss function of G in the derivation above, but when D(x) is close to 0 this function is very flat and its gradient is very small. Early in training, D can easily reject the generated samples, so D(x) stays close to 0 and G improves only very slowly when trying to deceive D. The alternative curve, -log(D(x)), has the same trend (both decrease as D(x) grows), but its advantage is that when D(x) is close to 0 the gradient is large, which helps training; as D(x) increases the gradient shrinks, which also matches intuition: training should move fast in the early stages and more slowly later on.

So we modify the loss function of G to
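$$ \mathcal{L}_G = -\frac{1}{m}\sum_{i=1}^{m} \log D\big(G(z^i)\big) $$

i.e., instead of minimizing log(1 - D(G(z))), G now maximizes log D(G(z)).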

This can increase the speed of training.

There is another problem, which is raised in other papers. Experiments show that after many rounds of training the discriminator's loss stays flat, i.e., max_D V(G,D) stays near 0 and the measured JS divergence is always log 2, as if PG and Pdata had no overlap at all, even though the two distributions actually do overlap. The reason is that we cannot really compute the expectations and integrals; we can only estimate them from samples. If the discriminator overfits, it can still completely separate the two sets of sample points even when the underlying distributions overlap.

Regarding this issue, should we make D weaker and reduce its classification ability? But in theory we also want D to be powerful so that it can effectively distinguish real images from fake ones, so there is a contradiction here.

Another possible reason is that, although both distributions live in a high-dimensional space, they are both very narrow, so their overlap may be tiny. This also makes the computed JS divergence come out as log 2, as if there were no overlap at all.

Some solutions add noise to make the two distributions wider, which may increase their overlap so that the JS divergence becomes informative; the noise, however, needs to be decayed gradually over time.

There is another problem, called Mode Collapse. For example, the real data distribution may be bimodal, but the learned generated distribution ends up covering only one of the modes. We can see what the model has learned, but we have no way of seeing the parts of the distribution it has failed to learn.

One explanation for this is the direction in which the KL divergence is written, i.e., which of the two distributions comes first:
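$$ KL\big(P_{data}\,\|\,P_G\big) = \int_x P_{data}(x)\log\frac{P_{data}(x)}{P_G(x)}\,dx, \qquad KL\big(P_G\,\|\,P_{data}\big) = \int_x P_G(x)\log\frac{P_G(x)}{P_{data}(x)}\,dx $$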

In the first (forward) direction, any region where Pdata has probability mass but PG does not makes the divergence blow up to infinity, so PG is forced to cover every place where Pdata appears and mode collapse will not occur. In the reverse direction, PG can drop some modes of Pdata at almost no cost, which matches the mode collapse behaviour.

6. References

These are some notes and thoughts from an introductory study of GAN. (Toward the end I was too lazy to type out the formulas.) I mainly referred to Professor Li Hongyi's video:

http://t.cn/RKXQOV0

This article is reproduced from Leifeng.com. The author of this article is Ma Shaonan and it was originally published in the author’s Zhihu column.
