Table of contents

■ Introduction
■ Step 1: Understand the image as a sample of a probability distribution
  - How do you fill in the missing information?
  - But how do we do statistics? These are images.
  - So how do we complete the image?
■ Step 2: Quickly generate fake images
  - Learning to generate new samples from an unknown probability distribution
  - [ML-Heavy] The architecture of the Generative Adversarial Net (GAN)
  - Generate fake images using G(z)
  - [ML-Heavy] Training DCGAN
  - Existing GAN and DCGAN implementations
  - [ML-Heavy] Building DCGANs in TensorFlow
  - Running DCGAN on an image set
■ Step 3: Find the best fake image for image completion
  - Image completion with DCGAN
  - [ML-Heavy] The loss function for projecting onto p_g
  - [ML-Heavy] Using TensorFlow for DCGAN image completion
  - Complete the image
■ Conclusion

Introduction

Content-aware fill is a powerful tool that designers and photographers use to fill in unwanted or missing parts of an image. Image completion and inpainting are two closely related techniques for filling in missing or damaged parts of an image. There are many ways to implement content-aware fill, image completion, and inpainting. In this blog post, I will introduce a paper by Raymond Yeh, Chen Chen et al., "Semantic Image Inpainting with Perceptual and Contextual Losses", published on arXiv on July 26, 2016, which describes how to use a DCGAN to perform image completion. The post is intended for readers with a general technical background; some of the content requires a machine learning background. I have marked the relevant sections with the [ML-Heavy] tag, so you can skip them if you don't want to go into too much detail. We will only cover the case of filling in missing parts of human face images. The TensorFlow code accompanying this post has been published on GitHub: bamos/dcgan-completion.tensorflow. Image completion is divided into three steps:

1. We first interpret images as samples from a probability distribution.
2. This interpretation lets us learn how to generate fake images.
3. We then find the best fake image for completing the missing parts.
Using Photoshop to fill in missing parts of an image, and using Photoshop to automatically remove unwanted parts.

Image completion, as described below. The centers of the images are automatically generated. The source code can be downloaded from here. These images are a random sample I took from the LFW dataset.

Step 1: Understand the image as a sample of a probability distribution

1. How do you fill in the missing information?

In the examples above, imagine you were building a system that fills in the missing pieces. How would you go about it? How do you think the human brain would do it? What kind of information would you use? In this post we will focus on two types of information:

Contextual information: you can infer the missing pixels from the surrounding pixels.
Perceptual information: you fill in the blanks with what counts as "normal", such as what you see in real life or in other pictures.

Both are important. Without contextual information, how do you know which values to fill in? Without perceptual information, the same context can generate countless possibilities. Some images that look "normal" to a machine learning system may not look normal at all to humans.

It would be great if there were an exact, intuitive algorithm that captured both of the properties mentioned in the image completion steps above. It is possible to construct such an algorithm for specific cases, but there is no general approach. The best current solutions obtain an approximation through statistics and machine learning.

2. But how do we do statistics? These are images.

To stimulate your thinking, we start with a probability distribution that is easy to understand and can be written in a simple closed form: the normal distribution. This is the probability density function (PDF) of the normal distribution. You can think of the PDF as moving horizontally across the input space, with the vertical axis giving the probability of a value occurring. (If you are interested, the code for drawing these plots can be found in bamos/dcgan-completion.tensorflow: simple-distributions.py.)

By sampling from this distribution, we obtain data. What needs to be understood is the relationship between the PDF and the samples.

Samples drawn from the normal distribution's PDF.

This is a 1D distribution, because the input lies along a single dimension. The same can be done in two dimensions.

PDF and samples of a 2D distribution. The PDF is shown as a contour plot with the sample points plotted on top.

The key connection between images and statistics is that we can interpret images as samples from a high-dimensional probability distribution. The probability distribution is over the pixels of the image. Imagine you are taking a picture with a camera. The image you get is composed of a finite number of pixels, and when you take the picture you are sampling from this complex probability distribution. It is this distribution that determines whether we judge an image to be normal or abnormal. Unlike the normal distribution, for images we cannot write down the true probability distribution; we can only collect samples. In this article we use color images, represented in RGB. Our images are 64 pixels wide and 64 pixels high, so our probability distribution is 64⋅64⋅3 ≈ 12k dimensional.

3. So how do we complete the image?

Let's first consider the multivariate normal distribution for inspiration. Given x = 1, what is the most likely value of y?
We can fix the value of x and then find the y that maximizes the PDF.

In a multivariate normal distribution, given x, find the most likely y.

This concept extends naturally to an image probability distribution. We know some values and want to fill in the missing ones, which can be understood as a maximization problem: we search over all possible missing values, and the image used for completion is the most likely one.

Looking at samples from the normal distribution, it seems we could recover the PDF from the samples alone: just pick the statistical model you like and fit it to the data. However, we don't actually use this approach in practice. For simple distributions the PDF is easy to recover, but for more complex image distributions it is very difficult and intractable. Part of the complexity comes from intricate conditional dependencies: the value of one pixel depends on the values of other pixels in the image. In addition, maximizing a general PDF is a very difficult and intractable non-convex optimization problem.

Step 2: Quickly generate fake images

1. Learning to generate new samples from an unknown probability distribution

Besides learning how to compute PDFs, another mature idea in statistics is learning how to generate new (random) samples with a generative model. Generative models are generally difficult to train and handle, but the deep learning community has recently made amazing progress in this area. Yann LeCun gave a great discussion of how to train generative models in this Quora answer, calling it the most interesting idea in machine learning in the last 10 years.

Yann LeCun's introduction to Generative Adversarial Networks.
I like to think of a GAN as an arcade game: two networks compete against each other and improve together, just like two humans competing in a game. Other deep learning methods, such as Variational Autoencoders (VAEs), can also be used to train generative models. In this blog post, we use Generative Adversarial Nets (GANs).

2. [ML-Heavy] The architecture of the Generative Adversarial Net (GAN)

This idea was proposed in Ian Goodfellow et al.'s landmark paper "Generative Adversarial Nets" (GANs), presented at the 2014 Neural Information Processing Systems (NIPS) conference. The main idea is that we define a simple, commonly used distribution, denoted p_z. In the following, p_z is the uniform distribution on the closed interval [-1, 1]. We denote a sample from this distribution by z ∼ p_z. If p_z is five-dimensional, we can sample it with one line of Python and numpy:
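For example (a minimal equivalent, assuming numpy is imported as np):

```python
import numpy as np

# One five-dimensional sample z ~ p_z, uniform on [-1, 1].
z = np.random.uniform(-1, 1, 5)
```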
Now that we have a simple distribution to sample from, we can define a function G(z) that takes such a sample and produces a sample from our original image probability distribution.
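As a placeholder, you can picture G(z) as any function from a latent vector to an image. The stub below just returns noise; the real G(z) will be the deep network described in the next section:

```python
import numpy as np

def G(z):
    # Stand-in generator: maps a latent vector z to a 64x64x3 "image".
    # The real G(z) is a trained deep network, not random noise.
    rng = np.random.RandomState(abs(hash(z.tobytes())) % (2 ** 32))
    return rng.uniform(0.0, 1.0, size=(64, 64, 3))

z = np.random.uniform(-1, 1, 5)   # z ~ p_z
image_sample = G(z)               # a (meaningless, for now) image sample
```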
So how do we define G(z) so that it takes a vector as input and outputs an image? We will use a deep neural network. There are many tutorials on the basics of neural networks, so I won't cover them here. Some good references are the Stanford CS231n course, Ian Goodfellow et al.'s Deep Learning book, Image Kernels Explained Visually, and the convolution arithmetic guide.

There are many ways to construct a G(z) based on deep learning. The original GAN paper proposed the idea, a training procedure, and some preliminary experimental results. The idea has since been greatly developed; one extension was proposed in the paper "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Alec Radford, Luke Metz, and Soumith Chintala, published at the 2016 International Conference on Learning Representations (ICLR, pronounced "eye-clear"). This paper proposed deep convolutional GANs (DCGANs), which use fractionally-strided convolutions to upsample images.

So what is a fractionally-strided convolution, and how does it upsample an image? Vincent Dumoulin and Francesco Visin's paper "A guide to convolution arithmetic for deep learning" and the accompanying convolution arithmetic project are a great introduction to convolutions in deep learning. The diagrams are excellent and give an intuitive understanding of how fractionally-strided convolutions work. First, make sure you understand how an ordinary convolution slides a kernel across the input space (blue) and produces an output space (green); here, the output is smaller than the input. (If this isn't clear, refer to the CS231n CNN section or the convolution arithmetic guide.)

Illustration of a convolution operation, with blue being the input and green being the output.

Next, suppose you have a 3x3 input. Our goal is to upsample, so that we get a larger output. You can think of a fractionally-strided convolution as enlarging the input image by inserting zeros between its pixels, and then performing an ordinary convolution on this enlarged image to obtain a larger output. Here, the output is 5x5 (a short numpy sketch of this idea appears at the end of this section).

Illustration of a fractionally-strided convolution operation, with blue being the input and green being the output.

A side note: convolutional layers that perform upsampling go by many names: full convolution, in-network upsampling, fractionally-strided convolution, backwards convolution, deconvolution, upconvolution, or transposed convolution. Using the term "deconvolution" is strongly discouraged, because it already has other meanings: it is a specific mathematical operation, and it has a completely different meaning in other areas of computer vision.

Now that we have the fractionally-strided convolution as a building block, we can construct a representation of G(z) that takes a vector z ∼ p_z as input and outputs a 64x64x3 RGB image.

One way of constructing a generator with DCGAN. Image from the DCGAN paper.

The DCGAN paper also proposed other techniques and adjustments for training DCGANs, such as batch normalization and leaky ReLUs.

3. Generate fake images using G(z)

Let's pause and appreciate how powerful G(z) is! The DCGAN paper showed what a DCGAN trained on a bedroom dataset can do: G(z) can produce the following fake images of what the generator thinks bedrooms look like. None of the images below are in the original dataset!

Alternatively, you can also perform algebraic operations in the input space z. Below is a network that generates faces.

Face arithmetic based on DCGAN. Image from the DCGAN paper.
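Here is the promised numpy sketch of the "insert zeros, then convolve" idea behind fractionally-strided convolutions. It only illustrates the upsampling effect; it is not how DCGAN implements the layer, which uses a real transposed-convolution op:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.arange(1.0, 10.0).reshape(3, 3)   # a 3x3 input

# Stretch the input by inserting zeros between its pixels: 3x3 -> 5x5.
stretched = np.zeros((5, 5))
stretched[::2, ::2] = x

# An ordinary convolution over the stretched input yields a larger, 5x5 output.
kernel = np.full((3, 3), 1.0 / 9.0)
out = convolve2d(stretched, kernel, mode='same')
print(out.shape)   # (5, 5)
```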
4. [ML-Heavy] Training DCGAN

Now that we have defined G(z) and seen how powerful it is, how do we train it? We have many unknown variables (parameters) to find, and this is where the adversarial network comes in.

First we define some notation. The (unknown) probability distribution of the data is denoted p_data. G(z), where z ∼ p_z, can be understood as drawing samples from a probability distribution; let's denote this distribution by p_g. In summary:

p_z — the distribution over z; simple and known.
p_data — the distribution over images; unknown, and the source of the image data samples.
p_g — the distribution the generator G samples from; we hope that p_g = p_data.

The discriminator network D(x) takes an input image x and returns the probability that x was sampled from p_data. In theory, the discriminator outputs a value close to 1 when the input image comes from p_data, and a value close to 0 when the input is a fake image, such as one sampled from p_g. In DCGANs, D(x) is a conventional convolutional neural network.

The discriminator convolutional neural network. Image from the image inpainting paper.

The goals of training the discriminator are:
1. For each image from the true data distribution, x ∼ p_data, maximize D(x).
2. For each image not from the true data distribution, x ≁ p_data, minimize D(x).

The training goal of the generator G(z) is to produce samples that fool D. Its output is an image that can be fed into the discriminator, so the generator wants to maximize D(G(z)), or equivalently to minimize (1 − D(G(z))), since D outputs a probability between 0 and 1. The paper shows that the adversarial networks are trained with the min-max objective below. The expectation in the first term runs over the real data distribution, and the expectation in the second term runs over samples from p_z, i.e. over G(z) ∼ p_g.
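Written out (as I recall it from the GAN paper), the objective is:

$$ \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] $$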
The gradients of this expression with respect to the parameters of D and G can be used to train them. We know how to quickly compute every part of the expression: the expectations can be estimated with minibatches of size m, and the inner maximization can be estimated with k gradient steps. It turns out that k = 1 works well for training. Let θ_d denote the parameters of the discriminator and θ_g the parameters of the generator. The gradients of the loss with respect to θ_d and θ_g can be computed by backpropagation, because both D and G are composed of well-established neural network modules. The following is the training strategy from the GAN paper (sketched after this paragraph). In theory, after training, p_g = p_data, so G(z) can generate samples that follow the p_data distribution.

The training algorithm from the GAN paper.
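Schematically, each iteration takes k (here k = 1) minibatch gradient ascent steps on the discriminator parameters and one descent step on the generator parameters, roughly:

$$ \theta_d \leftarrow \theta_d + \eta \,\nabla_{\theta_d} \frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(x^{(i)}\big) + \log\big(1 - D(G(z^{(i)}))\big)\Big], \qquad \theta_g \leftarrow \theta_g - \eta \,\nabla_{\theta_g} \frac{1}{m}\sum_{i=1}^{m}\log\big(1 - D(G(z^{(i)}))\big) $$

where η stands for whatever gradient-based update rule is used; the paper allows any such rule, for example momentum.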
5. Existing GAN and DCGAN implementations

There are many great GAN and DCGAN implementations on GitHub; we will build our model on top of carpedm20/DCGAN-tensorflow.

6. [ML-Heavy] Building DCGANs in TensorFlow

The implementation for this part is in my bamos/dcgan-completion.tensorflow GitHub repository. I want to emphasize that this part of the code comes from Taehoon Kim's carpedm20/DCGAN-tensorflow; I use it in my own repository so that we can reuse it for image completion in the next part. Most of the implementation lives in a Python class, DCGAN, in model.py. Putting everything in a class has many advantages: it lets us save the intermediate state after training and load it later for reuse.

First we define the generator and discriminator architectures. The linear, conv2d_transpose, conv2d, and lrelu functions are defined in ops.py.
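Below is a self-contained sketch of DCGAN-style generator and discriminator architectures. It uses tf.layers (TensorFlow 1.x) rather than the repository's ops.py helpers, and the kernel sizes, layer widths, and scope names are illustrative assumptions, so see model.py for the exact code:

```python
import tensorflow as tf

def generator(z, training=True):
    # Project z to a 4x4 feature map, then upsample 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64
    # with fractionally-strided convolutions; tanh keeps output pixels in [-1, 1].
    with tf.variable_scope('generator'):
        h = tf.layers.dense(z, 4 * 4 * 512)
        h = tf.reshape(h, [-1, 4, 4, 512])
        h = tf.nn.relu(tf.layers.batch_normalization(h, training=training))
        for filters in (256, 128, 64):
            h = tf.layers.conv2d_transpose(h, filters, 5, strides=2, padding='same')
            h = tf.nn.relu(tf.layers.batch_normalization(h, training=training))
        h = tf.layers.conv2d_transpose(h, 3, 5, strides=2, padding='same')
        return tf.nn.tanh(h)

def discriminator(images, training=True, reuse=False):
    # A standard strided CNN: 64x64x3 down to 4x4x512, ending in one "real vs. fake" logit.
    with tf.variable_scope('discriminator', reuse=reuse):
        h = tf.nn.leaky_relu(tf.layers.conv2d(images, 64, 5, strides=2, padding='same'), 0.2)
        for filters in (128, 256, 512):
            h = tf.layers.conv2d(h, filters, 5, strides=2, padding='same')
            h = tf.nn.leaky_relu(tf.layers.batch_normalization(h, training=training), 0.2)
        logits = tf.layers.dense(tf.layers.flatten(h), 1)
        return tf.nn.sigmoid(logits), logits   # probability of "real", and the raw logit
```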
When we initialize this class, we use these two functions to build the model. We need two copies of the discriminator that share (reuse) parameters: one for the minibatch of images from the data distribution, and one for the minibatch of images produced by the generator.
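Continuing the sketch above, the wiring looks roughly like this (the placeholder names are mine, not necessarily those used in model.py):

```python
image_shape = [64, 64, 3]
z_dim = 100

images = tf.placeholder(tf.float32, [None] + image_shape, name='real_images')
z = tf.placeholder(tf.float32, [None, z_dim], name='z')

G = generator(z)                                       # fake images from the generator
D_real, D_logits_real = discriminator(images)          # discriminator on real images
D_fake, D_logits_fake = discriminator(G, reuse=True)   # same weights, applied to G's output
```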
Next, we define the loss functions. Instead of using the sums above directly, it is more convenient to use the cross entropy between D's predictions and the labels we want. The discriminator should predict 1 for all "real" images and 0 for all "fake" images produced by the generator; the generator wants the discriminator to predict 1 for its images.
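In the same sketch, those losses look roughly like this (model.py uses its own variable names):

```python
# D should output 1 for real images and 0 for generated ones; G wants D to output 1.
d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=D_logits_real, labels=tf.ones_like(D_real)))
d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=D_logits_fake, labels=tf.zeros_like(D_fake)))
d_loss = d_loss_real + d_loss_fake
g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=D_logits_fake, labels=tf.ones_like(D_fake)))
```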
We group the trainable variables of each model together so that they can be trained separately.
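In this sketch the split can be made by variable scope; the repository instead filters on the 'd_'/'g_' prefixes of its layer names:

```python
t_vars = tf.trainable_variables()
d_vars = [v for v in t_vars if v.name.startswith('discriminator')]
g_vars = [v for v in t_vars if v.name.startswith('generator')]
```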
Now we optimize the parameters using the ADAM optimizer, an adaptive optimization method often used for non-convex problems. It is very competitive with SGD and usually requires less manual tuning of the learning rate, momentum, and other hyperparameters.
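A sketch of the two optimizers. The learning rate and beta1 below are the values suggested in the DCGAN paper, and the control dependencies are only needed because this sketch uses tf.layers.batch_normalization:

```python
learning_rate, beta1 = 0.0002, 0.5

# Attach the batch-norm moving-average updates of each network to its optimizer.
d_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope='discriminator')
g_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope='generator')

with tf.control_dependencies(d_update_ops):
    d_optim = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(
        d_loss, var_list=d_vars)
with tf.control_dependencies(g_update_ops):
    g_optim = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(
        g_loss, var_list=g_vars)
```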
Next we loop over the data. At each iteration, we sample a minibatch and use the optimizers to update the networks. Interestingly, G is updated twice per iteration so that the discriminator's loss does not go to zero. Also, I think the final calls to d_loss_fake and d_loss_real do some unnecessary computation, since those values were already computed as part of d_optim and g_optim. As a TensorFlow exercise, you can try optimizing this part and sending a PR to the original repository.
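A schematic version of that loop, continuing the sketch (next_image_batch and num_steps are hypothetical stand-ins for the repository's data pipeline and configuration):

```python
import numpy as np

batch_size = 64
num_steps = 10000   # placeholder; the repository trains for a number of epochs instead

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        batch_images = next_image_batch(batch_size)   # 64x64x3 images scaled to [-1, 1]
        batch_z = np.random.uniform(-1, 1, [batch_size, z_dim]).astype(np.float32)

        # One D update, then two G updates so that d_loss does not collapse to zero.
        sess.run(d_optim, feed_dict={images: batch_images, z: batch_z})
        sess.run(g_optim, feed_dict={z: batch_z})
        sess.run(g_optim, feed_dict={z: batch_z})

        errD, errG = sess.run([d_loss, g_loss],
                              feed_dict={images: batch_images, z: batch_z})
        print(step, errD, errG)
```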
That's it! Of course, the full code has more comments and can be found in model.py.

7. Running DCGAN on an image set

If you skipped the previous section but want to play with the code, it lives in the bamos/dcgan-completion.tensorflow GitHub repository. I want to reiterate that this code is based on Taehoon Kim's carpedm20/DCGAN-tensorflow; we use my repository here because it is convenient for the next step. Warning: if you don't have a CUDA-enabled GPU, training this network will be very slow.

First, clone my bamos/dcgan-completion.tensorflow repository and OpenFace locally. We will use the Python-only part of OpenFace for image preprocessing; don't worry, you don't need to install OpenFace's Torch dependencies. Create a new working directory and clone the two repositories into it.
Next, install OpenCV and dlib with Python 2 support. If you're interested, you can try to get dlib working with Python 3. There are a few installation tips; I wrote some notes in the OpenFace setup guide, including which versions I installed and how to install them. Next, install the OpenFace Python library so that we can preprocess images. If you are not using a virtual environment, you will need sudo to install it globally when running setup.py. (If this part gives you trouble, you can also use the OpenFace Docker installation.)

Next, download a dataset of face images. It doesn't matter whether the dataset has annotations; we will delete them. An incomplete list: MS-Celeb-1M, CelebA, CASIA-WebFace, FaceScrub, LFW, and MegaFace. Put the images in dcgan-completion.tensorflow/data/your-dataset/raw to mark them as the dataset's raw data. Now we use OpenFace's alignment tool to preprocess the images into 64x64 data.
Finally we flatten the directory of processed images so that there are only images in the directory without subfolders.
Now we can train the DCGAN. Install TensorFlow and start the training.
You can see what randomly sampled images from the generator look like in the sample folder. I trained on the CASIA-WebFace and FaceScrub datasets because I had them at hand. After 14 epochs of training, my samples looked like this.

Samples from the DCGAN after 14 epochs of training on CASIA-WebFace and FaceScrub.

You can also view the TensorFlow graph and the loss functions on TensorBoard.
TensorBoard loss visualization, updated in real time during training.

TensorBoard visualization of the DCGAN network.

Step 3: Find the best fake image for image completion

1. Image completion with DCGAN

Now that we have the discriminator D(x) and the generator G(z), how do we use them for image completion? In this chapter I introduce the paper by Raymond Yeh, Chen Chen et al., "Semantic Image Inpainting with Perceptual and Contextual Losses", published on arXiv on July 26, 2016.

For an image y, a reasonable-looking but infeasible approach is to maximize D(y) over the missing pixels. The result belongs neither to the data distribution (p_data) nor to the generated distribution (p_g). What we want is to project y onto the generated distribution.

(a): Ideal reconstruction of y onto the generated distribution (the blue surface). (b): A failure case of trying to reconstruct y by maximizing D(y). Image from the image inpainting paper.

2. [ML-Heavy] The loss function for projecting onto p_g

To give a reasonable definition of this projection, let's first define some notation for image completion. We use a binary mask M, which takes only the values 0 and 1: a value of 1 means we want to keep that part of the image, and a value of 0 means that part needs to be completed. Given a binary mask M, we can now define the completion of y: multiply the elements of y by the elements of M. This element-wise multiplication of two matrices is also called the Hadamard product, written M ⊙ y. M ⊙ y gives the original, known part of the image.

Illustration of a binary mask.

Next, suppose we have found a ẑ whose G(ẑ) gives a reasonable reconstruction of the missing values. The completed pixels (1 − M) ⊙ G(ẑ) can then be added to the original pixels to obtain the reconstructed image.
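Written out, the reconstruction is:

$$ x_{\text{reconstructed}} = M \odot y + (1 - M) \odot G(\hat{z}) $$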
Now our task is to find a ẑ that yields a G(ẑ) suitable for completing the image. To find ẑ, let's revisit the context and perception mentioned at the beginning of the article and use them to condition the DCGAN. To this end, we define a loss function for any z ∼ p_z: the smaller the loss, the more suitable z is as a candidate ẑ.

Contextual loss: To keep the same context as the input image, we need G(z) to be as similar as possible to y at the known pixel positions; G(z) is penalized when its output does not match y at those positions. To do this, we subtract the pixels of y from the corresponding pixels of G(z) and measure how much they differ.
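This gives the contextual loss:

$$ \mathcal{L}_{\text{contextual}}(z) = \big\| M \odot G(z) - M \odot y \big\|_1 $$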
where ||x||_1 = Σ_i |x_i| is the ℓ1 norm of a vector x. The ℓ2 norm would also work, but the paper reports that in practice the ℓ1 norm performs better. Ideally, the known pixels of y and G(z) are equal, that is, ||M ⊙ G(z)_i − M ⊙ y_i|| = 0 for every pixel i at a known position, and then L_contextual(z) = 0.

Perceptual loss: To reconstruct an image that looks real, we need to make sure the discriminator judges the image to be real. To do this, we reuse the same term used when training the DCGAN's generator.
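That is, the perceptual loss is:

$$ \mathcal{L}_{\text{perceptual}}(z) = \log\big(1 - D(G(z))\big) $$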
Finally, we find ẑ by combining the contextual loss and the perceptual loss.
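Combining the two:

$$ \mathcal{L}(z) \equiv \mathcal{L}_{\text{contextual}}(z) + \lambda\, \mathcal{L}_{\text{perceptual}}(z), \qquad \hat{z} \equiv \arg\min_{z} \mathcal{L}(z) $$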
where λ is a hyperparameter that balances how important the contextual loss is relative to the perceptual loss. (I used the default λ = 0.1 and did not experiment with it.) As before, G(ẑ) is then used to reconstruct the missing parts of y.
The completed image can also be post-processed with Poisson blending to make it smoother.

3. [ML-Heavy] Using TensorFlow for DCGAN image completion

This chapter presents my modifications to Taehoon Kim's carpedm20/DCGAN-tensorflow code for image completion.
We can iteratively find arg min_z L(z) by gradient descent on the gradient ∇_z L(z). Once we define the loss function, TensorFlow's automatic differentiation can compute this gradient for us! So a complete DCGAN-based implementation of completion can be obtained by adding four lines of TensorFlow code to the existing DCGAN implementation. (Of course, some non-TensorFlow code is also required.)
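In the style of the earlier sketch, the extra graph pieces look roughly like this (model.py differs in variable names and details, such as flattening each image before summing):

```python
lam = 0.1   # weight of the perceptual loss

mask = tf.placeholder(tf.float32, image_shape, name='mask')

contextual_loss = tf.reduce_sum(
    tf.abs(tf.multiply(mask, G) - tf.multiply(mask, images)), [1, 2, 3])
perceptual_loss = g_loss
complete_loss = contextual_loss + lam * perceptual_loss
grad_complete_loss = tf.gradients(complete_loss, z)
```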
Next, we define the mask. I only added one for the center of the image; you can add something else, like a random mask, and send a pull request.
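A sketch of the centered mask, continuing from above (1 = keep, 0 = complete; scale controls the size of the hole):

```python
image_size = 64
scale = 0.25

mask_np = np.ones(image_shape, dtype=np.float32)
l = int(image_size * scale)
u = int(image_size * (1.0 - scale))
mask_np[l:u, l:u, :] = 0.0   # zero out the central square
```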
For gradient descent on z, we use projected gradient descent with momentum over minibatches, projecting z back onto [-1, 1].
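Continuing the sketch inside the training session from before; the iteration count, momentum, and step size below are assumptions, not the repository's defaults:

```python
n_iter, momentum, lr = 1000, 0.9, 0.01

batch_images = next_image_batch(batch_size)   # the damaged images to complete
zhats = np.random.uniform(-1, 1, [batch_size, z_dim]).astype(np.float32)
v = 0

for i in range(n_iter):
    fd = {z: zhats, mask: mask_np, images: batch_images}
    loss, g, G_imgs = sess.run([complete_loss, grad_complete_loss, G], feed_dict=fd)

    # Momentum step on z, then project each z back onto [-1, 1].
    v_prev = np.copy(v)
    v = momentum * v - lr * g[0]
    zhats += -momentum * v_prev + (1 + momentum) * v
    zhats = np.clip(zhats, -1, 1)
```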
4. Complete the image

Select some images for completion and put them in dcgan-completion.tensorflow/your-test-data/raw. Then align them as before into dcgan-completion.tensorflow/your-test-data/aligned. Here I randomly picked some images from LFW; my DCGAN was not trained on LFW images. You can then run the completion code from the repository on these aligned images.
The completion code generates images and periodically writes them to the folder given by --outDir. You can then use ImageMagick to assemble them into a GIF.
The final image completions. The centers of the images are automatically generated. The source code can be downloaded from here. These are samples I randomly picked from LFW.

Conclusion

Thanks for reading, we're done! In this article, we covered a method for image completion that:
1. Interprets images as samples from a probability distribution.
2. Generates fake images.
3. Finds the best fake image for completion.

My example used faces, but DCGANs can be applied to other types of images as well. GANs in general are still difficult to train, and it is not yet clear how to train them on a specific class of objects or how to scale them to large images. However, they are a very promising model, and I am excited to see where GANs take us!