Get into the right posture: a practical look at Google's god-level deep learning framework, TensorFlow

On November 9, 2015, Google released its artificial intelligence system TensorFlow and announced it would be open source. The move had a huge impact on the deep learning field and drew intense attention from deep learning developers. Doubts about artificial intelligence certainly remain, but it is undeniable that AI is still the direction of future development.

TensorFlow became the most-watched project on GitHub the day it launched. As one of the best ways to build deep learning models and the leader among deep learning frameworks, it easily collected more than 10,000 stars in its release week. This is mainly due to Google's remarkable R&D record in artificial intelligence and its god-level pool of technical talent. Another factor, of course, is AlphaGo, which first defeated humans at Go and then kept a 60-game unbeaten streak; its reinforcement learning framework was also implemented on top of TensorFlow's high-level API.

TensorFlow: why choose it?


As Google's second-generation deep learning framework, TensorFlow, which uses data flow graphs for computation, has become one of the most popular frameworks in machine learning and deep learning. Since its release it has kept improving and adding new features. On February 26 of this year, TensorFlow 1.0 was officially released at the first annual TensorFlow Developer Summit in Mountain View. Its biggest highlight is speed: model optimizations make it incredibly fast. Even more unexpectedly, many supporters treat the release of TensorFlow 1.0 as marking the first year of AI.


According to the Google Trends chart above, deep learning tops current technology search trends.

TensorFlow has already achieved the following:

  • TensorFlow is used in many Google applications, including Gmail, Google Play recommendations, Search, Translate, Maps, and more;
  • In medicine, scientists use TensorFlow to build retinal algorithms to screen for diabetic blindness (it is also mentioned later that a Stanford PhD used TensorFlow to predict skin cancer, work that made the cover of Nature);
  • Deep learning models built with TensorFlow in music and painting help humans better understand art;
  • TensorFlow combined with high-tech equipment powers an automated marine life detection system that helps scientists track marine life;
  • TensorFlow is gaining momentum on mobile clients, with many mobile devices using it for translation, stylization, and other tasks;
  • TensorFlow achieves higher performance and lower power consumption on mobile device CPUs (Qualcomm 820);
  • The TensorFlow ecosystem, combined with other open source projects, can quickly assemble a high-performance production environment;
  • TensorBoard supports embedding visualization;
  • It helps PhD students and researchers get project research moving quickly.

Google's first-generation distributed machine learning framework, DistBelief, no longer met Google's internal needs, so Google engineers redesigned it into TensorFlow, introducing support for a range of computing devices (CPU/GPU/TPU). TensorFlow runs well on mobile platforms such as Android, iOS, and Raspberry Pi, and supports multiple languages (thanks to the various high-level APIs; training supports only Python, while inference supports C++, Go, Java, and others). It also ships great tools like TensorBoard, which markedly improve the efficiency of deep learning researchers.

The use of TensorFlow in Google's internal projects is also growing rapidly: it appears in many Google products such as Gmail, Google Play recommendations, Search, Translate, and Maps, and nearly 100 projects and papers use TensorFlow for related work.

TensorFlow also accomplished a lot in the 14 months before the official 1.0 release: 475+ non-Google contributors, 14,000+ commits, more than 5,500 GitHub projects with TensorFlow in the title, 5,000+ answered questions on Stack Overflow, an average of 80+ issues filed per week, and use in top academic research projects such as Neural Machine Translation, Neural Architecture Search, and Show and Tell.

Deep learning, of course, means replacing hand-crafted features with unsupervised or semi-supervised feature learning, hierarchical feature extraction, and high-level algorithms. Researchers and developers working on deep learning use not only TensorFlow but also many other excellent frameworks in vision, speech, natural language processing, and bioinformatics, such as Torch, Caffe, Theano, and Deeplearning4j.

Below, the editor has compiled some of Duan Shishi's blog posts that analyze neural network models and algorithms in depth, to help you understand the power of TensorFlow as an open source deep learning framework.

Deep understanding of Neural Style


This article uses TensorFlow and CNNs to do Neural Style work on artistic photos. First the author explains in detail how the paper A Neural Algorithm of Artistic Style works, and then walks through an open-source TensorFlow Neural Style implementation to appreciate the expert's handiwork.

A Neural Algorithm of Artistic Style

In art, and painting especially, artists create different content and styles and blend them into an independent visual experience. Given two images, today's technology is fully capable of letting a computer identify their specific content. Style, however, is highly abstract: to a computer it is of course just pixels, yet the human eye can effectively distinguish different painters' styles, so there must be more complex features that constitute it. When I first studied the deep learning papers, the essence of a multi-layer network was precisely to find more complex, more intrinsic features, so in theory an image's style can also be extracted through a multi-layer network. This article uses a convolutional neural network (the pretrained VGG model) to reconstruct content and style separately, and during synthesis minimizes both content loss and style loss (the implementation actually adds a denoising total-variation loss as well), so the synthesized image reconstructs content and style more accurately.


Here is the workflow of the entire neural style paper. Understanding this figure is crucial to understanding the paper's logic. It is mainly divided into two parts:

  • Content Reconstruction: the lower part of the figure above, corresponding to layers a, b, c, d, and e of the CNN. Note that the part labeled Content Representations is not the original image but the activations produced by the pretrained VGG network model (think of it as the image as a classifier sees it; if you visualized it, you might not recognize the content). The VGG model was trained mainly for object recognition, and here it is used to generate content representations of the image. Once you understand this, the rest is easier. After content is reconstructed through the five convolutional layers, the paper's authors found experimentally that the first three layers reconstruct content better; layers d and e lose some detailed information and retain relatively high-level information.

  • Style Reconstruction: style reconstruction is more complicated, and style is hard to model. The Style Representation is generated much like the Content Representation, again via the VGG model, but layers a through e are treated differently: the style reconstruction is computed on growing subsets of CNN layers, namely conv1_1 (a); [conv1_1, conv2_1] (b); [conv1_1, conv2_1, conv3_1] (c); [conv1_1, conv2_1, conv3_1, conv4_1] (d); and [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1] (e). The reconstructed style then matches the image's own style at multiple scales while ignoring the global arrangement of the scene.

Methods

With those two points understood, what remains is how to model the losses. Content loss and style loss are computed separately. The content loss is relatively simple:
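In the paper's notation:

    L_content(p, x, l) = 1/2 · Σ_{i,j} (F^l_{ij} − P^l_{ij})²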

where F^l is the feature representation of the generated image at layer l, P^l is the feature representation of the original image at layer l, and the squared-error loss is defined between the two feature representations.

The style loss is computed much like the content loss, except that it sums the errors contributed by several layers.
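In the paper's notation, with the Gram matrix G^l, the per-layer error E_l, and layer weights w_l:

    G^l_{ij} = Σ_k F^l_{ik} F^l_{jk}
    E_l = 1 / (4 N_l² M_l²) · Σ_{i,j} (G^l_{ij} − A^l_{ij})²
    L_style(a, x) = Σ_l w_l E_l

where N_l is the number of feature maps at layer l and M_l is the size (height × width) of each map.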

where A^l is the style (Gram) representation of the original style image at layer l, and G^l is that of the generated image at layer l.

With the losses defined, an optimization method is used to minimize the total model loss (note that the paper itself has only the content loss and the style loss; the open-source code also adds a denoising total-variation loss).

I won't dwell on the optimization method here; TensorFlow has built-in optimizers such as Adam to handle this.
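For concreteness, here is a minimal TF 1.x-style sketch of the loss terms (the helper names are mine, not the open-source version's; the activations F, P, A are assumed to come from a pretrained VGG):

    import tensorflow as tf

    def gram_matrix(features):
        # Gram matrix of one layer's activations: [H, W, C] -> [C, C]
        channels = tf.shape(features)[-1]
        f = tf.reshape(features, [-1, channels])      # (H*W) x C
        return tf.matmul(f, f, transpose_a=True)

    def content_loss(F, P):
        # squared-error loss between generated (F) and content (P) activations
        return 0.5 * tf.reduce_sum(tf.square(F - P))

    def style_layer_loss(F, A):
        # Gram-matrix loss for one layer; F from the generated image, A from the style image
        shape = tf.shape(F)
        M = tf.cast(shape[0] * shape[1], tf.float32)  # feature map size
        N = tf.cast(shape[2], tf.float32)             # number of feature maps
        G = gram_matrix(F)
        return tf.reduce_sum(tf.square(G - gram_matrix(A))) / (4.0 * N ** 2 * M ** 2)

    # total = alpha * content + beta * (weighted sum of style_layer_loss over the style
    # layers), plus an optional total-variation term; minimized with e.g.
    # tf.train.AdamOptimizer over the generated image variable.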

Deep understanding of AlexNet


I had read some TensorFlow documentation and a few interesting projects before, and found the field quite deep; I need to spend more time understanding it from the ground up, especially the CV part, which particularly interests me. Over the next while I will study the models that achieved strong results in the ImageNet competition: AlexNet, GoogLeNet, VGG (yes, the pretrained model used in the Neural Style article above), and deep residual networks.

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks describes the model that Hinton and his student Alex Krizhevsky used in the 2012 ImageNet challenge, which set a new record for image classification. Since then, deep learning has surpassed the state of the art in vision again and again, even to the point of beating humans. Reading the paper, I found many optimization techniques I had seen only sporadically before, many of which I had not deeply understood. The paper explains how their AlexNet achieved such good results. Without further ado, let's start reading:

This picture shows the basic network structure of AlexNet in Caffe, but it is rather abstract, so the author used Caffe's draw_net tool to draw AlexNet's structure:

The basic structure of AlexNet

AlexNet has 8 layers in total: the first 5 are convolutional and the last 3 are fully connected. The paper notes that removing any one of the convolutional layers makes the results markedly worse. Each layer's composition is described in detail below:

  • The first convolutional layer takes a 227×227×3 input image (the paper says 224×224×3, which appears to be slightly off) and applies 96 kernels of shape 11×11×3 with a stride of 4 pixels, producing 55×55 outputs. Response normalization (actually Local Response Normalization, discussed later) and pooling follow. The pooling layer seems to differ between the Caffe AlexNet and the paper. AlexNet was trained on two GPUs, which is why the first convolutional layer in the figure above has two slabs. With pool_size=(3,3) and a stride of 2 pixels, we get 96 feature maps of 27×27.
  • The second convolutional layer uses 256 kernels (again split across two GPUs, 128 kernels of 5×5×48 each), applies padding of (2,2), and moves in steps of 1 pixel (thanks to a reader for pointing this out), producing 27×27 outputs. LRN is applied, then pooling with a 3×3 window and a stride of 2 pixels, giving 256 feature maps of 13×13.
  • The third and fourth layers have no LRN or pooling, and the fifth layer has only pooling. The third layer uses 384 kernels (pad_size=(1,1) giving 15×15, kernel size 3×3, stride 1 pixel, resulting in 384×13×13); the fourth layer uses 384 kernels (pad_size=(1,1) giving 15×15, kernel size 3×3, stride 1 pixel, resulting in 384×13×13); the fifth layer uses 256 kernels (pad_size=(1,1) giving 15×15, kernel_size=(3,3) giving 256×13×13, then pool_size=(3,3) with a stride of 2 pixels, resulting in 256×6×6).
  • Fully connected layers: the first two have 4096 neurons each, and the final softmax output is 1000-way (ImageNet). Note that in the Caffe figure the fully connected layers include ReLU, dropout, and InnerProduct.

The paper also points out that this figure was drawn for the two-GPU setup, so there may be some differences from the AlexNet in Caffe, but that is not the point. In practice you can refer directly to the AlexNet network definition in Caffe; every layer is spelled out in detail, and the basic structure matches the description above.
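For reference, here is a minimal single-GPU sketch of the structure above in tflearn (the same library used for the VGG code later in this article); it mirrors the layer shapes rather than the two-GPU split:

    import tflearn
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.conv import conv_2d, max_pool_2d
    from tflearn.layers.normalization import local_response_normalization

    network = input_data(shape=[None, 227, 227, 3])
    network = conv_2d(network, 96, 11, strides=4, padding='valid',
                      activation='relu')                        # conv1: 55x55x96
    network = local_response_normalization(network)
    network = max_pool_2d(network, 3, strides=2, padding='valid')  # 27x27x96
    network = conv_2d(network, 256, 5, activation='relu')          # conv2: 27x27x256
    network = local_response_normalization(network)
    network = max_pool_2d(network, 3, strides=2, padding='valid')  # 13x13x256
    network = conv_2d(network, 384, 3, activation='relu')          # conv3
    network = conv_2d(network, 384, 3, activation='relu')          # conv4
    network = conv_2d(network, 256, 3, activation='relu')          # conv5
    network = max_pool_2d(network, 3, strides=2, padding='valid')  # 6x6x256
    network = fully_connected(network, 4096, activation='relu')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='relu')
    network = dropout(network, 0.5)
    network = fully_connected(network, 1000, activation='softmax') # ImageNet classes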

Why AlexNet achieved better results

The basic network structure of AlexNet was covered above. You may still have questions about some of the pieces, such as LRN, ReLU, and dropout. Anyone who has touched DL has heard of or studied these; here I explain in detail, following the paper, why they improve the final network's performance.

ReLU Nonlinearity

Generally speaking, newcomers to neural networks who have not yet studied deep learning in depth are unfamiliar with this. They usually know the other two activation functions (what activation functions actually do is introduce nonlinearity so that neural networks can effectively fit nonlinear functions): tanh(x) and the sigmoid (1+e^(−x))^(−1). ReLU (Rectified Linear Units) is f(x)=max(0,x). A deep convolutional network built on ReLU trains several times faster than the same network built on tanh. The figure below shows the number of iterations needed for a four-layer convolutional network on CIFAR-10 to reach 25% training error with tanh versus ReLU:

The solid line and the dashed line show the training error of ReLU and tanh respectively; ReLU clearly converges much faster than tanh.

Local Response Normalization

After applying ReLU f(x)=max(0,x), you will notice that the activation values have no bounded range, unlike tanh or sigmoid, so a normalization is generally applied after ReLU. LRN is a method the paper proposes (the author is not sure whether "proposes" is accurate here). Neuroscience has a concept called "lateral inhibition", which describes how active neurons suppress their neighbors.
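For reference, in the paper's notation the normalized response is b^i_{x,y} = a^i_{x,y} / (k + α Σ_j (a^j_{x,y})²)^β, where the sum runs over n adjacent kernel maps and the paper uses k=2, n=5, α=10⁻⁴, β=0.75. TensorFlow ships this as a built-in op; a minimal sketch (depth_radius corresponds to n/2 and bias to k):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 27, 27, 96])   # e.g. conv1 output
    y = tf.nn.local_response_normalization(
        x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)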

Dropout

Dropout is also a frequently mentioned concept. It effectively prevents overfitting in neural networks. Whereas ordinary linear models use regularization to prevent overfitting, dropout in neural networks works by modifying the network's own structure: for a given layer, neurons are randomly dropped with a defined probability while the input and output layers stay unchanged, and then the parameters are updated by the usual learning procedure; in the next iteration a different random set of neurons is dropped, and so on until training completes.
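In TensorFlow (1.x) this is a one-liner; kept units are rescaled by 1/keep_prob so expected activations match at test time:

    import tensorflow as tf

    h = tf.placeholder(tf.float32, [None, 4096])   # e.g. an FC layer's output
    keep_prob = tf.placeholder(tf.float32)         # 0.5 when training, 1.0 at test
    h_drop = tf.nn.dropout(h, keep_prob)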

Data Augmentation

Actually, the simplest way to boost model performance and prevent overfitting is to add data, but there are strategies for doing so. The paper randomly crops 227×227 patches out of the 256×256 images (224×224 in the paper) and also augments the dataset with PCA-based color perturbation. This effectively enlarges the dataset. In practice there are many more options depending on your scenario, such as basic image transformations like raising or lowering brightness, or various filtering operations. It is an especially effective approach when the dataset is not big enough.
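A sketch of some of these augmentations using TensorFlow's built-in image ops (the paper's PCA color perturbation has no built-in op and is omitted here):

    import tensorflow as tf

    image = tf.placeholder(tf.float32, [256, 256, 3])
    patch = tf.random_crop(image, [227, 227, 3])              # random crop
    patch = tf.image.random_flip_left_right(patch)            # horizontal mirroring
    patch = tf.image.random_brightness(patch, max_delta=0.2)  # brightness jitter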

Deep understanding of GoogLeNet


GoogLeNet was the ILSVRC 2014 champion; its name mainly pays tribute to the classic LeNet-5, and it was built chiefly by a Google team — see the paper Going Deeper with Convolutions. Related work includes LeNet-5, Gabor filters, and Network-in-Network. Network-in-Network improved the traditional CNN and easily beat the AlexNet network with a small number of parameters; the final Network-in-Network Caffe model is only about 29 MB. GoogLeNet borrows ideas from Network-in-Network, described in detail below.

1) Network-in-Network

The left side of the figure shows a CNN's linear convolutional layer. Generally speaking, a linear convolutional layer extracts linearly separable features, but when the underlying features are highly nonlinear, many more filters are needed to capture all the potential variants. That creates a problem: too many filters mean too many network parameters and an overly complex network, putting too much pressure on computation.

The article mainly makes some improvements in two ways:

1. Improving the convolutional layer: MLPconv performs a more complex computation than a traditional convolutional layer in each local patch, as shown on the right of the figure above, to improve each layer's ability to recognize complex features. A rough analogy: in a traditional CNN, each convolutional layer is like a single-task worker, so you must add a large number of filters to handle any particular type of task; each MLPconv layer is more capable, can handle several different types of tasks, and needs far fewer filters to do it.
2. Global average pooling replaces the final fully connected layer of a traditional CNN, whose parameters are otherwise excessive; full connection also hurts the network's generalization ability (AlexNet uses dropout to improve generalization).

Finally, the author designed a 4-layer Network-in-Network plus a global average pooling layer to tackle the ImageNet classification problem.

    class NiN(Network):
        def setup(self):
            (self.feed('data')
                 .conv(11, 11, 96, 4, 4, padding='VALID', name='conv1')
                 .conv(1, 1, 96, 1, 1, name='cccp1')
                 .conv(1, 1, 96, 1, 1, name='cccp2')
                 .max_pool(3, 3, 2, 2, name='pool1')
                 .conv(5, 5, 256, 1, 1, name='conv2')
                 .conv(1, 1, 256, 1, 1, name='cccp3')
                 .conv(1, 1, 256, 1, 1, name='cccp4')
                 .max_pool(3, 3, 2, 2, padding='VALID', name='pool2')
                 .conv(3, 3, 384, 1, 1, name='conv3')
                 .conv(1, 1, 384, 1, 1, name='cccp5')
                 .conv(1, 1, 384, 1, 1, name='cccp6')
                 .max_pool(3, 3, 2, 2, padding='VALID', name='pool3')
                 .conv(3, 3, 1024, 1, 1, name='conv4-1024')
                 .conv(1, 1, 1024, 1, 1, name='cccp7-1024')
                 .conv(1, 1, 1000, 1, 1, name='cccp8-1024')
                 .avg_pool(6, 6, 1, 1, padding='VALID', name='pool4')
                 .softmax(name='prob'))

The basic network structure is as above; the code can be found at https://github.com/ethereon/caffe-tensorflow. Because of the author's recent job change, there is no machine available to run this, so the network structure diagram will be added later. Note that the middle layers cccp1 and cccp2 (cascaded cross channel parametric pooling) are equivalent to convolutional layers with 1×1 kernels. The Caffe implementation of NIN is omitted here; see the original article.

The introduction of NIN can be seen as deepening the network: by increasing the representational power of each NIN unit and replacing the original fully connected layer with an average-pooling layer, the number of filters required and the model's parameters are greatly reduced. The paper's experiments show performance on par with AlexNet, with a final model size of only 29 MB.

After understanding NIN, you will no longer feel confused when looking at GoogLeNet.

Pain points:

  • The larger the CNN network, the larger the model parameters, the more computing power required, and the more complex the model, the more likely it is to overfit.
  • In CNN, the increase in the number of network layers is accompanied by an increase in the required computing resources;
  • Sparse networks are acceptable, but sparse data structures are usually computationally inefficient.

Inception module

The Inception module was proposed mainly on the observation that convolution kernels of several different sizes can capture the information of different-sized clusters in an image. For computational convenience the paper uses 1×1, 3×3, and 5×5 kernels, plus a 3×3 max-pooling branch. But there is a big hidden cost: each Inception module's output channel count is the sum of all its branches' filter counts, so after several layers the channel count blows up, and the naive Inception becomes heavily dependent on computing resources. Recall the Network-in-Network idea above: 1×1 convolutions can effectively reduce dimensionality (expressing as much information as possible with less), so the paper proposes the "Inception module with dimension reduction", minimizing the number of filters, and hence the model's complexity, without losing representational power.
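Here is a sketch of such a module in tflearn (the helper name and argument names are mine; the example filter counts 64/96/128/16/32/32 are those of inception(3a) in the paper):

    import tflearn
    from tflearn.layers.conv import conv_2d, max_pool_2d
    from tflearn.layers.merge_ops import merge

    def inception_module(net, n1x1, n3x3r, n3x3, n5x5r, n5x5, pool_proj):
        b1 = conv_2d(net, n1x1, 1, activation='relu')          # 1x1 branch
        b3 = conv_2d(net, n3x3r, 1, activation='relu')         # 1x1 reduction...
        b3 = conv_2d(b3, n3x3, 3, activation='relu')           # ...then 3x3
        b5 = conv_2d(net, n5x5r, 1, activation='relu')         # 1x1 reduction...
        b5 = conv_2d(b5, n5x5, 5, activation='relu')           # ...then 5x5
        bp = max_pool_2d(net, 3, strides=1)                    # 3x3 max pool...
        bp = conv_2d(bp, pool_proj, 1, activation='relu')      # ...then 1x1 projection
        return merge([b1, b3, b5, bp], mode='concat', axis=3)  # concat along channels

    # e.g. inception (3a): net = inception_module(net, 64, 96, 128, 16, 32, 32)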

Overall structure of GoogLeNet

Basic code for building GoogLeNet in TensorFlow is also at https://github.com/ethereon/caffe-tensorflow (shown in the original article if you don't feel like digging). The author wraps a few basic operations there, and once you understand the network structure it is easy to construct GoogLeNet. After settling in at the new company, the author will try writing GoogLeNet on top of tflearn.

GoogLeNet on TensorFlow

For convenience, the author rewrote GoogLeNet using tflearn. The only difference from the Caffe model is the position of some padding: translating the pad values from the Caffe prototxt was troublesome, and the concat in the Inception parts must stay shape-consistent, so padding was set to 'same' throughout. The specific code is omitted here; see the original article.

If you are interested, compare it against the Caffe model prototxt and help check for problems. The author has submitted the code to the official tflearn repository: add GoogLeNet (Inception) in Examples. If you have TensorFlow, just install tflearn and help check whether it is correct. With no GPU machine at hand it runs slowly. The TensorBoard graph is shown below; it is not as clean as the earlier AlexNet one, mainly because it was not run for as many epochs. (While writing this, the host ran out of disk space, embarrassingly, so restore was rewritten to resume the run. The TensorBoard graph also seems a little off, differing each time it is loaded, but judging from the logs the model is gradually converging; the graph is included here for reference.)

Network structure: there is a bug here, possibly in TensorBoard itself; the GoogLeNet graph may simply be too large (about 1.3 MB) and would not load in Chrome, though Firefox seemed to manage it:

Deep understanding of VGG / Residual Network


Having just joined a new company and started doing deep learning and TensorFlow at work, I have been very busy. I read the VGG and deep residual papers a while ago but had no time to write them up; today I plan to reread these two related papers carefully.

VGGnet

VGGNet is the work of the Visual Geometry Group at Oxford for ILSVRC 2014. Its main contribution is demonstrating that increasing network depth can, to a certain extent, improve a network's final performance. As the figure below shows, the article improves performance by progressively deepening the network. It may look a little brute-force and light on tricks, but it works, and many pretrained pipelines use VGG models (mainly VGG-16 and VGG-19). Compared with other architectures VGG has a large parameter space: the final model is over 500 MB, while AlexNet is only about 200 MB and GoogLeNet even less, so training a VGG model usually takes longer. Fortunately there are public pretrained models we can use very conveniently, such as the pretrained model used in the earlier Neural Style article:

The figure shows that from configuration A through E, the number of convolutional layers in each convolutional group increases. D and E are the familiar VGG-16 and VGG-19 models. In configuration C, the author explains that the 1×1 convolutions are introduced as a linear transformation (the channel count is unchanged, so no dimensionality reduction is performed). In the final analysis, C does improve somewhat over B, but not as much as D. VGG's main advantages are:

  • Fewer parameters: for a stack of three 3×3 convolutions (as in the paper) versus one 7×7, you get three ReLU nonlinearities instead of one, and the parameter count is 3 × (3²C²) = 27C² versus 49C² for the 7×7, i.e. roughly 55% of the 7×7 parameters.
  • LRN is dropped, reducing memory consumption and computing time.

VGG-16 tflearn implementation

The official tflearn GitHub provides a VGG-16 implementation based on tflearn:

    from __future__ import division, print_function, absolute_import

    import tflearn
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.conv import conv_2d, max_pool_2d
    from tflearn.layers.estimator import regression

    # Data loading and preprocessing
    import tflearn.datasets.oxflower17 as oxflower17
    X, Y = oxflower17.load_data(one_hot=True)

    # Building 'VGG Network'
    network = input_data(shape=[None, 224, 224, 3])

    network = conv_2d(network, 64, 3, activation='relu')
    network = conv_2d(network, 64, 3, activation='relu')
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 128, 3, activation='relu')
    network = conv_2d(network, 128, 3, activation='relu')
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 256, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = conv_2d(network, 256, 3, activation='relu')
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 512, 3, activation='relu')
    network = conv_2d(network, 512, 3, activation='relu')
    network = conv_2d(network, 512, 3, activation='relu')
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 512, 3, activation='relu')
    network = conv_2d(network, 512, 3, activation='relu')
    network = conv_2d(network, 512, 3, activation='relu')
    network = max_pool_2d(network, 2, strides=2)

    network = fully_connected(network, 4096, activation='relu')
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation='relu')
    network = dropout(network, 0.5)
    network = fully_connected(network, 17, activation='softmax')

    network = regression(network, optimizer='rmsprop',
                         loss='categorical_crossentropy',
                         learning_rate=0.001)

    # Training
    model = tflearn.DNN(network, checkpoint_path='model_vgg',
                        max_checkpoints=1, tensorboard_verbose=0)
    model.fit(X, Y, n_epoch=500, shuffle=True,
              show_metric=True, batch_size=32, snapshot_step=500,
              snapshot_epoch=False, run_id='vgg_oxflowers17')

The VGG-16 graph is as follows:

Regarding VGG, the author personally feels it does not have many highlights; the pretrained models are very handy to use, but it is not as eye-catching as GoogLeNet.

Deep Residual Network

Generally speaking, the deeper the network, the harder it is to train. Deep Residual Learning for Image Recognition proposes a residual learning framework that greatly eases training of deep networks, allowing models to go much deeper (152 layers, with even 1000+ attempted) in an acceptable amount of time. The method took first place at ILSVRC 2015.

As the depth of the model increases, the following problems arise:

  • Vanishing/exploding gradient makes it very difficult for training to converge. This problem can be solved by normalized initialization and intermediate normalization layers.
  • If we simply stack more layers onto a suitably deep model, accuracy degrades rapidly (and not because of overfitting): both training error and test error become very high. The paper notes this phenomenon on CIFAR-10 and ImageNet.

In order to solve the performance degradation problem caused by increasing depth, the author proposes the following structure for residual learning:

Assume the desired underlying mapping is H(x), and let the stacked nonlinear layers fit F(x) := H(x) − x instead; optimizing the residual is easier than optimizing H(x) directly. F(x) + x is then realized simply via "shortcut connections".
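A minimal sketch of one such residual block in TensorFlow (my own illustration, not the paper's code; batch normalization, which the paper applies after each convolution, is omitted for brevity, and the input is assumed to already have `channels` feature maps so that F(x) and x have matching shapes):

    import tensorflow as tf

    def residual_block(x, channels):
        # F(x): two stacked 3x3 convolutions
        F = tf.layers.conv2d(x, channels, 3, padding='same', activation=tf.nn.relu)
        F = tf.layers.conv2d(F, channels, 3, padding='same', activation=None)
        # shortcut connection: element-wise addition of the identity
        return tf.nn.relu(F + x)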

The main improvement of this article is to add residual learning to the traditional convolutional model and find approximately optimal identity mappings through residual optimization.

A network structure in the paper:

The tflearn implementation of the deep residual network is covered in detail in the original article.

Understanding Fast Neural Style


The previous articles covered the models commonly used in computer vision. Over the next while, the author will spend time studying TensorFlow applications in computer vision, mainly by analyzing the relevant papers and source code. Today we take a closer look at fast neural style. An earlier article analyzed neural style, the origin of this line of work, but that method cannot be used in real applications. Why? It requires specifying the content image and style image each time and then minimizing content loss and style loss to generate an image; it takes a lot of time, and there is no way to save a model for a given style, so generating each image amounts to training a whole model. With fast neural style, a trained model for a given style can be saved and then applied to transform any content image. The paper also mentions another application of image transformation: super-resolution, which uses deep learning to turn low-resolution images into high-resolution ones and is now used at many large Internet companies, especially video sites.

Paper Principle

A few months ago I read the Neural Style article, TensorFlow: In-depth Understanding of Neural Style. A Neural Algorithm of Artistic Style builds a multi-layer convolutional network and generates an image combining content and style by minimizing the defined content loss and style loss, which is very interesting. Perceptual Losses for Real-Time Style Transfer and Super-Resolution replaces the per-pixel loss with a perceptual loss, uses a pretrained VGG model to simplify the original loss computation, and adds a transform network that directly generates the stylized version of a content image. How is this achieved? See the figure below:

The whole system consists of two parts: an image transformation network and a loss network. The image transformation network is a deep residual convolutional network that directly transforms the input (content) image into a stylized image. The loss network's parameters are fixed; its structure matches the network in A Neural Algorithm of Artistic Style, but its parameters are never updated — it is used only to compute the content loss and style loss. That is the so-called perceptual loss. The author's reasoning: a convolutional model pretrained for image classification has already learned perceptual and semantic information (scene and semantic information) well, so the loss network at the back exists only to measure content and style loss; unlike A Neural Algorithm of Artistic Style, what gets updated are the parameters of the transform network in front. Viewed end to end: the input image passes through the transform network to produce the converted image, the corresponding losses are computed, and minimizing those losses updates the transform network. Simple, isn't it?

The loss computation is also very similar to the earlier paper's. Content loss:
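In the notation of the Johnson et al. paper, with φ_j the activations of the loss network's j-th layer (of shape C_j × H_j × W_j):

    ℓ_feat^{φ,j}(ŷ, y) = (1 / (C_j H_j W_j)) · ||φ_j(ŷ) − φ_j(y)||₂²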

style loss:
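Per the paper, with G_j^φ the Gram matrix defined below:

    ℓ_style^{φ,j}(ŷ, y) = ||G_j^φ(ŷ) − G_j^φ(y)||_F²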

Gram matrix in style loss:
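    G_j^φ(x)_{c,c'} = (1 / (C_j H_j W_j)) · Σ_{h=1}^{H_j} Σ_{w=1}^{W_j} φ_j(x)_{h,w,c} · φ_j(x)_{h,w,c'}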

The Gram matrix is a very important piece: since it is always C_j × C_j regardless of the feature maps' spatial size, it lets ŷ and y be compared even when their shapes differ. For the precise description, see that part of the paper; the author won't belabor it here, and readers should get it at a glance.
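A small sketch of computing Gram matrices for a batch of feature maps in TensorFlow (my own helper, normalized by C·H·W as in the paper):

    import tensorflow as tf

    def batched_gram(features):
        # features: [batch, H, W, C] -> Gram matrices [batch, C, C]
        shape = tf.shape(features)
        h, w, c = shape[1], shape[2], shape[3]
        f = tf.reshape(features, [shape[0], h * w, c])
        denom = tf.cast(c * h * w, tf.float32)
        return tf.matmul(f, f, transpose_a=True) / denom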

I believe that by this point you basically understand how fast neural style works. To summarize:

  • The transform network is a deep residual network that turns the input image into a stylized image; its parameters are updated during training.
  • The loss network's structure resembles the earlier paper's; it only computes the content loss and style loss, and its parameters are not updated.
  • The Gram matrix makes it convenient to compute the loss even when the transformed image's feature shape differs from that of the target passed through the loss network.

Note: The technical content of this article is authorized by deep learning engineer Duan Shishi. For a better reading experience the content has been lightly edited and integrated, and some hands-on material has been trimmed. To learn more about deep learning practice, visit Xiao Shishi's Code Crazy Camp.
