What is the principle of maxpool in CNN?

First, let’s talk about Max pooling in detail.

Max pooling

A pooling operation usually follows convolution. Although there are other variants, such as average pooling, only max pooling is discussed here.

The operation of max pooling is shown in the figure below: the image is divided into several non-overlapping blocks of the same size (the pooling size). Within each block, only the largest value is kept; the other values are discarded, and the blocks' original spatial arrangement is preserved in the output.
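
To make the operation concrete, here is a minimal NumPy sketch of non-overlapping max pooling on a single channel (the array values are made up for illustration, not taken from the figure):

```python
import numpy as np

def max_pool_2d(feature_map, pool_size=2):
    """Non-overlapping max pooling over a single-channel feature map.
    Assumes height and width are divisible by pool_size."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // pool_size, pool_size,
                                 w // pool_size, pool_size)
    # Keep only the largest value inside each pool_size x pool_size block.
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [0, 2, 1, 4],
               [5, 1, 0, 1],
               [2, 0, 3, 2]])
print(max_pool_2d(fm))  # [[3 4]
                        #  [5 3]]
```

One value survives per block, so a 4x4 map pooled with a 2x2 window becomes a 2x2 output.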

Max pooling is performed independently at each depth (channel) and introduces no learnable parameters. So the question is: what is the purpose of max pooling? Does discarding some of the information really have no impact?

The main function of max pooling is downsampling, yet it does not harm the recognition result. This implies that the feature map produced by convolution contains information that is redundant for recognizing the object. So let us think about how this "redundant" information arises.

Intuitively, to detect whether a certain shape exists, we slide a filter over the entire image step by step. However, only the convolution output from the region where that shape actually appears is truly useful; the values obtained by convolving the filter with other regions contribute little to deciding whether the shape exists. For example, in the figure below, we again consider detecting the shape of a "horizontal fold". In the 3x3 feature map obtained after convolution, only the node with the value 3 is truly useful; the other values are irrelevant to this task. Therefore, applying 3x3 max pooling has no effect on detecting the "horizontal fold". Now imagine that max pooling were not used in this example and the network were left to learn on its own: it would end up learning weights that approximate the effect of max pooling. Since the effect is only approximate, paying the cost of extra parameters is worse than simply applying max pooling directly.

Max pooling also behaves somewhat like a "selection statement". If there are two nodes and, under certain inputs, the first node has the largest value, then the network routes information only through that node; other inputs may make the second node the largest, in which case the network switches to that node's branch.

But max pooling also has drawbacks. Not every case is as extreme as the example above: surrounding information can also affect the judgment of whether a concept is present, yet max pooling applies the same operation to every feature map. It is like fishing with a net of a single mesh size: some fish will inevitably slip through.

The following is a brief summary of other pooling methods (a collection of pooling methods I personally consider good and popular, which I organized some time ago).

SUM pooling

The middle-layer feature representation based on SUM pooling sums all the pixel values of the feature map of each channel of an intermediate layer (for example, pool5 of VGGNet-16 has 512 channels), so that each channel yields a single real value; N channels then yield a vector of length N, which is the SUM pooling result.

AVE pooling

AVE pooling is average pooling, which is essentially the same as SUM pooling except that the summed pixel values are divided by the size of the feature map. The author believes that AVE pooling brings a degree of smoothing and reduces the interference caused by changes in image size. Imagine a 224×224 image resized to 448×448, with features extracted from the two images using SUM pooling and AVE pooling respectively. Our guess is that the cosine similarity computed with SUM pooling would be smaller than that computed with AVE pooling, that is, AVE pooling should be slightly better than SUM pooling.

MAX pooling

MAX pooling means that for each channel (assuming there are N channels), the maximum pixel value of the feature map of the channel is selected as the representative of the channel, thereby obtaining an N-dimensional vector representation. The author uses MAX pooling in flask-keras-cnn-image-retrieval.

In the experiments I have done, MAX pooling is slightly better than SUM pooling and AVE pooling. However, the improvement of these three pooling methods for object retrieval is still limited.
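
As a side-by-side illustration, the three descriptors above can be written as follows. This is a minimal NumPy sketch that assumes the convolutional feature map is stored channel-first as an (N, H, W) array; the 512×7×7 shape below is just an example:

```python
import numpy as np

def global_pooling_descriptors(feature_maps):
    """SUM, AVE and MAX pooling descriptors from an (N, H, W) feature map,
    e.g. N = 512 channels for VGGNet-16 pool5. Each is an N-d vector."""
    n, h, w = feature_maps.shape
    flat = feature_maps.reshape(n, -1)
    sum_desc = flat.sum(axis=1)       # SUM pooling: total response per channel
    ave_desc = sum_desc / (h * w)     # AVE pooling: SUM divided by the map size
    max_desc = flat.max(axis=1)       # MAX pooling: strongest response per channel
    return sum_desc, ave_desc, max_desc

fm = np.random.rand(512, 7, 7)
s, a, m = global_pooling_descriptors(fm)
print(s.shape, a.shape, m.shape)  # (512,) (512,) (512,)
```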

MOP pooling

MOP pooling comes from the paper Multi-scale Orderless Pooling of Deep Convolutional Activation Features, whose first author is Yunchao Gong. When I was working on hashing, I read some of his papers, the most representative of which is ITQ; I also wrote a dedicated reading note on it: Iterative Quantization. The basic idea of MOP pooling is multi-scale pooling combined with VLAD (for the principle of VLAD, see the author's earlier blog post Image Retrieval: BoF, VLAD, FV Three Musketeers). The specific pooling steps are as follows:

Overview of multi-scale orderless pooling for CNN activations (MOP-CNN). Our proposed feature is a concatenation of the feature vectors from three levels: (a) Level 1, corresponding to the 4096-dimensional CNN activation for the entire 256×256 image; (b) Level 2, formed by extracting activations from 128×128 patches and VLAD pooling them with a codebook of 100 centers; (c) Level 3, formed in the same way as Level 2 but with 64×64 patches.

Specifically, at scale L=1 (the entire image), the image is directly resized to 256×256 and fed into the network, and the 4096-dimensional features of the seventh (fully connected) layer are taken. At L=2, a 128×128 window is slid over the image with a stride of 32; since the minimum input size of the network is 256×256, the author upsamples each patch to 256×256, which yields many local features. These are then VLAD-encoded with the number of cluster centers set to 100, after the 4096-dimensional activations have been reduced to 500 dimensions, giving a 50,000-dimensional feature; this feature is then further reduced to 4096 dimensions. The processing at L=3 is the same as at L=2, except that the window size is changed to 64×64.
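
As a rough sketch of the L=2 step only, the sliding-window patch extraction might look like the following. The CNN forward pass, the dimensionality reduction, and the VLAD encoding are deliberately left out, since they depend on the particular network and codebook; the steps named in the comments are the ones described above, not actual helper functions:

```python
import numpy as np

def extract_patches(image, patch_size=128, stride=32):
    """Slide a patch_size x patch_size window over the image with the given
    stride and return the patches (the local windows used at L=2)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

# Remaining (omitted) steps described in the text:
#   upsample each patch to 256x256 and run it through the network,
#   reduce the 4096-d activations to 500 dimensions,
#   VLAD-encode them with 100 centers (100 x 500 = 50,000 dims),
#   then reduce the result back to 4096 dims.

patches = extract_patches(np.zeros((384, 384, 3)), patch_size=128, stride=32)
print(len(patches))  # 81 patches for a 384x384 image
```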

The paper demonstrates through experiments that the features obtained by MOP pooling have a certain degree of invariance. I have not run specific experiments based on MOP pooling myself, so for experimental results one can only refer to the paper itself.

CROW pooling

For object retrieval, when using a CNN to extract features, what we want is to extract features from the region that contains the object, just as when extracting local features such as SIFT to build BoW, VLAD, or FV vectors, we can use MSER, saliency detection, and other means to restrict the SIFT features to the region containing the object. Based on this idea, when using a CNN for object retrieval, there are two ways to further refine the features: one is to perform object detection first and then extract CNN features within the detected object region; the other is to increase the weight of the object region and decrease the weight of non-object regions through some adaptive weighting method. CROW pooling (Cross-dimensional Weighting for Aggregated Deep Convolutional Features) takes the latter approach: by constructing spatial weights and channel weights, it can, to a certain extent, increase the weight of the region of interest and reduce the weight of non-object regions. The specific construction of the feature representation is shown in the figure below:

The core of the process is the spatial weight and the channel weight. To compute the spatial weight, the feature maps of all channels are summed directly at each location. This spatial weight can actually be understood as a saliency map: we know that after convolutional filtering, the strongly responding regions are generally object edges and the like, so after the channels are added together, the locations with non-zero and large responses are generally where the objects lie, and the map can therefore be used as a weight over the feature map.

The channel weight borrows the idea of the IDF weight: high-frequency words such as "the" occur very often but are of little use for conveying information, that is, they carry too little of it, so in the BoW model the weights of such stop words need to be reduced. Following the same reasoning for the channel weight, imagine a channel whose feature map has non-zero and relatively large values at every location; visually, the white region occupies the entire feature map. Such a channel is of little help in localizing the object region, so its weight should be reduced. Conversely, a channel in which the white region occupies only a small part of the feature map carries a lot of information for localizing the object, so its weight should be increased. This behaviour matches the idea of IDF particularly well, which is why the author uses an IDF-style weight to define the channel weight.
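
To make the two weights concrete, here is a simplified NumPy sketch of the idea, assuming a channel-first (N, H, W) feature map. The exact normalizations and the IDF formula in the CROW paper differ in detail, so treat this as an illustration of the principle rather than a faithful re-implementation:

```python
import numpy as np

def crow_weights(feature_maps, eps=1e-8):
    """Simplified spatial and channel weights for an (N, H, W) feature map."""
    n = feature_maps.shape[0]

    # Spatial weight: sum the channels at every location (a saliency-like map).
    spatial = feature_maps.sum(axis=0)
    spatial = spatial / (np.linalg.norm(spatial) + eps)

    # Channel weight: IDF-style score based on how much of the map is non-zero;
    # channels that respond almost everywhere get a weight close to zero.
    nonzero_ratio = (feature_maps > 0).reshape(n, -1).mean(axis=1)
    channel = np.log(1.0 / (nonzero_ratio + eps))

    return spatial, channel

def crow_descriptor(feature_maps):
    """Weight each location by the spatial map, sum-pool per channel,
    then rescale every channel by its channel weight."""
    spatial, channel = crow_weights(feature_maps)
    weighted = feature_maps * spatial[None, :, :]
    return weighted.sum(axis=(1, 2)) * channel
```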

In general, the design of the spatial weight and the channel weight is very clever, but such a pooling method can only fit the region of interest to a certain extent. Let us take a look at the heat map of Spatial Weight × Channel Weight:

From the figure above, we can see that the weight is concentrated mainly on the top of the tower, which can be regarded as the discriminative region. Of course, we can also see relatively large weights distributed over other regions of the image, which are regions we do not want. Judging from the author's visualizations of some other images, this CROW pooling method does not always succeed; for some images the weighted regions are not the main object in the image. However, judging from the results on a library of tens of millions of images, CROW pooling still achieves good results.

RMAC pooling

RMAC pooling comes from the paper Particular Object Retrieval with Integral Max-Pooling of CNN Activations, whose third author is Hervé Jégou (a good friend of Matthijs Douze). In this paper, the author proposes the RMAC pooling method. Its main idea is similar to that of the MOP pooling described above: it also slides windows of varying sizes, but the windows slide over the feature map rather than over the image (which greatly speeds up feature extraction). In addition, when merging the local features, MOP pooling merges them with VLAD, whereas RMAC pooling is simpler (simple does not mean ineffective): it directly sums the local features to obtain the final global feature. The specific sliding-window scheme is shown in the figure below:

The figure shows three window sizes, and the 'x' marks the centers of the windows. For the portion of the feature map inside each window, the paper applies MAX pooling. With L=3, that is, using the three window sizes shown in the figure, 20 local features are obtained. In addition, MAX pooling over the entire feature map gives one global feature, so for one image we obtain 21 local features (if the global feature is also counted as local). These 21 local features are summed directly to form the final global feature. In the paper, the author compares the effect of the number of sliding windows on mAP: from L=1 to L=3, mAP gradually improves, but at L=4 it no longer improves. In fact, the role of the windows designed in RMAC pooling is to localize the object (CROW pooling localizes the object through the weight map). As shown in the figure above, the windows overlap to some extent, and since the global feature is finally formed by summation, the overlapping regions can be regarded as receiving larger weights.
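
The following is a simplified sketch of this region scheme in NumPy, again assuming a channel-first (N, H, W) feature map. The exact region layout and the PCA-whitening step of the paper are omitted, so the region sizes and strides here are illustrative only:

```python
import numpy as np

def rmac_descriptor(feature_maps, levels=3, eps=1e-8):
    """Simplified R-MAC-style descriptor: MAX-pool overlapping square regions
    of the feature map at several scales, L2-normalize each regional vector,
    and sum them (plus the whole-map vector) into one global descriptor."""
    n, h, w = feature_maps.shape
    global_desc = feature_maps.max(axis=(1, 2))              # whole feature map
    global_desc = global_desc / (np.linalg.norm(global_desc) + eps)

    for level in range(1, levels + 1):
        side = max(1, round(2 * min(h, w) / (level + 1)))    # region size per level
        step = max(1, side // 2)                             # roughly 50% overlap
        for y in range(0, h - side + 1, step):
            for x in range(0, w - side + 1, step):
                region = feature_maps[:, y:y + side, x:x + side]
                vec = region.max(axis=(1, 2))                # per-channel MAX pooling
                vec = vec / (np.linalg.norm(vec) + eps)      # L2-normalize the region
                global_desc += vec                           # sum the regional vectors

    return global_desc / (np.linalg.norm(global_desc) + eps)
```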

Above, the 20 local features and the 1 global feature are merged by direct summation. Alternatively, we can sum the 20 local features first and then concatenate the result with the remaining global feature. In actual experiments, this concatenation scheme gives a 2%-3% improvement over the former. In tests on a library of one million images, RMAC pooling achieves good results, and the difference compared with CROW pooling is small.

The above summarizes six different pooling methods. Of course, there are many other pooling methods that are not covered here. In practical applications, the author recommends RMAC pooling and CROW pooling, mainly because these two methods work well and have relatively low computational complexity.

This article is reproduced from Leiphone.com. If you need to reprint it, please go to Leiphone.com official website to apply for authorization.
