What is the principle of maxpool in CNN?

First, let’s talk about Max pooling in detail.

Max pooling

A pooling operation usually follows convolution. Although there are other variants, such as average pooling, only max pooling is discussed here.

The operation of max pooling is shown in the figure below: the image is divided into several non-overlapping blocks of the same size (the pooling size). Within each block, only the largest value is kept; the other values are discarded, and the blocks' original spatial arrangement is preserved in the output.
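
To make the operation concrete, here is a minimal NumPy sketch of non-overlapping max pooling on a single channel (the array values are made up for illustration, not taken from the figure):

```python
import numpy as np

def max_pool_2d(feature_map, pool_size=2):
    """Non-overlapping max pooling over a single-channel feature map.
    Assumes height and width are divisible by pool_size."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // pool_size, pool_size,
                                 w // pool_size, pool_size)
    # Keep only the largest value inside each pool_size x pool_size block.
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [0, 2, 1, 4],
               [5, 1, 0, 1],
               [2, 0, 3, 2]])
print(max_pool_2d(fm))  # [[3 4]
                        #  [5 3]]
```

One value survives per block, so a 4x4 map pooled with a 2x2 window becomes a 2x2 output.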

Max pooling is performed independently at each depth (channel) and introduces no learnable parameters. So the question is: what is the purpose of max pooling? Does discarding some of the information really have no impact?

The main function of max pooling is downsampling, yet it does not harm the recognition result. This implies that the feature map produced by convolution contains information that is redundant for recognizing the object. So let us think about how this "redundant" information arises.

Intuitively, to detect whether a certain shape exists, we slide a filter over the entire image step by step. However, only the convolution output from the region where that shape actually appears is truly useful; the values obtained by convolving the filter with other regions contribute little to deciding whether the shape exists. For example, in the figure below, we again consider detecting the shape of a "horizontal fold". In the 3x3 feature map obtained after convolution, only the node with the value 3 is truly useful; the other values are irrelevant to this task. Therefore, applying 3x3 max pooling has no effect on detecting the "horizontal fold". Now imagine that max pooling were not used in this example and the network were left to learn on its own: it would end up learning weights that approximate the effect of max pooling. Since the effect is only approximate, paying the cost of extra parameters is worse than simply applying max pooling directly.

Max pooling also behaves somewhat like a "selection statement". If there are two nodes and, under certain inputs, the first node has the largest value, then the network routes information only through that node; other inputs may make the second node the largest, in which case the network switches to that node's branch.

But max pooling also has drawbacks. Not every case is as extreme as the example above: surrounding information can also affect the judgment of whether a concept is present, yet max pooling applies the same operation to every feature map. It is like fishing with a net of a single mesh size: some fish will inevitably slip through.

The following is a brief summary of other pooling methods (a collection of pooling methods I personally consider good and popular, which I organized some time ago).

SUM pooling

The middle-layer feature representation based on SUM pooling sums all the pixel values of the feature map of each channel of an intermediate layer (for example, pool5 of VGGNet-16 has 512 channels), so that each channel yields a single real value; N channels then yield a vector of length N, which is the SUM pooling result.

AVE pooling

AVE pooling is average pooling, which is essentially the same as SUM pooling except that the summed pixel values are divided by the size of the feature map. The author believes that AVE pooling brings a degree of smoothing and reduces the interference caused by changes in image size. Imagine a 224×224 image resized to 448×448, with features extracted from the two images using SUM pooling and AVE pooling respectively. Our guess is that the cosine similarity computed with SUM pooling would be smaller than that computed with AVE pooling, that is, AVE pooling should be slightly better than SUM pooling.

MAX pooling

MAX pooling means that for each channel (assuming there are N channels), the maximum pixel value of the feature map of the channel is selected as the representative of the channel, thereby obtaining an N-dimensional vector representation. The author uses MAX pooling in flask-keras-cnn-image-retrieval.

In the experiments I have done, MAX pooling is slightly better than SUM pooling and AVE pooling. However, the improvement of these three pooling methods for object retrieval is still limited.
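
As a side-by-side illustration, the three descriptors above can be written as follows. This is a minimal NumPy sketch that assumes the convolutional feature map is stored channel-first as an (N, H, W) array; the 512×7×7 shape below is just an example:

```python
import numpy as np

def global_pooling_descriptors(feature_maps):
    """SUM, AVE and MAX pooling descriptors from an (N, H, W) feature map,
    e.g. N = 512 channels for VGGNet-16 pool5. Each is an N-d vector."""
    n, h, w = feature_maps.shape
    flat = feature_maps.reshape(n, -1)
    sum_desc = flat.sum(axis=1)       # SUM pooling: total response per channel
    ave_desc = sum_desc / (h * w)     # AVE pooling: SUM divided by the map size
    max_desc = flat.max(axis=1)       # MAX pooling: strongest response per channel
    return sum_desc, ave_desc, max_desc

fm = np.random.rand(512, 7, 7)
s, a, m = global_pooling_descriptors(fm)
print(s.shape, a.shape, m.shape)  # (512,) (512,) (512,)
```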

MOP pooling

MOP pooling comes from the paper Multi-scale Orderless Pooling of Deep Convolutional Activation Features, whose first author is Yunchao Gong. When I was working on hashing, I read some of his papers, the most representative of which is ITQ; I also wrote a dedicated reading note on it: Iterative Quantization. The basic idea of MOP pooling is multi-scale pooling combined with VLAD (for the principle of VLAD, see the author's earlier blog post Image Retrieval: BoF, VLAD, FV Three Musketeers). The specific pooling steps are as follows:

Overview of multi-scale orderless pooling for CNN activations (MOP-CNN). Our proposed feature is a concatenation of the feature vectors from three levels: (a) Level 1, corresponding to the 4096-dimensional CNN activation for the entire 256×256 image; (b) Level 2, formed by extracting activations from 128×128 patches and VLAD pooling them with a codebook of 100 centers; (c) Level 3, formed in the same way as Level 2 but with 64×64 patches.

Specifically, at scale L=1 (the entire image), the image is directly resized to 256×256 and fed into the network, and the 4096-dimensional features of the seventh (fully connected) layer are taken. At L=2, a 128×128 window is slid over the image with a stride of 32; since the minimum input size of the network is 256×256, the author upsamples each patch to 256×256, which yields many local features. These are then VLAD-encoded with the number of cluster centers set to 100, after the 4096-dimensional activations have been reduced to 500 dimensions, giving a 50,000-dimensional feature; this feature is then further reduced to 4096 dimensions. The processing at L=3 is the same as at L=2, except that the window size is changed to 64×64.
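
As a rough sketch of the L=2 step only, the sliding-window patch extraction might look like the following. The CNN forward pass, the dimensionality reduction, and the VLAD encoding are deliberately left out, since they depend on the particular network and codebook; the steps named in the comments are the ones described above, not actual helper functions:

```python
import numpy as np

def extract_patches(image, patch_size=128, stride=32):
    """Slide a patch_size x patch_size window over the image with the given
    stride and return the patches (the local windows used at L=2)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

# Remaining (omitted) steps described in the text:
#   upsample each patch to 256x256 and run it through the network,
#   reduce the 4096-d activations to 500 dimensions,
#   VLAD-encode them with 100 centers (100 x 500 = 50,000 dims),
#   then reduce the result back to 4096 dims.

patches = extract_patches(np.zeros((384, 384, 3)), patch_size=128, stride=32)
print(len(patches))  # 81 patches for a 384x384 image
```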

The paper demonstrates through experiments that the features obtained by MOP pooling have a certain degree of invariance. I have not run specific experiments based on MOP pooling myself, so for experimental results one can only refer to the paper itself.

CROW pooling

For object retrieval, when using a CNN to extract features, what we want is to extract features from the region that contains the object, just as when extracting local features such as SIFT to build BoW, VLAD, or FV vectors, we can use MSER, saliency detection, and other means to restrict the SIFT features to the region containing the object. Based on this idea, when using a CNN for object retrieval, there are two ways to further refine the features: one is to perform object detection first and then extract CNN features within the detected object region; the other is to increase the weight of the object region and decrease the weight of non-object regions through some adaptive weighting method. CROW pooling (Cross-dimensional Weighting for Aggregated Deep Convolutional Features) takes the latter approach: by constructing spatial weights and channel weights, it can, to a certain extent, increase the weight of the region of interest and reduce the weight of non-object regions. The specific construction of the feature representation is shown in the figure below:

The core of the process is the spatial weight and the channel weight. To compute the spatial weight, the feature maps of all channels are summed directly at each location. This spatial weight can actually be understood as a saliency map: we know that after convolutional filtering, the strongly responding regions are generally object edges and the like, so after the channels are added together, the locations with non-zero and large responses are generally where the objects lie, and the map can therefore be used as a weight over the feature map.

The channel weight borrows the idea of the IDF weight: high-frequency words such as "the" occur very often but are of little use for conveying information, that is, they carry too little of it, so in the BoW model the weights of such stop words need to be reduced. Following the same reasoning for the channel weight, imagine a channel whose feature map has non-zero and relatively large values at every location; visually, the white region occupies the entire feature map. Such a channel is of little help in localizing the object region, so its weight should be reduced. Conversely, a channel in which the white region occupies only a small part of the feature map carries a lot of information for localizing the object, so its weight should be increased. This behaviour matches the idea of IDF particularly well, which is why the author uses an IDF-style weight to define the channel weight.
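
To make the two weights concrete, here is a simplified NumPy sketch of the idea, assuming a channel-first (N, H, W) feature map. The exact normalizations and the IDF formula in the CROW paper differ in detail, so treat this as an illustration of the principle rather than a faithful re-implementation:

```python
import numpy as np

def crow_weights(feature_maps, eps=1e-8):
    """Simplified spatial and channel weights for an (N, H, W) feature map."""
    n = feature_maps.shape[0]

    # Spatial weight: sum the channels at every location (a saliency-like map).
    spatial = feature_maps.sum(axis=0)
    spatial = spatial / (np.linalg.norm(spatial) + eps)

    # Channel weight: IDF-style score based on how much of the map is non-zero;
    # channels that respond almost everywhere get a weight close to zero.
    nonzero_ratio = (feature_maps > 0).reshape(n, -1).mean(axis=1)
    channel = np.log(1.0 / (nonzero_ratio + eps))

    return spatial, channel

def crow_descriptor(feature_maps):
    """Weight each location by the spatial map, sum-pool per channel,
    then rescale every channel by its channel weight."""
    spatial, channel = crow_weights(feature_maps)
    weighted = feature_maps * spatial[None, :, :]
    return weighted.sum(axis=(1, 2)) * channel
```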

In general, the design of the spatial weight and the channel weight is very clever, but such a pooling method can only fit the region of interest to a certain extent. Let us take a look at the heat map of Spatial Weight × Channel Weight:

From the figure above, we can see that the weight is concentrated mainly on the top of the tower, which can be regarded as the discriminative region. Of course, we can also see relatively large weights distributed over other regions of the image, which are regions we do not want. Judging from the author's visualizations of some other images, this CROW pooling method does not always succeed; for some images the weighted regions are not the main object in the image. However, judging from the results on a library of tens of millions of images, CROW pooling still achieves good results.

RMAC pooling

RMAC pooling comes from the paper Particular Object Retrieval with Integral Max-Pooling of CNN Activations, whose third author is Hervé Jégou (a good friend of Matthijs Douze). In this paper, the author proposes the RMAC pooling method. Its main idea is similar to that of the MOP pooling described above: it also slides windows of varying sizes, but the windows slide over the feature map rather than over the image (which greatly speeds up feature extraction). In addition, when merging the local features, MOP pooling merges them with VLAD, whereas RMAC pooling is simpler (simple does not mean ineffective): it directly sums the local features to obtain the final global feature. The specific sliding-window scheme is shown in the figure below:

The figure shows three window sizes, and the 'x' marks the centers of the windows. For the portion of the feature map inside each window, the paper applies MAX pooling. With L=3, that is, using the three window sizes shown in the figure, 20 local features are obtained. In addition, MAX pooling over the entire feature map gives one global feature, so for one image we obtain 21 local features (if the global feature is also counted as local). These 21 local features are summed directly to form the final global feature. In the paper, the author compares the effect of the number of sliding windows on mAP: from L=1 to L=3, mAP gradually improves, but at L=4 it no longer improves. In fact, the role of the windows designed in RMAC pooling is to localize the object (CROW pooling localizes the object through the weight map). As shown in the figure above, the windows overlap to some extent, and since the global feature is finally formed by summation, the overlapping regions can be regarded as receiving larger weights.
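
The following is a simplified sketch of this region scheme in NumPy, again assuming a channel-first (N, H, W) feature map. The exact region layout and the PCA-whitening step of the paper are omitted, so the region sizes and strides here are illustrative only:

```python
import numpy as np

def rmac_descriptor(feature_maps, levels=3, eps=1e-8):
    """Simplified R-MAC-style descriptor: MAX-pool overlapping square regions
    of the feature map at several scales, L2-normalize each regional vector,
    and sum them (plus the whole-map vector) into one global descriptor."""
    n, h, w = feature_maps.shape
    global_desc = feature_maps.max(axis=(1, 2))              # whole feature map
    global_desc = global_desc / (np.linalg.norm(global_desc) + eps)

    for level in range(1, levels + 1):
        side = max(1, round(2 * min(h, w) / (level + 1)))    # region size per level
        step = max(1, side // 2)                             # roughly 50% overlap
        for y in range(0, h - side + 1, step):
            for x in range(0, w - side + 1, step):
                region = feature_maps[:, y:y + side, x:x + side]
                vec = region.max(axis=(1, 2))                # per-channel MAX pooling
                vec = vec / (np.linalg.norm(vec) + eps)      # L2-normalize the region
                global_desc += vec                           # sum the regional vectors

    return global_desc / (np.linalg.norm(global_desc) + eps)
```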

Above, the 20 local features and the 1 global feature are merged by direct summation. Alternatively, we can sum the 20 local features first and then concatenate the result with the remaining global feature. In actual experiments, this concatenation scheme gives a 2%-3% improvement over the former. In tests on a library of one million images, RMAC pooling achieves good results, and the difference compared with CROW pooling is small.

The above summarizes six different pooling methods. Of course, there are many other pooling methods that are not covered here. In practical applications, the author recommends RMAC pooling and CROW pooling, mainly because these two methods work well and have relatively low computational complexity.

This article is reproduced from Leiphone.com. If you need to reprint it, please go to Leiphone.com official website to apply for authorization.
