1. Background

1.1 Boosting

Boosting [1] is a classic method for training ensemble models, and one of its specific implementations, GBDT, is widely used on a variety of problems. There are many articles introducing boosting, so I will not repeat them here. In short, boosting trains a series of weak classifiers one by one according to a specific criterion, and a weighted combination of these weak classifiers forms a strong classifier (Figure 1).

1.2 Residual Network

The residual network [2] is currently the most popular model for tasks such as image classification, and it has also been applied to fields such as speech recognition. The core of the residual network is the skip connection, or shortcut (Figure 2). This structure makes it easier for gradients to propagate backward, which in turn makes it possible to train much deeper networks.

In the earlier blog post "Residual Network as Ensemble Model", we saw that some researchers regard the residual network as a special kind of ensemble model [3,4]. One of the authors of the paper is Robert Schapire (I just noticed that he has joined Microsoft Research), the originator of AdaBoost (together with Yoav Freund). This ensemble view is basically one of the mainstream interpretations.

2. Training Methods

2.1 Framework
That is, this is a linear classifier (logistic regression), where $C$ is the number of categories in the classification task.
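As a concrete sketch of such a per-layer linear classifier: the feature dimension, class count, and random weights below are made-up stand-ins, with $W$ playing the role of the layer's classification matrix applied to a block's representation $o(x)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_classifier(o_x, W):
    """Class probabilities from a block's representation o(x):
    a plain linear (logistic-regression-style) classifier."""
    return softmax(o_x @ W.T)

C, d = 4, 8                        # C classes, d-dimensional representation (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(C, d))        # auxiliary weight matrix, one row per class
o_x = rng.normal(size=d)           # stand-in for a residual block's output

p = linear_classifier(o_x, W)
print(p.shape, p.sum())            # C probabilities summing to 1
```
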
Here $\alpha_t$ is a scalar, so each hypothesis $h_t$ is a linear combination of the scaled representations of two adjacent layers: $h_t(x) = \alpha_t \cdot o_t(x) - \alpha_{t-1} \cdot o_{t-1}(x)$. The first layer has no lower layer, so we introduce a virtual one with $\alpha_0 = 0$ and $o_0(x) = 0$.
Let the final output of the residual network be $F(x)$. Combining it with the definitions above, it is immediate that $F(x) = \sum_{t=1}^{T} h_t(x) = \alpha_T \cdot o_T(x)$: the sum telescopes, leaving only the top layer.
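The telescoping collapse can be checked numerically. The random vectors below are stand-ins for the block outputs $o_t(x)$; the identity holds for any choice of them.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 5, 3                        # depth and number of classes (toy sizes)
o = [np.zeros(C)]                  # o_0(x) = 0, the virtual lower layer
alpha = [0.0]                      # alpha_0 = 0
for t in range(1, T + 1):
    o.append(rng.normal(size=C))   # stand-in for the t-th block's output o_t(x)
    alpha.append(rng.uniform(0.5, 1.5))

# weak module classifier: h_t(x) = alpha_t * o_t(x) - alpha_{t-1} * o_{t-1}(x)
h = [alpha[t] * o[t] - alpha[t - 1] * o[t - 1] for t in range(1, T + 1)]

# the sum telescopes: F(x) = sum_t h_t(x) = alpha_T * o_T(x)
F = sum(h)
print(np.allclose(F, alpha[T] * o[T]))  # True
```
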
We only need to train the residual network level by level (residual block by residual block), which is equivalent to training a series of weak classifiers that form an ensemble. Besides the weights of the residual network itself, we also need to train some auxiliary parameters, namely the $\alpha_t$ and $W_t$ of each layer (these can be discarded after training).

2.2 Telescoping Sum Boosting

The main text of the paper takes the binary classification problem as an example; we are more interested in the multi-class problem, whose algorithms are given in the appendix. The pseudo code given in the paper is quite clear, so I copy it directly below. Here $\gamma_t$ is a scalar, and $C_t$ is an $m \times C$ (number of samples by number of categories) matrix, where $C_t(i, j)$ denotes the element in the $i$th row and $j$th column. Note that $s_t(x, l)$ denotes the $l$th element of $s_t(x)$ (the notation here is a bit loose :-), and $s_t(x) = \sum_{\tau=1}^{t} h_\tau(x) = \alpha_t \cdot o_t(x)$. Similarly to Algorithm 3, $f(g(x_i), l)$ denotes the $l$th element of $f(g(x_i))$, and $g(x_i, y_i)$ denotes the $y_i$th element of $g(x_i)$. Obviously, the minimization problem in Algorithm 4 can be optimized with SGD or solved numerically (Section 4.3 of [1]).

3. Theory

I did not study the theoretical part in detail. In brief, the authors prove that BoostResNet retains the advantages of a boosting algorithm: 1) the error decreases exponentially with the network depth (i.e., the number of weak classifiers); 2) it resists overfitting, since the model complexity grows only linearly with the network depth. See the paper for details.

4. Discussion

BoostResNet is characterized by layer-by-layer training, which brings a series of benefits:
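The layer-by-layer scheme just described can be sketched in a toy form. Loud assumptions: the residual blocks here are random and frozen (stand-ins for blocks trained by the paper's procedure), and the auxiliary classifier is fit by plain gradient descent on a softmax cross-entropy loss rather than by Algorithm 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_aux_classifier(features, labels, n_classes, steps=300, lr=0.1):
    """Fit an auxiliary linear classifier W on top of frozen features
    by gradient descent on softmax cross-entropy (illustration only)."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * features.T @ (p - onehot) / n   # cross-entropy gradient step
    return W

# toy 2-class data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(-2.0, 1.0, (50, 4)), rng.normal(2.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

o = X                                  # o_0: the input representation
for t in range(3):                     # add residual blocks one at a time
    R = rng.normal(scale=0.1, size=(o.shape[1], o.shape[1]))
    o = o + np.tanh(o @ R)             # frozen random block (stand-in for a trained one)
    W = fit_aux_classifier(o, y, n_classes=2)   # refit the auxiliary classifier per level

accuracy = ((o @ W).argmax(axis=1) == y).mean()
print(f"accuracy after layer-wise stages: {accuracy:.2f}")
```

The point of the sketch is the control flow, not the numbers: each stage extends the network by one block, fits auxiliaries on top of the current representation, and the auxiliaries from earlier stages are thrown away.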
4.2 Some Questions

The paper should compare more thoroughly with residual networks trained layer by layer (with or without freezing the weights of earlier layers), rather than only with the so-called e2eResNet.

References