1. Background

1.1 Boosting

Boosting [1] is a classic method for training ensemble models. One of its specific implementations, GBDT, is widely used across many problems. Many articles already introduce boosting, so the details are not repeated here. In short, boosting trains a series of weak classifiers one at a time according to a specific criterion, and the weak classifiers are then weighted and combined into a strong classifier (Figure 1).

1.2 Residual Network

The residual network [2] is currently the most popular model for tasks such as image classification, and has also been applied to fields such as speech recognition. The core of the residual network is the skip connection, or shortcut (Figure 2). This structure makes it easier for the gradient to propagate backward, which makes it possible to train deeper networks.

As discussed in the previous blog post, Residual Network as Ensemble Model, some researchers regard the residual network as a special kind of ensemble model [3,4]. One of the authors of the paper is Robert Schapire (just noticed that he has joined Microsoft Research), who originated AdaBoost together with Yoav Freund. The ensemble view is basically one of the mainstream views.

2. Training Methods

2.1 Framework
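Each residual block is paired with a hypothesis module $o_t(x) \in \mathbb{R}^C$, which is a linear classifier (logistic regression) with weights $W_t$ applied to the output of that block. Here $C$ is the number of categories in the classification task.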
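As a minimal sketch of such a hypothesis module, assume it is a plain linear map from one block's feature vector to $C$ class scores. The names `hypothesis_module`, `predict`, `features` and `W` are illustrative and not taken from the paper, and the toy numbers are made up.

```python
def hypothesis_module(features, W):
    """Linear classifier: map an n-dim feature vector to C class scores.

    features: list of n floats (assumed output of one residual block)
    W: n-by-C weight matrix as a list of n rows (hypothetical auxiliary weights)
    """
    n_classes = len(W[0])
    return [
        sum(features[i] * W[i][c] for i in range(len(features)))
        for c in range(n_classes)
    ]


def predict(scores):
    """Pick the index of the class with the largest score."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Toy example: n = 2 features, C = 3 classes.
W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 1.0]]
scores = hypothesis_module([0.5, 1.0], W)
label = predict(scores)
```

Because the module is only a linear read-out, it adds few parameters per block, which fits the remark that these auxiliary weights can be thrown away once training is finished.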
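The weak classifier $h_t$ is defined as

$$h_t(x) = \alpha_t \cdot o_t(x) - \alpha_{t-1} \cdot o_{t-1}(x)$$

where $\alpha_t$ is a scalar; that is, $h_t$ is a linear combination of the hypothesis modules of two adjacent layers. The first layer has no lower layer, so a virtual lower layer with $\alpha_0 = 0$ and $o_0(x) = 0$ is assumed.

Let the final output of the residual network be $F(x)$. Combining this with the definition above, the sum telescopes, so for a network of $T$ residual blocks:

$$F(x) = \sum_{t=1}^{T} h_t(x) = \alpha_T \cdot o_T(x)$$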
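The telescoping structure can be checked numerically. The sketch below uses made-up scalars $\alpha_t$ and made-up hypothesis outputs $o_t(x)$ for a single input. It assumes each weak classifier is the difference of two adjacent scaled hypothesis modules, with $\alpha_0 = 0$ and $o_0(x) = 0$ for the virtual layer, and computes the partial sums so they can be compared with $\alpha_t \cdot o_t(x)$. All function and variable names are illustrative.

```python
def weak_classifiers(alphas, outputs):
    """Build h_t = alpha_t * o_t - alpha_{t-1} * o_{t-1} for t = 1..T.

    alphas[t] and outputs[t] are indexed from 0, where index 0 is the
    virtual layer with alpha_0 = 0 and o_0(x) = 0.
    """
    hs = []
    for t in range(1, len(alphas)):
        h_t = [alphas[t] * o - alphas[t - 1] * o_prev
               for o, o_prev in zip(outputs[t], outputs[t - 1])]
        hs.append(h_t)
    return hs


def partial_sum(hs, t):
    """s_t = h_1 + ... + h_t, computed element-wise over the C classes."""
    n_classes = len(hs[0])
    return [sum(h[c] for h in hs[:t]) for c in range(n_classes)]


# Made-up values: T = 3 blocks, C = 2 classes.
alphas = [0.0, 0.5, 1.0, 2.0]
outputs = [[0.0, 0.0], [2.0, -2.0], [1.0, 3.0], [4.0, -1.0]]

hs = weak_classifiers(alphas, outputs)
F = partial_sum(hs, 3)                       # sum over all weak classifiers
expected = [alphas[3] * o for o in outputs[3]]  # alpha_T * o_T(x)
```

With these numbers `F` and `expected` agree, since every intermediate term $\alpha_t \cdot o_t(x)$ is added once and subtracted once.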
We only need to train the residual network one level (residual block) at a time, which is equivalent to training a series of weak classifiers for the ensemble. In addition to the weights of the residual network, we also need to train some auxiliary parameters, namely the $\alpha$ and $W$ of each layer (these can be discarded after training).

2.2 Telescoping Sum Boosting

The main text of the paper uses binary classification as its example. We are more concerned with multi-class classification, for which the relevant algorithms are in the appendix. The pseudocode given in the paper is quite clear, so I will copy it directly as follows:

Among them, $\gamma_t$ is a scalar; $C_t$ is an $m$-by-$C$ matrix (number of samples times number of categories), and $C_t(i, j)$ is the element in the $i$th row and $j$th column. Note that $s_t(x, l)$ denotes the $l$th element of $s_t(x)$ (the notation here is slightly arbitrary :-), and $s_t(x) = \sum_{\tau=1}^{t} h_\tau(x) = \alpha_t \cdot o_t(x)$.

Similar to Algorithm 3, $f(g(x_i), l)$ denotes the $l$th element of $f(g(x_i))$, and $g(x_i, y_i)$ denotes the $y_i$th element of $g(x_i)$. The minimization problem given by Algorithm 4 can be optimized using SGD or solved numerically (Section 4.3 of [1]).

3. Theory

I didn't look at the theoretical part in detail. In general, the authors prove that BoostResNet retains the advantages of a boosting algorithm: 1) the error decreases exponentially with the network depth (i.e. the number of weak classifiers); 2) it resists overfitting, since the model complexity grows only linearly with the network depth. For details, please refer to the paper.

4. Discussion

BoostResNet is characterized by layer-by-layer training, which has a series of benefits:
4.2 Some questions

The paper should compare more against residual networks trained layer by layer (with or without fixing the weights of previous layers), rather than only against the so-called e2eResNet.

References