Training deep residual neural networks based on boosting principle

1. Background

1.1 Boosting

Boosting [1] is a classic method for training ensemble models. One of its specific implementations, GBDT (gradient-boosted decision trees), is widely used in practice. There are already many articles introducing boosting, so I will not repeat them here. In short, boosting trains a series of weak learners one by one according to a specific criterion and then combines them with weights into a strong classifier (Figure 1).
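As a reminder of the general additive form (standard boosting notation, not the symbols used in this particular paper), the strong classifier after $T$ rounds is a weighted sum of the weak learners:

```latex
% Generic additive form of a boosted ensemble (standard notation; the
% weak learners h_t and their weights alpha_t are chosen greedily, round by round).
F_T(x) = \sum_{t=1}^{T} \alpha_t \, h_t(x)
```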

1.2 Residual Network

The residual network [2] is currently the most popular model for tasks such as image classification, and it has also been applied in fields such as speech recognition. The core of the residual network is the skip connection, or shortcut (Figure 2). This structure makes it easier for gradients to propagate backward, which makes it possible to train much deeper networks.
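Concretely, if $g_t(x)$ denotes the input to the $t$-th residual block and $f_t$ its residual branch, the block computes the following (this is the standard ResNet formulation from [2]; the subscripting is my own):

```latex
% Standard residual block: the identity shortcut adds the block input
% back onto the output of the residual branch f_t.
g_{t+1}(x) = g_t(x) + f_t\big(g_t(x)\big)
```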

In the earlier blog post Residual Network as Ensemble Model, we saw that some scholars regard the residual network as a special kind of ensemble model [3,4]. One of the authors of the paper discussed here is Robert Schapire (who, I just noticed, has joined Microsoft Research), the originator of AdaBoost together with Yoav Freund. The ensemble interpretation is by now one of the mainstream views.

2. Training Methods

2.1 Framework

  • Residual Network

That is, this is a linear classifier (Logistic Regression).

  • hypothesis module

Where $C$ is the number of categories in the classification task.

  • weak module classifier

Where $\alpha$ is a scalar, i.e., $h$ is a linear combination of the hypothesis modules of two adjacent layers. The first layer has no lower layer, so a virtual lower layer is introduced with $\alpha_0 = 0$ and $o_0(x) = 0$.

  • Expressing the residual network as an ensemble

Let the final (top-level) output of the residual network be $F(x)$; combining this with the definitions above, it is easy to see that:

The technique of splitting and summing (telescoping sum) is used here, so the author calls the proposed algorithm telescoping sum boosting.
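Putting the pieces together in one place (a sketch in my own indexing, inferred from the definitions above and from the paper rather than copied verbatim, so exact subscripts may differ by an offset):

```latex
% Hedged reconstruction of the framework, using the convention
% alpha_0 = 0 and o_0(x) = 0 stated in the text.
g_t(x) = g_{t-1}(x) + f_t\big(g_{t-1}(x)\big)                              % residual block
o_t(x) = \mathrm{softmax}\!\big(W_t^{\top} g_t(x)\big) \in \mathbb{R}^{C}   % hypothesis module
h_t(x) = \alpha_t \, o_t(x) - \alpha_{t-1} \, o_{t-1}(x)                    % weak module classifier
F(x)   = \sum_{t=1}^{T} h_t(x) = \alpha_T \, o_T(x)                         % telescoping sum
```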

We only need to train the residual network block by block (one residual block at a time), which is equivalent to training a series of weak classifiers and assembling them into an ensemble. In addition to the weights of the residual network itself, we also need to train some auxiliary parameters, namely the $\alpha$ and $W$ of each layer (these can be discarded once training is finished); a rough sketch of this procedure is given below.
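A minimal PyTorch-style sketch of what block-by-block training with disposable auxiliary heads could look like. The module structure, the plain cross-entropy loss, and the optimizer settings are my own illustrative choices, not the paper's exact algorithm:

```python
import torch
import torch.nn as nn

# Illustrative sketch of sequential (block-by-block) training of a residual
# network with per-block auxiliary classifiers. All hyperparameters and the
# plain cross-entropy objective are assumptions made for illustration.
# `loader` is assumed to yield (x, y) with x of shape (batch, feat_dim).

def train_blocks_sequentially(blocks, feat_dim, num_classes, loader, epochs=1):
    """Train residual branches f_t one at a time; earlier blocks stay frozen."""
    trained = []                                   # frozen branches f_1 ... f_{t-1}
    for block in blocks:
        # Auxiliary hypothesis module o_t(x) = softmax(W_t^T g_t(x));
        # the linear head W_t is discarded after this block is trained.
        head = nn.Linear(feat_dim, num_classes)
        opt = torch.optim.SGD(
            list(block.parameters()) + list(head.parameters()), lr=0.1)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():              # earlier blocks are frozen
                    for frozen in trained:
                        x = x + frozen(x)          # g_{tau} = g_{tau-1} + f_tau(g_{tau-1})
                feat = x + block(x)                # current block: g_t = g_{t-1} + f_t(g_{t-1})
                loss = nn.functional.cross_entropy(head(feat), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        for p in block.parameters():               # freeze the block just trained
            p.requires_grad_(False)
        trained.append(block)
    return trained                                  # auxiliary heads are thrown away
```

In the actual algorithm (Section 2.2), the training signal for each block comes from the boosting cost matrix and the $\alpha$ weights rather than a plain cross-entropy loss; the sketch only illustrates the freeze-train-discard structure.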

2.2 Telescoping Sum Boosting

The main text of the paper uses binary classification as its running example; we are more interested in the multi-class case, whose algorithms are given in the appendix. The pseudo code in the paper is quite clear, so I copy it directly below:

Here, $\gamma_t$ is a scalar; $C_t$ is an $m \times C$ matrix (number of samples times number of classes), and $C_t(i, j)$ denotes the element in its $i$th row and $j$th column.

It should be noted that $s_t(x, l)$ denotes the $l$th element of $s_t(x)$ (the notation here is a bit loose :-); and $s_t(x) = \sum_{\tau=1}^{t} h_\tau(x) = \alpha_t \cdot o_t(x)$.

Similarly to Algorithm 3, $f(g(x_i), l)$ denotes the $l$th element of $f(g(x_i))$, and $g(x_i, y_i)$ denotes the $y_i$th element of $g(x_i)$.

Obviously, the minimization problem given by Algorithm 4 can be optimized using SGD or solved numerically (Section 4.3 of [1]).
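As an illustration of the "solve numerically" option, here is the generic one-dimensional problem of choosing a combination weight $\alpha$ for a fixed confidence-rated weak learner under exponential loss, in the spirit of Section 4.3 of [1]; it is not the paper's exact objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Generic illustration (not the paper's exact objective): pick the weight
# alpha of a fixed confidence-rated weak learner under exponential loss,
# in the spirit of Sec. 4.3 of [1]. margins = y * h(x), with y in {-1, +1}
# and h(x) in [-1, +1].

def best_alpha(margins):
    """Numerically minimize the empirical exponential loss over the scalar alpha."""
    loss = lambda a: np.mean(np.exp(-a * margins))
    res = minimize_scalar(loss, bounds=(0.0, 10.0), method="bounded")
    return res.x

# Simulate a weak learner that is right about 60% of the time.
rng = np.random.default_rng(0)
signs = np.where(rng.random(1000) < 0.6, 1.0, -1.0)
margins = signs * np.abs(rng.uniform(-1.0, 1.0, size=1000))
print(best_alpha(margins))
```

The point is simply that once the weak learner is fixed, the remaining problem over a scalar weight is cheap to solve with any one-dimensional numerical optimizer.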

3. Theory

I did not go through the theoretical part in detail. In general, the authors prove that BoostResNet retains the advantages of a boosting algorithm: 1) the training error decreases exponentially with the network depth (i.e., with the number of weak classifiers); 2) it resists overfitting, since the model complexity grows only linearly with the network depth. For details, please refer to the paper.

4. Discussion

4.1 Advantages

BoostResNet is characterized by block-by-block (layer-wise) training, which brings a series of benefits:

  • Reduced memory usage (memory efficient), making it possible to train large, deep networks. (At present, thousand-layer residual networks can only be trained on CIFAR, mostly to satisfy curiosity.)
  • Reduced computation (computationally efficient): only a shallow model needs to be trained at each stage.
  • Because only shallow models need to be trained, there are more choices of optimization methods (non-SGD methods).
  • In addition, the number of network layers can be determined dynamically based on how training goes.

4.2 Some questions

The paper should have compared more against residual networks trained layer by layer (with and without freezing the weights of earlier layers), rather than only against the so-called e2eResNet.
The authors also mention in Section 1.1 that the training framework is not limited to ResNet, or even to neural networks. I do not know how well it would work for training ordinary deep models, and layer-wise pretraining now seems a bit outdated anyway.

References

  1. Schapire & Freund. Boosting: Foundations and Algorithms. MIT.
  2. He et al. Deep Residual Learning for Image Recognition.
  3. Veit et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
  4. Xie et al. Aggregated Residual Transformations for Deep Neural Networks.
