Training deep residual neural networks based on boosting principle

1. Background

1.1 Boosting

Boosting [1] is a classic method for training ensemble models. One of its specific implementations, GBDT (gradient-boosted decision trees), is widely used in practice. There are already many articles introducing boosting, so I will not repeat them here. In short, boosting trains a series of weak learners one by one according to a specific criterion and then combines them with weights into a strong classifier (Figure 1).
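As a reminder of the general additive form (standard boosting notation, not the symbols used in this particular paper), the strong classifier after $T$ rounds is a weighted sum of the weak learners:

```latex
% Generic additive form of a boosted ensemble (standard notation; the
% weak learners h_t and their weights alpha_t are chosen greedily, round by round).
F_T(x) = \sum_{t=1}^{T} \alpha_t \, h_t(x)
```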

1.2 Residual Network

The residual network [2] is currently the most popular model for tasks such as image classification, and it has also been applied in fields such as speech recognition. The core of the residual network is the skip connection, or shortcut (Figure 2). This structure makes it easier for gradients to propagate backward, which makes it possible to train much deeper networks.
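Concretely, if $g_t(x)$ denotes the input to the $t$-th residual block and $f_t$ its residual branch, the block computes the following (this is the standard ResNet formulation from [2]; the subscripting is my own):

```latex
% Standard residual block: the identity shortcut adds the block input
% back onto the output of the residual branch f_t.
g_{t+1}(x) = g_t(x) + f_t\big(g_t(x)\big)
```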

In the earlier blog post Residual Network as Ensemble Model, we saw that some scholars regard the residual network as a special kind of ensemble model [3,4]. One of the authors of the paper discussed here is Robert Schapire (who, I just noticed, has joined Microsoft Research), the originator of AdaBoost together with Yoav Freund. The ensemble interpretation is by now one of the mainstream views.

2. Training Methods

2.1 Framework

  • Residual Network

That is, this is a linear classifier (Logistic Regression).

  • hypothesis module

Where $C$ is the number of categories in the classification task.

  • weak module classifier

Where $\alpha$ is a scalar, i.e., $h$ is a linear combination of the hypothesis modules of two adjacent layers. The first layer has no lower layer, so a virtual lower layer is introduced with $\alpha_0 = 0$ and $o_0(x) = 0$.

  • Expressing the residual network as an ensemble

Let the final (top-level) output of the residual network be $F(x)$; combining this with the definitions above, it is easy to see that:

The technique of splitting and summing (telescoping sum) is used here, so the author calls the proposed algorithm telescoping sum boosting.
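Putting the pieces together in one place (a sketch in my own indexing, inferred from the definitions above and from the paper rather than copied verbatim, so exact subscripts may differ by an offset):

```latex
% Hedged reconstruction of the framework, using the convention
% alpha_0 = 0 and o_0(x) = 0 stated in the text.
g_t(x) = g_{t-1}(x) + f_t\big(g_{t-1}(x)\big)                              % residual block
o_t(x) = \mathrm{softmax}\!\big(W_t^{\top} g_t(x)\big) \in \mathbb{R}^{C}   % hypothesis module
h_t(x) = \alpha_t \, o_t(x) - \alpha_{t-1} \, o_{t-1}(x)                    % weak module classifier
F(x)   = \sum_{t=1}^{T} h_t(x) = \alpha_T \, o_T(x)                         % telescoping sum
```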

We only need to train the residual network block by block (one residual block at a time), which is equivalent to training a series of weak classifiers and assembling them into an ensemble. In addition to the weights of the residual network itself, we also need to train some auxiliary parameters, namely the $\alpha$ and $W$ of each layer (these can be discarded once training is finished); a rough sketch of this procedure is given below.
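A minimal PyTorch-style sketch of what block-by-block training with disposable auxiliary heads could look like. The module structure, the plain cross-entropy loss, and the optimizer settings are my own illustrative choices, not the paper's exact algorithm:

```python
import torch
import torch.nn as nn

# Illustrative sketch of sequential (block-by-block) training of a residual
# network with per-block auxiliary classifiers. All hyperparameters and the
# plain cross-entropy objective are assumptions made for illustration.
# `loader` is assumed to yield (x, y) with x of shape (batch, feat_dim).

def train_blocks_sequentially(blocks, feat_dim, num_classes, loader, epochs=1):
    """Train residual branches f_t one at a time; earlier blocks stay frozen."""
    trained = []                                   # frozen branches f_1 ... f_{t-1}
    for block in blocks:
        # Auxiliary hypothesis module o_t(x) = softmax(W_t^T g_t(x));
        # the linear head W_t is discarded after this block is trained.
        head = nn.Linear(feat_dim, num_classes)
        opt = torch.optim.SGD(
            list(block.parameters()) + list(head.parameters()), lr=0.1)
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():              # earlier blocks are frozen
                    for frozen in trained:
                        x = x + frozen(x)          # g_{tau} = g_{tau-1} + f_tau(g_{tau-1})
                feat = x + block(x)                # current block: g_t = g_{t-1} + f_t(g_{t-1})
                loss = nn.functional.cross_entropy(head(feat), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        for p in block.parameters():               # freeze the block just trained
            p.requires_grad_(False)
        trained.append(block)
    return trained                                  # auxiliary heads are thrown away
```

In the actual algorithm (Section 2.2), the training signal for each block comes from the boosting cost matrix and the $\alpha$ weights rather than a plain cross-entropy loss; the sketch only illustrates the freeze-train-discard structure.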

2.2 Telescoping Sum Boosting

The main text of the paper uses binary classification as its running example; we are more interested in the multi-class case, whose algorithms are given in the appendix. The pseudo code in the paper is quite clear, so I copy it directly below:

Here, $\gamma_t$ is a scalar; $C_t$ is an $m \times C$ matrix (number of samples times number of classes), and $C_t(i, j)$ denotes the element in its $i$th row and $j$th column.

It should be noted that $s_t(x, l)$ denotes the $l$th element of $s_t(x)$ (the notation here is a bit loose :-); and $s_t(x) = \sum_{\tau=1}^{t} h_\tau(x) = \alpha_t \cdot o_t(x)$.

Similarly to Algorithm 3, $f(g(x_i), l)$ denotes the $l$th element of $f(g(x_i))$, and $g(x_i, y_i)$ denotes the $y_i$th element of $g(x_i)$.

Obviously, the minimization problem given by Algorithm 4 can be optimized using SGD or solved numerically (Section 4.3 of [1]).
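As an illustration of the "solve numerically" option, here is the generic one-dimensional problem of choosing a combination weight $\alpha$ for a fixed confidence-rated weak learner under exponential loss, in the spirit of Section 4.3 of [1]; it is not the paper's exact objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Generic illustration (not the paper's exact objective): pick the weight
# alpha of a fixed confidence-rated weak learner under exponential loss,
# in the spirit of Sec. 4.3 of [1]. margins = y * h(x), with y in {-1, +1}
# and h(x) in [-1, +1].

def best_alpha(margins):
    """Numerically minimize the empirical exponential loss over the scalar alpha."""
    loss = lambda a: np.mean(np.exp(-a * margins))
    res = minimize_scalar(loss, bounds=(0.0, 10.0), method="bounded")
    return res.x

# Simulate a weak learner that is right about 60% of the time.
rng = np.random.default_rng(0)
signs = np.where(rng.random(1000) < 0.6, 1.0, -1.0)
margins = signs * np.abs(rng.uniform(-1.0, 1.0, size=1000))
print(best_alpha(margins))
```

The point is simply that once the weak learner is fixed, the remaining problem over a scalar weight is cheap to solve with any one-dimensional numerical optimizer.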

3. Theory

I did not go through the theoretical part in detail. In general, the authors prove that BoostResNet retains the advantages of a boosting algorithm: 1) the training error decreases exponentially with the network depth (i.e., with the number of weak classifiers); 2) it resists overfitting, since the model complexity grows only linearly with the network depth. For details, please refer to the paper.

4. Discussion

4.1 Advantages

BoostResNet is characterized by block-by-block (layer-wise) training, which brings a series of benefits:

  • Reduced memory usage (memory efficient), making it possible to train large, deep networks. (At present, thousand-layer residual networks can only be trained on CIFAR, mostly to satisfy curiosity.)
  • Reduced computation (computationally efficient): only a shallow model needs to be trained at each stage.
  • Because only shallow models need to be trained, there are more choices of optimization methods (non-SGD methods).
  • In addition, the number of network layers can be determined dynamically based on how training goes.

4.2 Some questions

The paper should have compared more against residual networks trained layer by layer (with and without freezing the weights of earlier layers), rather than only against the so-called e2eResNet.
The authors also mention in Section 1.1 that the training framework is not limited to ResNet, or even to neural networks. I do not know how well it would work for training ordinary deep models, and layer-wise pretraining now seems a bit outdated anyway.

References

  1. Schapire & Freund. Boosting: Foundations and Algorithms. MIT.
  2. He et al. Deep Residual Learning for Image Recognition.
  3. Veit et al. Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
  4. Xie et al. Aggregated Residual Transformations for Deep Neural Networks.
