Anyone who has studied neural networks knows the most common training method: the backpropagation (BP) algorithm. Through BP we can train the network iteratively until it approximates the function we want to fit arbitrarily closely, so that in the end the trained network performs well on both the training set and the test set!
So what exactly is the BP algorithm? Why does it let us move toward an optimum step by step (even if that optimum may be local rather than global, there are other techniques for escaping local optima)? Is there a mathematical principle behind it? Over the past few days I have organized my notes on this topic and written them up, first as a record for myself, and second to share with everyone, so that we can avoid misunderstandings and learn from each other.
For the details of the BP algorithm itself, please refer to this Zhihu column (it walks through the BP process step by step and will deepen your understanding). Here let's tackle a different question: why does the BP algorithm lead to better results step by step? First, let's look at how a neural network operates. Suppose we have the following simple network, as shown in the figure:
We define the symbols as follows:
Then we can get the following formula by forward propagation once:
If the loss function C is defined as
We hope the values predicted by the trained network are as close to the true values as possible. Setting the SGD method aside for now, the most brute-force approach is to ask that C be minimized for a given training example. Since only the w parameters in the expression for C are unknown, we can regard C as a function of all the w parameters, and the task becomes finding the minimum of this multivariate function. With that, we have successfully turned a neural network problem into a mathematical optimization problem.
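To make the forward pass and the cost concrete, here is a minimal sketch in plain Python. The original figures defining the notation are not reproduced here, so I assume a standard tiny 2-2-1 network with sigmoid activations and the quadratic cost C = ½(y − a)²; all weight values below are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    # logistic activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # hidden layer: h_j = sigmoid(sum_i W1[j][i] * x[i] + b1[j])
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # output neuron: a = sigmoid(sum_j W2[j] * h[j] + b2)
    return sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

def cost(a, y):
    # quadratic cost C = 1/2 * (y - a)^2 for a single training example
    return 0.5 * (y - a) ** 2

# hypothetical weights for a 2-2-1 network
W1 = [[0.1, 0.4], [0.8, 0.6]]
b1 = [0.0, 0.0]
W2 = [0.3, 0.9]
b2 = 0.0

a = forward([1.0, 0.5], W1, b1, W2, b2)
print(cost(a, 1.0))  # one forward propagation, then the cost
```

Viewed this way, with the input x and target y fixed, the printed cost is purely a function of the entries of W1, b1, W2, b2, which is exactly the multivariate function we want to minimize.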
So we have reduced the problem a neural network needs to solve to the optimization of a multivariate function. The question now is how to modify w so that C gets closer and closer to its minimum. The standard method is gradient descent (for why the direction opposite to the gradient is the direction of fastest descent, please see my next article; it is not the main topic here). This may still feel a bit abstract, so let's work through a very simple example.
Suppose our network is very simple, as shown below (the symbols are the same as above):
Then we can get:
where
Only the w parameters are unknown, so C can be regarded as a function of two variables, the two weights (the advantage of a two-variable function is that we can visualize it in three-dimensional coordinates, which makes it easy to understand~). The picture comes from the Internet:
Let's go through the algorithm process:
We start by randomly initializing the w parameters, which corresponds to point A on the graph.
Our goal is to reach the lowest point F, so we move in the opposite direction of the gradient. The formula is as follows:
The size of each step is determined by the learning rate mentioned above. Suppose the next step takes us to point B; iterating like this, if the surface has only one minimum, then after enough iterations we reach point F, and our problem is solved.
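The steps above can be sketched in a few lines of Python. As a stand-in for the plotted surface (which is not reproduced here), I use the bowl-shaped two-variable function C(w1, w2) = w1² + w2², whose minimum, the analogue of point F, is at the origin; the starting point and learning rate are arbitrary choices for illustration.

```python
def grad_C(w):
    # analytic gradient of the stand-in surface C(w1, w2) = w1^2 + w2^2
    return [2 * w[0], 2 * w[1]]

def gradient_descent(w, lr=0.1, steps=100):
    # repeatedly step opposite the gradient: w <- w - lr * grad_C(w)
    for _ in range(steps):
        g = grad_C(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w = gradient_descent([3.0, -4.0])  # "point A": an arbitrary start
print(w)  # both coordinates shrink toward 0, the minimum ("point F")
```

Each iteration multiplies both coordinates by (1 − 2·lr) = 0.8 here, so after 100 steps they are vanishingly small; with a too-large learning rate the same update would overshoot and diverge instead.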
Above we used a simple two-variable function as an example, so every step from the analysis to the final result could be visualized. When the network becomes complex, the principle of finding the optimum of a multivariate function is exactly the same! That covers the knowledge points of this article. Friends are welcome to point out mistakes and discuss~
When I was studying this, I already understood everything above, but I kept wondering: since we end up with a multivariate function of w, why not simply take the partial derivative with respect to each w and update it directly? Why did neural networks only become practical, and popular, after the BP algorithm appeared? In other words: why can't we take the partial derivatives directly, and why was BP necessary to make neural networks so applicable? Here are my thoughts and understanding (discussion welcome~).
1. Why can't we solve for the derivative directly?
In a neural network, because of the activation functions, the cost function we end up with is not a linear function of the w parameters. Take the simplest example:
Differentiating this function with respect to w and setting the result to zero gives an equation with no analytical (closed-form) solution, which explains why we cannot simply solve for the optimal w directly.
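To see why no closed form exists, here is a toy derivation of my own (the original formula image is not reproduced, so the notation below is an assumption): a single sigmoid neuron with quadratic cost over n training examples (x_i, y_i).

```latex
% Cost of a single sigmoid neuron over n training examples:
C(w) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\sigma(w x_i) - y_i\bigr)^2,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
% The first-order condition for a minimum:
\frac{dC}{dw} = \sum_{i=1}^{n}\bigl(\sigma(w x_i) - y_i\bigr)\,
\sigma'(w x_i)\, x_i = 0
```

The first-order condition is a transcendental equation in w: for n > 1 the sum of sigmoid terms cannot be inverted symbolically, so there is no closed-form solution and we must fall back on iterative methods such as gradient descent.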
2. Since we cannot solve for the derivative analytically, can we approximate it numerically? For example, we can use the finite-difference formula
With this formula we can approximately compute the derivative with respect to each parameter, and the smaller the interval, the closer the approximation. So why can't we do this? Why did we have to wait for the BP algorithm? Think about it...
A: Because of the amount of computation. Suppose our network has 1 million weights. To approximate the partial derivative with respect to each weight, we must compute a change in C, and each such change requires one complete forward propagation. So for each training example we need 1,000,001 forward propagations (the one extra is to compute the baseline C). The BP algorithm, by contrast, computes the partial derivatives of all parameters with a single backward propagation, for a total of just two propagations (one forward, one backward). That, I think, answers why we cannot use the approximate method: it is simply too slow, and the computational cost too high~ Each propagation involves heavy matrix computation when there are many parameters, which the machines of that era could not bear at all~ So it was only with the emergence of BP that neural networks became fast enough to apply.
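The counting argument above can be verified directly. This sketch (my own toy example: a single-neuron "network" with 3 weights, hypothetical inputs) computes the finite-difference gradient while counting forward propagations, showing that n weights cost n + 1 passes:

```python
import math

calls = 0  # counts how many forward propagations we perform

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def C(w, x, y):
    # one complete forward pass through a toy single-neuron "network",
    # then the quadratic cost for a single example
    global calls
    calls += 1
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (a - y) ** 2

def numeric_grad(w, x, y, eps=1e-6):
    # finite differences: perturb each weight in turn and re-run the network
    base = C(w, x, y)                 # 1 forward pass for the baseline
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        grad.append((C(w_plus, x, y) - base) / eps)  # +1 pass per weight
    return grad

w = [0.2, -0.5, 0.7]
g = numeric_grad(w, [1.0, 2.0, 3.0], 0.0)
print(calls)  # n + 1 = 4 forward passes for n = 3 weights
```

Scale the same loop up to a million weights and the cost of one gradient becomes a million forward passes, whereas backpropagation reuses the intermediate values of a single forward pass to get every partial derivative in one backward sweep.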
The above is just my personal understanding. Thanks to Tokugawa for his help! Friends are welcome to ask questions and exchange ideas~ Below are the materials and blogs I studied from: *Neural Networks and Deep Learning* (need the Chinese version? feel free to leave a message or email me); "Zero-based entry deep learning (1): Perceptron".