I recently studied the BP (backpropagation) algorithm systematically and wrote up this study note. My understanding is limited, so if you find any obvious mistakes, please point them out.

What is Gradient Descent and the Chain Rule

Suppose we have a function J(w), as shown below.

[Figure: gradient descent diagram]

We want to know for what value of w the function J(w) reaches its minimum. From the graph we can see that the minimum lies to the left of the initial position, which means that to minimize J(w), the value of w must decrease. The slope of the tangent line at the initial position is a > 0 (that is, the derivative at that point is greater than 0), so updating w = w - a decreases w (in practice the derivative is scaled by a learning rate, discussed below). Repeating this update moves w toward the value that minimizes J(w). If J(w) contains multiple variables, we take the partial derivative with respect to each variable and update each variable in the same way.

The chain rule is simply the rule for differentiating a composite function: if y = f(u) and u = g(x), then

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

An example makes this clearer: if $y = \sin(x^2)$, then $\frac{dy}{dx} = \cos(x^2) \cdot 2x$.

The structure of a neural network

A neural network consists of three parts: the input layer on the far left, the hidden layers (in real applications there is usually far more than one), and the output layer on the far right. Adjacent layers are connected by lines, and each connection line carries a weight w. In addition, apart from the input layer, each neuron generally has a corresponding bias b.

[Figure: neural network structure diagram]

Except for the neurons in the input layer, each neuron has an input value z, obtained by a weighted sum of the previous layer's outputs, and an output value a, obtained by passing z through the Sigmoid function (also called the activation function), a nonlinear transformation. The formulas are

$z^l_j = \sum_i w^l_{ij}\, a^{l-1}_i + b^l_j, \qquad a^l_j = \sigma(z^l_j)$

Here l and j denote the jth neuron in the lth layer; the subscript ij denotes the connection line from the ith neuron to the jth neuron; w denotes a weight and b a bias. The symbols below follow the same conventions, so they will not be explained again.

[Animation: how the input and output values of each neuron are computed. Note that the animation omits the bias, which is added in actual use.]

The reason for using an activation function is that a linear model (which cannot handle linearly inseparable cases) has insufficient expressive power, so a Sigmoid function is usually applied here to introduce nonlinearity and produce the neuron's output value. As to why a linear model is not expressive enough, you can click here to view the discussion on Zhihu. The Sigmoid function is

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Its range is (0, 1), so for multi-class classification tasks each neuron in the output layer can represent the probability of belonging to that class. Of course there are other activation functions as well, with different uses, advantages, and disadvantages.
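To make these formulas concrete, here is a tiny runnable Python sketch of a single neuron; the numbers and variable names are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with three inputs: z = sum_i w_i * a_i + b, then a = sigmoid(z).
a_prev = np.array([0.5, 0.1, 0.9])  # outputs of the previous layer
w = np.array([0.2, -0.4, 0.7])      # weights on the incoming connection lines
b = 0.1                             # the neuron's bias

z = np.dot(w, a_prev) + b           # weighted-sum input value z
a = sigmoid(z)                      # output value a, always in (0, 1)
print(z, a)                         # 0.79 and roughly 0.688
```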
The process of BP algorithm execution (forward pass and backward update)

After the number of layers, the number of neurons per layer, and the learning rate η (mentioned below) are set by hand, the BP algorithm first randomly initializes the weight of every connection line and every bias. Then, for each input x and expected output y in the training set, it performs a forward pass to obtain a predicted value, and then performs a backward pass that uses the error between the true value and the predicted value to update the weight of every connection line and the bias of every neuron in the network. This process is repeated until a stopping condition is reached. The stopping condition is typically one of the following three: the updates to the weights fall below some threshold; the error rate on the training (or validation) data falls below some threshold; or a preset maximum number of iterations is reached.
For example, in handwritten digit recognition, a picture of a handwritten digit is stored as 28 × 28 = 784 pixels, each holding a grayscale value in the range [0, 255]. The input layer therefore has 784 neurons, and the output layer has 10 neurons representing the digits 0 to 9, each taking a value between 0 and 1 that represents the probability that the picture shows that digit. Each time a picture (that is, an instance) is fed in, the neural network performs a forward pass, computing the values of the neurons layer by layer up to the output layer, and predicts the handwritten digit as the one whose output neuron has the largest value. Then, based on the output neuron values, the error between the predicted value and the true value is computed, and backward propagation is used to update the weight of every connection line and the bias of every neuron in the network.

Feed-forward

The forward pass computes the output values of all neurons layer by layer: input layer => hidden layers => output layer.

Back-propagation

Because the values at the output layer will differ from the true values, we can use the mean squared error to measure the gap between the predicted value and the true value:

$E = \frac{1}{2}\sum_j (y_j - a_j)^2$

The goal of the backward pass is to make E as small as possible. The output value of each neuron is determined by the weights on that neuron's incoming connection lines and by its bias, so to minimize the error function we need to adjust the values of w and b:

$w \leftarrow w - \eta \frac{\partial E}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial E}{\partial b}$

That is, taking the partial derivatives of the objective E with respect to w and b gives the update amounts for w and b. Here η is the learning rate, usually 0.1 ~ 0.3, which can be understood as the size of the step taken along each gradient. Let us now derive the partial derivative with respect to w as an example. Note that the weight w_{hj} first affects the input value z_j of the jth output-layer neuron, and through it the output value a_j. By the chain rule,

$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{hj}}$

From the definition of the neuron input value, $z_j = \sum_h w_{hj}\, a_h + b_j$, so

$\frac{\partial z_j}{\partial w_{hj}} = a_h$

The derivative of the Sigmoid function has the following form, from which we can see that it is also very convenient to implement on a computer:

$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$

so

$\frac{\partial a_j}{\partial z_j} = a_j(1 - a_j), \qquad \frac{\partial E}{\partial a_j} = a_j - y_j$

Writing $\delta_j = (a_j - y_j)\, a_j (1 - a_j)$, the update amount of the weight w is

$\Delta w_{hj} = -\eta\, \delta_j\, a_h$

and similarly the update amount of b is

$\Delta b_j = -\eta\, \delta_j$

However, these two formulas can only update the weights of the connection lines between the output layer and the layer before it, and the biases of the output layer. The reason is that the δ value depends on the true value y, but we only know the true values at the output layer, not at each hidden layer, which makes it impossible to compute the δ values of the hidden layers directly. We therefore hope to use the δ values of layer l + 1 to compute the δ values of layer l, which can be done through a series of mathematical transformations. This is the origin of the name "back-propagation". The formula is

$\delta^l_i = a^l_i\,(1 - a^l_i) \sum_j w^{l+1}_{ij}\, \delta^{l+1}_j$

From the formula we can see that to compute the δ value of a layer we only need the weights and δ values of the next layer together with the neuron's own output value. By applying this formula repeatedly we can update all the weights and biases of the hidden layers.
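Before deriving this hidden-layer formula, here is a quick numeric sanity check of the output-layer update formulas; it is only a sketch, and all numbers are invented for illustration:

```python
import numpy as np

eta = 0.3                       # learning rate
a_h = np.array([0.6, 0.8])      # outputs a_h of the previous layer
w_hj = np.array([0.4, -0.2])    # weights into output neuron j
b_j = 0.05                      # bias of neuron j
y_j = 1.0                       # true value

z_j = np.dot(w_hj, a_h) + b_j         # z_j = sum_h w_hj * a_h + b_j
a_j = 1.0 / (1.0 + np.exp(-z_j))      # a_j = sigmoid(z_j)

delta_j = (a_j - y_j) * a_j * (1.0 - a_j)   # delta_j = (a_j - y_j) a_j (1 - a_j)

w_hj_new = w_hj - eta * delta_j * a_h   # Delta w_hj = -eta * delta_j * a_h
b_j_new = b_j - eta * delta_j           # Delta b_j  = -eta * delta_j
print(delta_j, w_hj_new, b_j_new)      # y_j > a_j here, so w and b increase
```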
Before the derivation, look at the following picture:

[Figure: the ith neuron in layer l and the neurons in layer l + 1]

First, note that the ith neuron in layer l is connected to every neuron in layer l + 1, so E can be regarded as a function of the input values z of all the neurons in layer l + 1. Since the δ of a neuron is exactly the partial derivative of E with respect to that neuron's input value z, we can expand δ into the following formula, where n is the number of neurons in layer l + 1:

$\delta^l_i = \frac{\partial E}{\partial z^l_i} = \sum_{j=1}^{n} \frac{\partial E}{\partial z^{l+1}_j} \cdot \frac{\partial z^{l+1}_j}{\partial a^l_i} \cdot \frac{\partial a^l_i}{\partial z^l_i} = \sum_{j=1}^{n} \delta^{l+1}_j\, w^{l+1}_{ij}\, a^l_i\,(1 - a^l_i)$

Simplifying gives the formula above. The derivation here only covers the key steps; if you want a more detailed derivation, you can click here to download a pdf document that I consulted while learning, whose derivation is very thorough. I also referred to the neural-network chapter of Machine Learning by Zhou Zhihua and to the book Neural Networks and Deep Learning.

Python source code analysis

The source code comes from Michael Nielsen's online deep learning tutorial, which is entirely in English; I have annotated it based on my own understanding and the theory above.

>> Click here to view the organized code and the digit-recognition example <<

The neural network implemented in Python is not long: it contains only a single Network class. First, let's look at the class's constructor.
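The listings below are sketches in the style of Nielsen's network.py, lightly adapted to Python 3, so details may differ from his original. The constructor takes a list of layer sizes and randomly initializes every weight and bias from a standard normal distribution:

```python
import random
import numpy as np

class Network:
    def __init__(self, sizes):
        """sizes, e.g. [784, 30, 10], lists the number of neurons per layer."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        # One bias column vector per non-input layer.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # weights[k] has shape (sizes[k+1], sizes[k]): row j holds the
        # weights on the connection lines into neuron j of the next layer.
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```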
Next, the feedforward method: it computes the output of the network layer by layer, feeding each layer's output a into the next layer as input.
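A sketch of the method, continuing the Network class above; sigmoid is a module-level helper:

```python
    def feedforward(self, a):
        """Return the network's output for the input column vector a."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)   # a^l = sigma(w a^{l-1} + b)
        return a

def sigmoid(z):
    """Module-level helper: the Sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))
```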
The source code uses Stochastic Gradient Descent (SGD), whose principle is similar to gradient descent; the difference is that in each iteration the stochastic algorithm uses only a subset of the samples in the data set to update the values of w and b. It is faster than full-batch gradient descent, but it is not guaranteed to settle exactly at a local minimum and may hover around it.
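A sketch of the SGD method, continuing the Network class: it shuffles the training data each epoch, slices it into mini-batches, and hands each batch to update_mini_batch:

```python
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train with mini-batch stochastic gradient descent.

        training_data is a list of (x, y) pairs; eta is the learning rate.
        """
        n = len(training_data)
        for j in range(epochs):
            # A new random split into mini-batches each epoch.
            random.shuffle(training_data)
            mini_batches = [training_data[k:k + mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {}: {} / {}".format(
                    j, self.evaluate(test_data), len(test_data)))
            else:
                print("Epoch {} complete".format(j))
```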
The update_mini_batch method updates the values of w and b according to the partial derivatives returned by the backprop method, averaged over the mini-batch.
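A sketch of the method:

```python
    def update_mini_batch(self, mini_batch, eta):
        """Apply one gradient-descent step using a single mini-batch."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Sum the per-sample gradients returned by backprop.
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # w <- w - (eta / m) * summed gradient, and likewise for b.
        m = len(mini_batch)
        self.weights = [w - (eta / m) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / m) * nb
                       for b, nb in zip(self.biases, nabla_b)]
```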
The following code is the core of the source: the implementation of the BP algorithm itself, covering both the forward pass and the backward pass. (The forward pass on its own also has a dedicated method in Network, the feedforward method shown above, which is used to check the accuracy of the trained network; see evaluate below.)
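A sketch of backprop, which returns the gradients layer by layer; note how the backward loop implements the δ recursion derived above (sigmoid_prime is another module-level helper):

```python
    def backprop(self, x, y):
        """Return (nabla_b, nabla_w), the gradients of E for one sample."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Forward pass: remember every z and every activation a.
        activation = x
        activations = [x]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # Output layer: delta = (a - y) * sigma'(z).
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Hidden layers, moving backwards: the delta recursion derived above.
        for l in range(2, self.num_layers):
            delta = np.dot(self.weights[-l + 1].transpose(), delta) \
                * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
        return (nabla_b, nabla_w)

def sigmoid_prime(z):
    """Derivative of the Sigmoid: sigma(z) * (1 - sigma(z))."""
    return sigmoid(z) * (1.0 - sigmoid(z))
```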
Next comes evaluate, which calls the feedforward method to compute the output-layer neuron values (that is, the predicted values) of the trained network, and then compares the predictions with the correct values to obtain the accuracy.
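A sketch of the method:

```python
    def evaluate(self, test_data):
        """Count how many test samples the network classifies correctly:
        the prediction is the index of the most activated output neuron."""
        results = [(np.argmax(self.feedforward(x)), y)
                   for (x, y) in test_data]
        return sum(int(pred == y) for (pred, y) in results)
```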
Finally, we can use this source code to train a neural network for handwritten digit recognition and output the evaluation results. The code is as follows:
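A sketch of the training script, assuming mnist_loader, the companion data-loading module from Nielsen's tutorial; the hyperparameters (30 epochs, mini-batch size 10, η = 3.0) follow his example:

```python
# mnist_loader is the companion module from Nielsen's tutorial; it returns
# the MNIST images as (784, 1) column vectors (one-hot labels for training).
import mnist_loader

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

# 784 input neurons, one hidden layer of 30 neurons, 10 output neurons.
net = Network([784, 30, 10])
# 30 epochs, mini-batches of 10 samples, learning rate eta = 3.0.
net.SGD(list(training_data), 30, 10, 3.0, test_data=list(test_data))
```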
After 30 epochs, the accuracy of the handwriting-recognition network is about 95%. Of course, different numbers of iterations, learning rates, and mini-batch sizes all affect the accuracy; how to tune these parameters is a craft of its own, which I will leave for a later post.

Summary

Advantages of neural networks:

- The network essentially implements a mapping from input to output, and it has been proven mathematically that it can realize any complex nonlinear mapping, which makes it particularly suitable for problems with complex internal mechanisms.
- The network can automatically extract "reasonable" rules by learning from a set of instances with correct answers; that is, it can learn on its own.
- The network has a certain ability to generalize to unseen inputs.

Disadvantages of neural networks:

- It is very sensitive to the initial weights and easily converges to a local minimum.
- It is prone to overfitting and overtraining.
- There is no principled procedure for choosing the number of hidden layers and neurons; sometimes it feels like guesswork.

Application areas: common ones include image classification, autonomous driving, and natural language processing.

TODO

In practice there are still many pitfalls in training a neural network (parameter tuning among them); I plan to write about them later.
References

[1] Zhou Zhihua, Machine Learning.
[2] Stanford University, Machine Learning online course.
[3] David E. Rumelhart, James L. McClelland (eds.), Parallel Distributed Processing (1986), Chapter 8: Learning Internal Representations by Error Propagation.
[4] How the backpropagation algorithm works.
[5] Backpropagation Algorithm.
[6] Chain Rule, a calculus course from the Chinese University of Science and Technology in Taiwan (YouTube video; you may need a VPN to access it). By the way, I recommend their math videos, which are very easy to follow.