I recently studied the BP (backpropagation) algorithm systematically and wrote up this study note. My understanding is limited, so if you find any obvious mistakes, please point them out.

What is Gradient Descent and the Chain Rule

Suppose we have a function J(w), as shown below.

[Figure: gradient descent diagram]

We want to know for what value of w the function J(w) reaches its minimum. From the graph we can see that the minimum lies to the left of the initial position, which means that to minimize J(w), the value of w must decrease. The slope of the tangent line at the initial position is a > 0 (that is, the derivative at that point is greater than 0), so updating w = w - a decreases w (in practice the derivative is scaled by a learning rate, discussed below). Repeating this update moves w toward the value that minimizes J(w). If J(w) contains multiple variables, we take the partial derivative with respect to each variable and update each variable in the same way.

The chain rule is simply the rule for differentiating a composite function: if y = f(u) and u = g(x), then

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

An example makes this clearer: if $y = \sin(x^2)$, then $\frac{dy}{dx} = \cos(x^2) \cdot 2x$.

The structure of a neural network

A neural network consists of three parts: the input layer on the far left, the hidden layers (in real applications there is usually far more than one), and the output layer on the far right. Adjacent layers are connected by lines, and each connection line carries a weight w. In addition, apart from the input layer, each neuron generally has a corresponding bias b.

[Figure: neural network structure diagram]

Except for the neurons in the input layer, each neuron has an input value z, obtained by a weighted sum of the previous layer's outputs, and an output value a, obtained by passing z through the Sigmoid function (also called the activation function), a nonlinear transformation. The formulas are

$z^l_j = \sum_i w^l_{ij}\, a^{l-1}_i + b^l_j, \qquad a^l_j = \sigma(z^l_j)$

Here l and j denote the jth neuron in the lth layer; the subscript ij denotes the connection line from the ith neuron to the jth neuron; w denotes a weight and b a bias. The symbols below follow the same conventions, so they will not be explained again.

[Animation: how the input and output values of each neuron are computed. Note that the animation omits the bias, which is added in actual use.]

The reason for using an activation function is that a linear model (which cannot handle linearly inseparable cases) has insufficient expressive power, so a Sigmoid function is usually applied here to introduce nonlinearity and produce the neuron's output value. As to why a linear model is not expressive enough, you can click here to view the discussion on Zhihu. The Sigmoid function is

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Its range is (0, 1), so for multi-class classification tasks each neuron in the output layer can represent the probability of belonging to that class. Of course there are other activation functions as well, with different uses, advantages, and disadvantages.
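To make these formulas concrete, here is a tiny runnable Python sketch of a single neuron; the numbers and variable names are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with three inputs: z = sum_i w_i * a_i + b, then a = sigmoid(z).
a_prev = np.array([0.5, 0.1, 0.9])  # outputs of the previous layer
w = np.array([0.2, -0.4, 0.7])      # weights on the incoming connection lines
b = 0.1                             # the neuron's bias

z = np.dot(w, a_prev) + b           # weighted-sum input value z
a = sigmoid(z)                      # output value a, always in (0, 1)
print(z, a)                         # 0.79 and roughly 0.688
```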
The process of BP algorithm execution (forward pass and backward update)

After the number of layers, the number of neurons per layer, and the learning rate η (mentioned below) are set by hand, the BP algorithm first randomly initializes the weight of every connection line and every bias. Then, for each input x and expected output y in the training set, it performs a forward pass to obtain a predicted value, and then performs a backward pass that uses the error between the true value and the predicted value to update the weight of every connection line and the bias of every neuron in the network. This process is repeated until a stopping condition is reached. The stopping condition is typically one of the following three: the updates to the weights fall below some threshold; the error rate on the training (or validation) data falls below some threshold; or a preset maximum number of iterations is reached.
For example, in handwritten digit recognition, a picture of a handwritten digit is stored as 28 × 28 = 784 pixels, each holding a grayscale value in the range [0, 255]. The input layer therefore has 784 neurons, and the output layer has 10 neurons representing the digits 0 to 9, each taking a value between 0 and 1 that represents the probability that the picture shows that digit. Each time a picture (that is, an instance) is fed in, the neural network performs a forward pass, computing the values of the neurons layer by layer up to the output layer, and predicts the handwritten digit as the one whose output neuron has the largest value. Then, based on the output neuron values, the error between the predicted value and the true value is computed, and backward propagation is used to update the weight of every connection line and the bias of every neuron in the network.

Feed-forward

The forward pass computes the output values of all neurons layer by layer: input layer => hidden layers => output layer.

Back-propagation

Because the values at the output layer will differ from the true values, we can use the mean squared error to measure the gap between the predicted value and the true value:

$E = \frac{1}{2}\sum_j (y_j - a_j)^2$

The goal of the backward pass is to make E as small as possible. The output value of each neuron is determined by the weights on that neuron's incoming connection lines and by its bias, so to minimize the error function we need to adjust the values of w and b:

$w \leftarrow w - \eta \frac{\partial E}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial E}{\partial b}$

That is, taking the partial derivatives of the objective E with respect to w and b gives the update amounts for w and b. Here η is the learning rate, usually 0.1 ~ 0.3, which can be understood as the size of the step taken along each gradient. Let us now derive the partial derivative with respect to w as an example. Note that the weight w_{hj} first affects the input value z_j of the jth output-layer neuron, and through it the output value a_j. By the chain rule,

$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{hj}}$

From the definition of the neuron input value, $z_j = \sum_h w_{hj}\, a_h + b_j$, so

$\frac{\partial z_j}{\partial w_{hj}} = a_h$

The derivative of the Sigmoid function has the following form, from which we can see that it is also very convenient to implement on a computer:

$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$

so

$\frac{\partial a_j}{\partial z_j} = a_j(1 - a_j), \qquad \frac{\partial E}{\partial a_j} = a_j - y_j$

Writing $\delta_j = (a_j - y_j)\, a_j (1 - a_j)$, the update amount of the weight w is

$\Delta w_{hj} = -\eta\, \delta_j\, a_h$

and similarly the update amount of b is

$\Delta b_j = -\eta\, \delta_j$

However, these two formulas can only update the weights of the connection lines between the output layer and the layer before it, and the biases of the output layer. The reason is that the δ value depends on the true value y, but we only know the true values at the output layer, not at each hidden layer, which makes it impossible to compute the δ values of the hidden layers directly. We therefore hope to use the δ values of layer l + 1 to compute the δ values of layer l, which can be done through a series of mathematical transformations. This is the origin of the name "back-propagation". The formula is

$\delta^l_i = a^l_i\,(1 - a^l_i) \sum_j w^{l+1}_{ij}\, \delta^{l+1}_j$

From the formula we can see that to compute the δ value of a layer we only need the weights and δ values of the next layer together with the neuron's own output value. By applying this formula repeatedly we can update all the weights and biases of the hidden layers.
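Before deriving this hidden-layer formula, here is a quick numeric sanity check of the output-layer update formulas; it is only a sketch, and all numbers are invented for illustration:

```python
import numpy as np

eta = 0.3                       # learning rate
a_h = np.array([0.6, 0.8])      # outputs a_h of the previous layer
w_hj = np.array([0.4, -0.2])    # weights into output neuron j
b_j = 0.05                      # bias of neuron j
y_j = 1.0                       # true value

z_j = np.dot(w_hj, a_h) + b_j         # z_j = sum_h w_hj * a_h + b_j
a_j = 1.0 / (1.0 + np.exp(-z_j))      # a_j = sigmoid(z_j)

delta_j = (a_j - y_j) * a_j * (1.0 - a_j)   # delta_j = (a_j - y_j) a_j (1 - a_j)

w_hj_new = w_hj - eta * delta_j * a_h   # Delta w_hj = -eta * delta_j * a_h
b_j_new = b_j - eta * delta_j           # Delta b_j  = -eta * delta_j
print(delta_j, w_hj_new, b_j_new)      # y_j > a_j here, so w and b increase
```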
Before the derivation, look at the following picture:

[Figure: the ith neuron in layer l and the neurons in layer l + 1]

First, note that the ith neuron in layer l is connected to every neuron in layer l + 1, so E can be regarded as a function of the input values z of all the neurons in layer l + 1. Since the δ of a neuron is exactly the partial derivative of E with respect to that neuron's input value z, we can expand δ into the following formula, where n is the number of neurons in layer l + 1:

$\delta^l_i = \frac{\partial E}{\partial z^l_i} = \sum_{j=1}^{n} \frac{\partial E}{\partial z^{l+1}_j} \cdot \frac{\partial z^{l+1}_j}{\partial a^l_i} \cdot \frac{\partial a^l_i}{\partial z^l_i} = \sum_{j=1}^{n} \delta^{l+1}_j\, w^{l+1}_{ij}\, a^l_i\,(1 - a^l_i)$

Simplifying gives the formula above. The derivation here only covers the key steps; if you want a more detailed derivation, you can click here to download a pdf document that I consulted while learning, whose derivation is very thorough. I also referred to the neural-network chapter of Machine Learning by Zhou Zhihua and to the book Neural Networks and Deep Learning.

Python source code analysis

The source code comes from Michael Nielsen's online deep learning tutorial, which is entirely in English; I have annotated it based on my own understanding and the theory above.

>> Click here to view the organized code and the digit-recognition example <<

The neural network implemented in Python is not long: it contains only a single Network class. First, let's look at the class's constructor.
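The listings below are sketches in the style of Nielsen's network.py, lightly adapted to Python 3, so details may differ from his original. The constructor takes a list of layer sizes and randomly initializes every weight and bias from a standard normal distribution:

```python
import random
import numpy as np

class Network:
    def __init__(self, sizes):
        """sizes, e.g. [784, 30, 10], lists the number of neurons per layer."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        # One bias column vector per non-input layer.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # weights[k] has shape (sizes[k+1], sizes[k]): row j holds the
        # weights on the connection lines into neuron j of the next layer.
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```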
Next, the feedforward method: it computes the output of the network layer by layer, feeding each layer's output a into the next layer as input.
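A sketch of the method, continuing the Network class above; sigmoid is a module-level helper:

```python
    def feedforward(self, a):
        """Return the network's output for the input column vector a."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)   # a^l = sigma(w a^{l-1} + b)
        return a

def sigmoid(z):
    """Module-level helper: the Sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))
```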
The source code uses Stochastic Gradient Descent (SGD), whose principle is similar to gradient descent; the difference is that in each iteration the stochastic algorithm uses only a subset of the samples in the data set to update the values of w and b. It is faster than full-batch gradient descent, but it is not guaranteed to settle exactly at a local minimum and may hover around it.
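A sketch of the SGD method, continuing the Network class: it shuffles the training data each epoch, slices it into mini-batches, and hands each batch to update_mini_batch:

```python
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train with mini-batch stochastic gradient descent.

        training_data is a list of (x, y) pairs; eta is the learning rate.
        """
        n = len(training_data)
        for j in range(epochs):
            # A new random split into mini-batches each epoch.
            random.shuffle(training_data)
            mini_batches = [training_data[k:k + mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {}: {} / {}".format(
                    j, self.evaluate(test_data), len(test_data)))
            else:
                print("Epoch {} complete".format(j))
```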
The update_mini_batch method updates the values of w and b according to the partial derivatives returned by the backprop method, averaged over the mini-batch.
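A sketch of the method:

```python
    def update_mini_batch(self, mini_batch, eta):
        """Apply one gradient-descent step using a single mini-batch."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Sum the per-sample gradients returned by backprop.
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # w <- w - (eta / m) * summed gradient, and likewise for b.
        m = len(mini_batch)
        self.weights = [w - (eta / m) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / m) * nb
                       for b, nb in zip(self.biases, nabla_b)]
```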
The following code is the core of the source: the implementation of the BP algorithm itself, covering both the forward pass and the backward pass. (The forward pass on its own also has a dedicated method in Network, the feedforward method shown above, which is used to check the accuracy of the trained network; see evaluate below.)
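A sketch of backprop, which returns the gradients layer by layer; note how the backward loop implements the δ recursion derived above (sigmoid_prime is another module-level helper):

```python
    def backprop(self, x, y):
        """Return (nabla_b, nabla_w), the gradients of E for one sample."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Forward pass: remember every z and every activation a.
        activation = x
        activations = [x]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # Output layer: delta = (a - y) * sigma'(z).
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Hidden layers, moving backwards: the delta recursion derived above.
        for l in range(2, self.num_layers):
            delta = np.dot(self.weights[-l + 1].transpose(), delta) \
                * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
        return (nabla_b, nabla_w)

def sigmoid_prime(z):
    """Derivative of the Sigmoid: sigma(z) * (1 - sigma(z))."""
    return sigmoid(z) * (1.0 - sigmoid(z))
```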
Next comes evaluate, which calls the feedforward method to compute the output-layer neuron values (that is, the predicted values) of the trained network, and then compares the predictions with the correct values to obtain the accuracy.
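A sketch of the method:

```python
    def evaluate(self, test_data):
        """Count how many test samples the network classifies correctly:
        the prediction is the index of the most activated output neuron."""
        results = [(np.argmax(self.feedforward(x)), y)
                   for (x, y) in test_data]
        return sum(int(pred == y) for (pred, y) in results)
```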
Finally, we can use this source code to train a neural network for handwritten digit recognition and output the evaluation results. The code is as follows:
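A sketch of the training script, assuming mnist_loader, the companion data-loading module from Nielsen's tutorial; the hyperparameters (30 epochs, mini-batch size 10, η = 3.0) follow his example:

```python
# mnist_loader is the companion module from Nielsen's tutorial; it returns
# the MNIST images as (784, 1) column vectors (one-hot labels for training).
import mnist_loader

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

# 784 input neurons, one hidden layer of 30 neurons, 10 output neurons.
net = Network([784, 30, 10])
# 30 epochs, mini-batches of 10 samples, learning rate eta = 3.0.
net.SGD(list(training_data), 30, 10, 3.0, test_data=list(test_data))
```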
After 30 epochs, the accuracy of the handwriting-recognition network is about 95%. Of course, different numbers of iterations, learning rates, and mini-batch sizes all affect the accuracy; how to tune these parameters is a craft of its own, which I will leave for a later post.

Summary

Advantages of neural networks:

- The network essentially implements a mapping from input to output, and it has been proven mathematically that it can realize any complex nonlinear mapping, which makes it particularly suitable for problems with complex internal mechanisms.
- The network can automatically extract "reasonable" rules by learning from a set of instances with correct answers; that is, it can learn on its own.
- The network has a certain ability to generalize to unseen inputs.

Disadvantages of neural networks:

- It is very sensitive to the initial weights and easily converges to a local minimum.
- It is prone to overfitting and overtraining.
- There is no principled procedure for choosing the number of hidden layers and neurons; sometimes it feels like guesswork.

Application areas: common ones include image classification, autonomous driving, and natural language processing.

TODO

In practice there are still many pitfalls in training a neural network (parameter tuning among them); I plan to write about them later.
References

[1] Zhou Zhihua, Machine Learning.
[2] Stanford University, Machine Learning online course.
[3] David E. Rumelhart, James L. McClelland (eds.), Parallel Distributed Processing (1986), Chapter 8: Learning Internal Representations by Error Propagation.
[4] How the backpropagation algorithm works.
[5] Backpropagation Algorithm.
[6] Chain Rule, a calculus course from the Chinese University of Science and Technology in Taiwan (YouTube video; you may need a VPN to access it). By the way, I recommend their math videos, which are very easy to follow.