Anyone who has studied neural networks knows the most common training method: the backpropagation (BP) algorithm. Through BP we can train the network iteratively until it approximates the function we want to fit arbitrarily closely, so that in the end the trained network performs well on both the training set and the test set!
So what exactly is the BP algorithm? Why does it let us move toward an optimum step by step (even if that optimum may be local rather than global, there are other techniques for escaping local optima)? Is there a mathematical principle behind it? Over the past few days I have organized my notes on this topic and written them up, first as a record for myself, and second to share with everyone, so that we can avoid misunderstandings and learn from each other.
For the details of the BP algorithm itself, please refer to this Zhihu column (it walks through the BP process step by step and will deepen your understanding). Here let's tackle a different question: why does the BP algorithm lead to better results step by step? First, let's look at how a neural network operates. Suppose we have the following simple network, as shown in the figure:
We define the symbols as follows:
Then we can get the following formula by forward propagation once:
If the loss function C is defined as
We hope the values predicted by the trained network are as close to the true values as possible. Setting the SGD method aside for now, the most brute-force approach is to ask that C be minimized for a given training example. Since only the w parameters in the expression for C are unknown, we can regard C as a function of all the w parameters, and the task becomes finding the minimum of this multivariate function. With that, we have successfully turned a neural network problem into a mathematical optimization problem.
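To make the forward pass and the cost concrete, here is a minimal sketch in plain Python. The original figures defining the notation are not reproduced here, so I assume a standard tiny 2-2-1 network with sigmoid activations and the quadratic cost C = ½(y − a)²; all weight values below are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    # logistic activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # hidden layer: h_j = sigmoid(sum_i W1[j][i] * x[i] + b1[j])
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # output neuron: a = sigmoid(sum_j W2[j] * h[j] + b2)
    return sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

def cost(a, y):
    # quadratic cost C = 1/2 * (y - a)^2 for a single training example
    return 0.5 * (y - a) ** 2

# hypothetical weights for a 2-2-1 network
W1 = [[0.1, 0.4], [0.8, 0.6]]
b1 = [0.0, 0.0]
W2 = [0.3, 0.9]
b2 = 0.0

a = forward([1.0, 0.5], W1, b1, W2, b2)
print(cost(a, 1.0))  # one forward propagation, then the cost
```

Viewed this way, with the input x and target y fixed, the printed cost is purely a function of the entries of W1, b1, W2, b2, which is exactly the multivariate function we want to minimize.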
So we have reduced the problem a neural network needs to solve to the optimization of a multivariate function. The question now is how to modify w so that C gets closer and closer to its minimum. The standard method is gradient descent (for why the direction opposite to the gradient is the direction of fastest descent, please see my next article; it is not the main topic here). This may still feel a bit abstract, so let's work through a very simple example.
Suppose our network is very simple, as shown below (the symbols are the same as above):
Then we can get:
where
Only the w parameters are unknown, so C can be regarded as a function of two variables, the two weights (the advantage of a two-variable function is that we can visualize it in three-dimensional coordinates, which makes it easy to understand~). The picture comes from the Internet:
Let's go through the algorithm process:
We start by randomly initializing the w parameters, which corresponds to point A on the graph.
Our goal is to reach the lowest point F, so we move in the opposite direction of the gradient. The formula is as follows:
The size of each step is determined by the learning rate mentioned above. Suppose the next step takes us to point B; iterating like this, if the surface has only one minimum, then after enough iterations we reach point F, and our problem is solved.
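The steps above can be sketched in a few lines of Python. As a stand-in for the plotted surface (which is not reproduced here), I use the bowl-shaped two-variable function C(w1, w2) = w1² + w2², whose minimum, the analogue of point F, is at the origin; the starting point and learning rate are arbitrary choices for illustration.

```python
def grad_C(w):
    # analytic gradient of the stand-in surface C(w1, w2) = w1^2 + w2^2
    return [2 * w[0], 2 * w[1]]

def gradient_descent(w, lr=0.1, steps=100):
    # repeatedly step opposite the gradient: w <- w - lr * grad_C(w)
    for _ in range(steps):
        g = grad_C(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w = gradient_descent([3.0, -4.0])  # "point A": an arbitrary start
print(w)  # both coordinates shrink toward 0, the minimum ("point F")
```

Each iteration multiplies both coordinates by (1 − 2·lr) = 0.8 here, so after 100 steps they are vanishingly small; with a too-large learning rate the same update would overshoot and diverge instead.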
Above we used a simple two-variable function as an example, so every step from the analysis to the final result could be visualized. When the network becomes complex, the principle of finding the optimum of a multivariate function is exactly the same! That covers the knowledge points of this article. Friends are welcome to point out mistakes and discuss~
When I was studying this, I already understood everything above, but I kept wondering: since we end up with a multivariate function of w, why not simply take the partial derivative with respect to each w and update it directly? Why did neural networks only become practical, and popular, after the BP algorithm appeared? In other words: why can't we take the partial derivatives directly, and why was BP necessary to make neural networks so applicable? Here are my thoughts and understanding (discussion welcome~).
1. Why can't we solve for the derivative directly?
In a neural network, because of the activation functions, the cost function we end up with is not a linear function of the w parameters. Take the simplest example:
Differentiating this function with respect to w and setting the result to zero gives an equation with no analytical (closed-form) solution, which explains why we cannot simply solve for the optimal w directly.
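To see why no closed form exists, here is a toy derivation of my own (the original formula image is not reproduced, so the notation below is an assumption): a single sigmoid neuron with quadratic cost over n training examples (x_i, y_i).

```latex
% Cost of a single sigmoid neuron over n training examples:
C(w) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\sigma(w x_i) - y_i\bigr)^2,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
% The first-order condition for a minimum:
\frac{dC}{dw} = \sum_{i=1}^{n}\bigl(\sigma(w x_i) - y_i\bigr)\,
\sigma'(w x_i)\, x_i = 0
```

The first-order condition is a transcendental equation in w: for n > 1 the sum of sigmoid terms cannot be inverted symbolically, so there is no closed-form solution and we must fall back on iterative methods such as gradient descent.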
2. Since we cannot solve for the derivative analytically, can we approximate it numerically? For example, we can use the finite-difference formula
With this formula we can approximately compute the derivative with respect to each parameter, and the smaller the interval, the closer the approximation. So why can't we do this? Why did we have to wait for the BP algorithm? Think about it...
A: Because of the amount of computation. Suppose our network has 1 million weights. To approximate the partial derivative with respect to each weight, we must compute a change in C, and each such change requires one complete forward propagation. So for each training example we need 1,000,001 forward propagations (the one extra is to compute the baseline C). The BP algorithm, by contrast, computes the partial derivatives of all parameters with a single backward propagation, for a total of just two propagations (one forward, one backward). That, I think, answers why we cannot use the approximate method: it is simply too slow, and the computational cost too high~ Each propagation involves heavy matrix computation when there are many parameters, which the machines of that era could not bear at all~ So it was only with the emergence of BP that neural networks became fast enough to apply.
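The counting argument above can be verified directly. This sketch (my own toy example: a single-neuron "network" with 3 weights, hypothetical inputs) computes the finite-difference gradient while counting forward propagations, showing that n weights cost n + 1 passes:

```python
import math

calls = 0  # counts how many forward propagations we perform

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def C(w, x, y):
    # one complete forward pass through a toy single-neuron "network",
    # then the quadratic cost for a single example
    global calls
    calls += 1
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (a - y) ** 2

def numeric_grad(w, x, y, eps=1e-6):
    # finite differences: perturb each weight in turn and re-run the network
    base = C(w, x, y)                 # 1 forward pass for the baseline
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        grad.append((C(w_plus, x, y) - base) / eps)  # +1 pass per weight
    return grad

w = [0.2, -0.5, 0.7]
g = numeric_grad(w, [1.0, 2.0, 3.0], 0.0)
print(calls)  # n + 1 = 4 forward passes for n = 3 weights
```

Scale the same loop up to a million weights and the cost of one gradient becomes a million forward passes, whereas backpropagation reuses the intermediate values of a single forward pass to get every partial derivative in one backward sweep.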
The above is just my personal understanding. Thanks to Tokugawa for his help! Friends are welcome to ask questions and exchange ideas~ Below are the materials and blogs I studied from: *Neural Networks and Deep Learning* (need the Chinese version? feel free to leave a message or email me); "Zero-based entry deep learning (1): Perceptron".