AI in Plain Language: Is it really that difficult to understand deep learning? Junior high school math in just 10 minutes

Today, with AI playing such an important role in industry, deep learning, as one of its most important research branches, appears in almost every popular AI application area: semantic understanding, image recognition, speech recognition, natural language processing, and so on. Some people even equate current artificial intelligence with deep learning. In this era of AI, an ambitious programmer, student, or enthusiast who does not understand this hot topic can seem out of touch with the times.

However, deep learning seems to require a lot of mathematics, including calculus, linear algebra, probability theory, and mathematical statistics, which makes many ambitious young people hesitate at the threshold. So the question is: do you really need all of that to understand deep learning? I won't keep you in suspense; the title has already given the answer.

Some time ago, the editor browsed various community forums and found a reply on deep learning that is very well suited to beginners. It uses humorous, plain language and everyday examples to walk through deep learning in a simple, easy-to-understand way. After talking with Mr. Yang Anguo, who works in artificial intelligence at Siemens, I obtained permission to edit the content and have reorganized and revised it to make it even clearer. I hope everyone comes away understanding deep learning.

There is a lot of information on deep learning on the Internet, but most of it is not suitable for beginners. Teacher Yang summarized several reasons:

1. Deep learning does require a certain mathematical foundation. If it is not explained simply and clearly, some readers will be scared off by the difficulty and give up prematurely.

2. Books and articles, whether by Chinese or American authors, are generally pitched at a fairly difficult level.

The mathematical foundation required for deep learning is not as deep as you might think. You only need to know what a derivative is and a few related function concepts. Haven't studied advanced mathematics? Even better: this article is meant to be understandable even to liberal arts students; junior high school mathematics is enough.

In fact, there is no need to be afraid of difficulties. I admire Li Shufu's spirit. In a TV interview, Li Shufu said: Who says Chinese people can't make cars? What's so difficult about making cars? It's just four wheels and two rows of sofas. Of course, his conclusion is biased, but his spirit is commendable.

What is a derivative? It's nothing more than the rate of change.

For example: Wang Xiaoer sold 100 pigs this year, 90 last year, and 80 the year before... What is the rate of change, i.e. the growth rate? An increase of 10 pigs per year. That simple. Note that there is a time variable here: the year. The growth rate of Wang Xiaoer's pig sales is 10 per year, that is, the derivative is 10.

The function is y = f(x) = 10x + 30. Here we assume Wang Xiaoer sold 30 pigs at the start (x = 0) and the number increased by 10 each year thereafter; x represents time (in years) and y the number of pigs.

Of course, that is the case when the growth rate is fixed. In real life the amount of change is often not fixed, that is, the growth rate is not constant. For example, the function might instead be y = f(x) = 5x² + 30, where x and y still represent time and the number of pigs, but the growth rate itself now changes over time. How to calculate such a growth rate will be discussed later; or you can simply memorize a few derivative formulas.
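To make the idea concrete, here is a small Python sketch (mine, not from the original article) that estimates the growth rate of y = 5x² + 30 numerically and compares it with the analytic derivative 10x:

```python
def f(x):
    # Pig sales as a function of time: y = 5x^2 + 30
    return 5 * x ** 2 + 30

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x+h) - f(x-h)) / (2h) approximates the rate of change at x
    return (f(x + h) - f(x - h)) / (2 * h)

# The analytic derivative of 5x^2 + 30 is 10x; the estimate should match it.
for x in [1.0, 2.0, 3.0]:
    print(x, round(numerical_derivative(f, x), 3), 10 * x)
```

The smaller the step h, the closer the estimate gets to the true rate of change, which is exactly what a derivative is.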

Deep learning involves another important mathematical concept: the partial derivative. How should you understand it? It is not a "biased" derivative, and it has nothing to do with stubbornness. Let's go back to Wang Xiaoer selling pigs. Just now the only variable was time x (in years), but pig sales depend on more than time. As the business grew, Wang Xiaoer not only expanded the pig farm but also hired many employees to raise pigs together. So the equation changes again: y = f(x₁, x₂, x₃) = 5x₁² + 8x₂ + 35x₃ + 30

Here x₂ represents the farm area, x₃ the number of employees, and x₁ is, of course, time.

As we said above, a derivative is just a rate of change; so what is a partial derivative? A partial derivative is simply the rate of change with respect to one variable when there are several. In the formula above, taking the partial derivative with respect to x₃ asks how much the employees contribute to the growth in pigs, i.e. how many more pigs are sold per additional employee. Here it equals 35: for every additional employee, 35 more pigs are sold. When computing a partial derivative, the other variables are treated as constants; this is very important. A constant has a rate of change of 0, so its derivative is 0, and we only need to differentiate the 35x₃ term, which gives 35. Taking the partial derivative with respect to x₂ works the same way.
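The hold-everything-else-constant idea can be checked numerically. This sketch (my own illustration; the sample point is arbitrary) perturbs one variable of y = 5x₁² + 8x₂ + 35x₃ + 30 at a time:

```python
def y(x1, x2, x3):
    # Pig sales: y = 5*x1^2 + 8*x2 + 35*x3 + 30 (time, area, employees)
    return 5 * x1 ** 2 + 8 * x2 + 35 * x3 + 30

def partial(f, args, i, h=1e-6):
    # Perturb only argument i, holding the other variables constant.
    plus = list(args);  plus[i] += h
    minus = list(args); minus[i] -= h
    return (f(*plus) - f(*minus)) / (2 * h)

point = (2.0, 10.0, 4.0)          # an arbitrary sample point
print(round(partial(y, point, 2)))  # dy/dx3 = 35 everywhere
print(round(partial(y, point, 1)))  # dy/dx2 = 8 everywhere
print(round(partial(y, point, 0)))  # dy/dx1 = 10*x1 = 20 at x1 = 2
```

Note that the x₃ and x₂ partials are constant, while the x₁ partial depends on where you evaluate it, exactly as the formulas 35, 8, and 10x₁ predict.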

We use the symbol ∂ for partial derivatives: for example, ∂y/∂x₃ means the partial derivative of y with respect to x₃.

After all this rambling, what does any of it have to do with deep learning? Quite a lot. Deep learning uses neural networks to solve linearly inseparable problems. We will come back to this later, and you can also search for related articles online. Here I will mainly talk about the relationship between the mathematics and deep learning. Let me show you a few pictures:

Figure 1. So-called deep learning is a neural network with many hidden layers.

Figure 2. How to find the partial derivatives when there is only one output.

Figure 3. How to find partial derivatives when there are multiple outputs.

The last two pictures come from a book on deep learning by a Japanese author. I think it is well written, so I borrowed the pictures. The so-called input layer, output layer, and middle layer in those figures are simply the input layer, output layer, and hidden layer. Don't be scared by the pictures; it is actually very simple. Let's take another example: courtship. We can roughly divide a relationship into three stages:

1. The first-love period. This corresponds to the input layer of deep learning. Many factors attract you to the other person: height, figure, face, education, personality, and so on. These are the input-layer parameters, and each person weights them differently.

2. The passionate-love period. Let it correspond to the hidden layer. During this period both parties make all kinds of adjustments, down to daily habits.

3. The stable period. This corresponds to the output layer: whether you are suited to each other depends on how well you have adjusted to each other. As everyone knows, this running-in is crucial. And how does it happen? It is a process of continuous learning, training, and correction! For example, if your girlfriend likes strawberry cake and you bought blueberry, her feedback is negative; next time you should buy strawberry, not blueberry.

After reading this, some guys may start adjusting parameters for their girlfriends. That worries me a little, so let me add something. Courtship, like deep learning, must guard against both underfitting and overfitting. Underfitting, in deep learning, means insufficient training and insufficient data, just like not having enough courtship experience. To reach a good fit, sending flowers is of course the baseline, and you also need to improve in other areas, such as your sense of humor. Since the focus of this article is not courtship, I won't go into detail. One thing worth stressing: underfitting is certainly bad, but overfitting is even worse. If you overfit, for one thing she will think you have the potential of Edison Chen. More importantly, everyone's situation is different. Just as in deep learning, the model works well on the training set but fails on the test set! In courtship terms, she will think you are heavily shaped by your ex (the training set), which is taboo! Give her that impression and you will regret it later. Remember this!

Deep learning is likewise a process of continuous adjustment. At the start, standard parameters are defined (empirical values, just like sending flowers on Valentine's Day and birthdays), and they are then continually revised to obtain the weights between the nodes in Figure 1. Why adjust in this way? Imagine deep learning as a child you are teaching to recognize pictures. First you show him pictures and tell him the correct answers; you need many pictures to teach and train him continuously. This training process is essentially the process of solving for the weights of a neural network. Later, at test time, you only need to show him a picture and he will tell you what is in it.
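The show-the-answer-then-correct loop can be sketched in a few lines of Python. This is a toy of my own, not the article's code: a single "neuron" y = w·x learns its weight from labeled examples (the data and learning rate here are made up, and the true rule is y = 3x):

```python
# Training data: inputs paired with their correct answers (true rule: y = 3x).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0     # default starting weight (the "empirical value" before training)
lr = 0.05   # learning coefficient (learning rate)

for epoch in range(200):
    for x, target in data:
        pred = w * x
        error = pred - target
        # Partial derivative of the squared error (error^2) with respect to w is 2*error*x;
        # move w a small step against that rate of change.
        w -= lr * 2 * error * x

print(round(w, 3))  # close to 3.0: the weight has been "solved" from examples
```

After training, the model answers correctly on inputs it never saw (e.g. w * 5 ≈ 15), which is exactly the training-then-testing story told above.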

So the training set is the pictures-with-correct-answers shown to the child. For deep learning, the training set is used to solve for the weights of the neural network and finally form a model, while the test set is used to verify the model's accuracy.

For a trained model, as shown in the figures below, the weights (w1, w2, ...) are all known.

Figure 4

Figure 5

We know that calculating from left to right, as above, is easy. But as we said, the training set gives us the pictures and the expected correct answers. How do we work backward to calculate w1, w2, and so on?

After going around in circles for so long, we finally get to the point. The current situation is:

1. We assume a neural network has already been defined: how many layers, how many nodes per layer, and default weights and activation functions. Once the input (an image) is fixed, the output can only be changed by adjusting the parameters. How do we adjust them, and how does the running-in happen? We just said each parameter has a default value. Add a small amount ∆ to a parameter and see what happens to the result. If the gap (the error) grows when the parameter increases, we should move the parameter the other way, because our goal is to make the gap smaller; and vice versa. So, to tune the parameters as well as possible, we need to know the rate of change of the error with respect to each parameter, which means finding the partial derivative of the error with respect to that parameter.

2. Two points here. First, the activation function: its main role is to give the whole network nonlinear characteristics. As mentioned earlier, in many cases linear functions cannot classify inputs properly (recognition is largely classification), so the network must be able to learn a nonlinear function. That is what the activation function is for: because it is nonlinear, the whole network becomes nonlinear. In addition, the activation function keeps each node's output within a controllable range, which makes computation easier.

If that explanation still isn't plain enough, the courtship analogy works again: girls don't like days as plain as water, because that is linear. Life needs some romance, and the activation function is like the little romances and surprises in life. At each stage of the relationship you need to "activate" it from time to time with small surprises. For example, most girls can't walk past cute little cups and porcelain; give her a special one on her birthday and she will be moved to tears. I mentioned earlier that men should be humorous, to make her laugh, and at the right moments, make her cry with joy. A few rounds of laughing and crying and she won't be able to leave you, because your nonlinear characteristics are too strong.

Of course, too much is as bad as too little: more surprises are not always better, but without any, life is just plain water. It is like adding an activation function to a layer: not every layer necessarily needs one, but a network with none at all will not do.

The key is how to find the partial derivatives. Figures 2 and 3 show the derivation for the two cases. It is actually very simple: you find the partial derivatives step by step from right to left. Finding the partial derivative between adjacent layers is easy: because the connection is linear, the partial derivative is just the weight itself, much like taking the partial derivative with respect to x₃ earlier. Then you multiply the partial derivatives along the path together.
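The right-to-left multiplication (the chain rule) can be sketched on a toy two-step chain. This is my own illustration with made-up numbers, deliberately kept linear and activation-free so the per-layer partials really are just the parameters:

```python
# A toy two-step chain: h = w1 * x (hidden node), y = w2 * h (output node).
x, w1, w2 = 2.0, 3.0, 4.0
h = w1 * x
y = w2 * h

# Adjacent-layer partial derivatives: in the linear case each is just a parameter or input.
dy_dh = w2      # dy/dh: how y changes per unit of h
dh_dw1 = x      # dh/dw1: how h changes per unit of w1

# Chain rule: multiply the partial derivatives along the path from right to left.
dy_dw1 = dy_dh * dh_dw1
print(dy_dw1)   # 4 * 2 = 8: increasing w1 by 1 increases y by 8
```

This multiply-along-the-path step, repeated layer by layer from the output back to the input, is exactly what Figures 2 and 3 depict.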

Two points again. First, the activation function. There is nothing mysterious about it: it simply maps each node's output into the range 0 to 1 so it is easier to work with, a one-to-one mapping applied on top of the result. Because of the activation function, it must also be accounted for when finding the partial derivatives. The activation function is usually the sigmoid, f(x) = 1/(1 + e⁻ˣ), though ReLU can also be used. Differentiating the sigmoid is actually very simple:

Derivative: f'(x) = f(x)·[1 − f(x)]

You can look up the derivation in an advanced mathematics textbook if you have time; if not, just memorize it. As for ReLU, it is even simpler: y = 0 when x < 0, and y = x otherwise. You can also define your own variant, for example y = 0.01x when x < 0 (the so-called leaky ReLU), which is also fine.
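The activation functions above are only a few lines each. A sketch of my own (standard definitions, nothing from the article's figures):

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # The formula quoted above: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    # y = 0 when x < 0, otherwise y = x
    return x if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    # A self-defined variant: y = slope * x for negative inputs
    return x if x > 0 else slope * x

print(sigmoid(0))             # 0.5
print(sigmoid_derivative(0))  # 0.25, the sigmoid's maximum slope
print(relu(-3.0), relu(3.0))  # 0.0 3.0
```

Note that the sigmoid's derivative can be computed from its own output, which is one reason it was historically convenient for backpropagation.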

The other point is the learning coefficient (learning rate). Why is it needed? We just talked about the increment ∆: how big should each adjustment be? Should it simply equal the partial derivative (the rate of change)? Experience says no: we multiply it by a small factor, and that factor is the learning coefficient. Moreover, this coefficient can change as training progresses.
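Putting the ∆-probing and the learning coefficient together gives the whole adjustment loop in miniature. This is a toy of my own with a made-up one-parameter error function, not the article's network:

```python
def error(w):
    # A made-up error (the "gap") for a single parameter w; the best value is w = 4.
    return (w - 4.0) ** 2

w = 10.0      # current parameter value (a default starting point)
delta = 1e-4  # the small increment delta discussed above
lr = 0.1      # learning coefficient (learning rate)

for step in range(100):
    # Nudge w by delta to estimate the rate of change of the error:
    rate = (error(w + delta) - error(w)) / delta
    # Move w against that rate of change, scaled down by the learning coefficient:
    w -= lr * rate

print(round(w, 3))  # close to the best value 4.0
```

Without the lr factor the steps would be far too large and w would overshoot back and forth; scaling by a small coefficient is what makes the gap shrink steadily.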

Of course, there is other very important basic knowledge, such as SGD (stochastic gradient descent), mini-batches, and epochs (how the training set is sampled). Due to space limitations, I will cover these later; you can also simply refer to Li Hongyi's material. The description above is mainly about how to adjust parameters, which is the beginner stage. As mentioned, before any tuning there must be a default network model and default parameters; defining the initial model and parameters requires deeper understanding. For ordinary engineering work, however, tuning parameters on a default network is enough, which amounts to using an algorithm; scholars and scientists invent the algorithms, which is very hard. Salute to them!

Finally, Teacher Yang recommended an excellent resource: "Understanding Deep Learning in 1 Day", a 300-page slide deck by Professor Li Hongyi (Hung-yi Lee) from Taiwan. It is no exaggeration to say it is one of the most systematic and accessible introductions to deep learning.

Here is the SlideShare link:

http://www.slideshare.net/tw_dsconf/ss-62245351?qid=108adce3-2c3d-4758-a830-95d0a57e46bc&v=&b=&from_search=3

Students who don’t have a VPN can download it from Teacher Yang’s network disk:

http://pan.baidu.com/s/1nv54p9R Password: 3mty

This article is based on Mr. Yang's answer on Zhihu:

https://www.zhihu.com/question/26006703/answer/129209540

