Standing on the Shoulders of Shannon and Boltzmann: The Art and Philosophy of Deep Learning

In the article "Fascinating Data and Shannon's Perspective", I described my "epiphany": things express themselves through the information displayed by random variables at different levels. Random variables at different levels carry different information, and their combinations shape the information expressed by the random variables one level up. What a random variable expresses depends on its conditional probability distribution. Many friends agreed with this, and colleagues pointed out that it sounds a lot like deep learning. If you truly grasp what that sentence means, you have in fact grasped the essence of deep learning.

Take recognizing a person. Body shape, skin color, face shape, eyes, nose bridge, the corners of the mouth: these are all random variables that characterize the person. In machine-learning terms they are called features, and they can be explicit or implicit. Different contours, blood types, temperaments, personalities, IQs, EQs... information at different levels jointly expresses this person. The same hierarchical expression can describe the universe as a system, or something as small as a raindrop or a microorganism.

The information entropy that Shannon defined on a probability distribution, H = −Σ p log p, describes the uncertainty of the random variables of the thing being observed: the larger the entropy, the greater the uncertainty; the smaller the entropy, the less. A human face, for example, can take many possible forms; the size and position of the eyes, the length of the eyelashes, and their relative placement each have their own range of possible values. As we observe the values of these features one by one, we gradually recognize the face. In other words, if something walks like a duck, quacks like a duck, and looks like a duck, we conclude it is a duck.

Observation comes up here because it is the only way to obtain information about things and thereby understand them. To know a thing you need as many of its characteristics, at as many levels and in as much detail, as possible: detailed enough to distinguish it from things extremely similar to it, so that even after exhausting the possible values of these random variables, he, she, or it is still different from everyone else. That is a tall order. Still, this information is recorded as data, which partly explains why big data is so popular. Big data companies love to dazzle people with "360-degree customer portraits". Think about what 360 degrees would actually mean: how much do your relatives and friends know about you, and do you know yourself from 360 degrees? Could you even sketch 3.6 degrees?

Observation has errors, and in many cases it cannot be done directly. Suppose you want the average length of the fish in a lake, but the distribution of fish lengths is unknown; are you really going to haul every fish out and measure it? Markov said this is easy: construct a chain along which probability distributions migrate. Starting from an initial distribution P0 and applying the transition probability P for n steps, the chain (provided it is well behaved, i.e. irreducible and aperiodic) converges to the steady-state distribution Pn. Don't follow? Gibbs sampling, the method that bears Gibbs's name, builds exactly such a chain.
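A minimal sketch of this convergence idea, in Python. The three states, the fish lengths attached to them, and the transition matrix are all invented for illustration (they are not from the article), and the loop shows only the "iterate the transition matrix until the distribution stops changing" view of a Markov chain, not a full Gibbs sampler.

```python
import numpy as np

# Toy 3-state chain: states stand for "short / medium / long" fish.
# Both the transition matrix and the lengths below are hypothetical.
P = np.array([[0.6, 0.3, 0.1],   # row-stochastic transition matrix
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
lengths = np.array([10.0, 25.0, 40.0])   # assumed lengths (cm) per state

p = np.array([1.0, 0.0, 0.0])            # P0: start entirely in "short"
for _ in range(50):
    p = p @ P                            # one transfer step: P_{n+1} = P_n · P

print("steady-state distribution:", np.round(p, 4))
print("estimated average length :", round(float(p @ lengths), 2), "cm")
```

Whatever P0 you start from, after enough steps p settles to the same steady state, which is exactly the "definitely converges" claim above for a well-behaved chain.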
However, the transition matrix here has to satisfy the detailed balance condition, that is, the probability flow between any two states must be equal in the two directions (π_i P_ij = π_j P_ji); otherwise the chain will not settle into the desired steady state (eyes on the blackboard, please). Remember the premise: "stability".

In a paper titled "Why does deep and cheap learning work so well?", physicists from Harvard and MIT gave a theoretical treatment of exactly this kind of hierarchical, probabilistic expression. In my opinion it is a superb summary, the kind that makes you admire the deep craft of physicists. Many computer scientists have been busy with fancy multi-layer network architectures, stochastic gradient optimizers, and regularization tricks that can seem inexplicable. Those are all "technique"; here the authors supply the "Tao", a theoretical basis underneath them.

The authors use a figure to summarize the three most typical problems in deep learning: unsupervised learning, classification under supervised learning, and prediction. (There appears to be a typo in the paper, though an AI might not notice it.) All three problems ultimately amount to using a neural network to approximate a probability distribution: think of a joint distribution p(x, y), or the conditional distribution of x given that y has occurred, or the other way around. Training is the process of finding this approximating probability distribution function.

How do the popular deep learning algorithms actually find these distributions? Roughly: we take the Shannon information entropy contained in the observed (training) data as the most the system can possibly tell us, and then make the model account for as much of it as it can, maximizing the likelihood, which is the same as minimizing the part that remains unexplained; the maximizing and minimizing are done numerically. There are plenty of assumptions along the way, such as convexity and Lipschitz continuity (which, in a loose sense, can also be read as a stationarity assumption), and plenty of machinery, such as Lagrange multipliers and stochastic gradient descent, all of it rather pleasing calculus. Constrain this Shannon-entropy view appropriately and out drops the familiar least squares method, the one you used to fit a straight line in your college physics labs; a small sketch of that connection follows at the end of this section.

The basic assumption behind finding these probability distribution functions through training is that the system is in a relatively stable state. For an open system that is evolving rapidly, probabilistic methods should not be expected to work well. Take machine translation: AI translation handles descriptions of relatively stable things well, but for newly minted internet slang, or niche new things, say "Empresses in the Palace" from a few years back, statistical translation may not be so handy. Zhou Hongyi, summing up AI's performance in live broadcasts at the 360 Marketing Festival, put it as "Computers define all cone-shaped faces as beauties." Another example is the Boltzmann machine: its energy-based (Hamiltonian, free-energy) formulation carries an implicit assumption that the system is in relative equilibrium (the Boltzmann distribution is the energy distribution law of gas molecules at equilibrium). For unstable systems we would have to turn to Prigogine. So while we cheer for AI's ever-improving "intelligence", we should stay clear-headed about the situations in which it may not apply.
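Here is the least-squares connection mentioned above, as a minimal Python sketch. The data, the learning rate, and the iteration count are illustrative assumptions; the point is only that fitting a straight line by ordinary least squares and fitting it by maximizing a Gaussian likelihood (minimizing the negative log-likelihood with gradient descent) land on the same answer.

```python
import numpy as np

# Toy data: a noisy straight line, the kind fitted in a college physics lab.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# 1) Ordinary least squares via the normal equations.
A = np.vstack([x, np.ones_like(x)]).T
slope_ls, intercept_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# 2) Maximum likelihood under a Gaussian-noise assumption: the negative
#    log-likelihood is proportional to the sum of squared residuals, so
#    minimizing it by gradient descent should recover the same line.
w, b, lr = 0.0, 0.0, 0.02
for _ in range(5000):
    resid = (w * x + b) - y          # prediction error on every point
    w -= lr * (resid * x).mean()     # d(NLL)/dw, up to a constant factor
    b -= lr * resid.mean()           # d(NLL)/db, up to a constant factor

print(f"least squares      : slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
print(f"maximum likelihood : slope={w:.3f}, intercept={b:.3f}")
```

Swap the Gaussian noise model for another distribution and the same recipe yields a different loss; that is the sense in which "appropriately constraining" the entropy-and-likelihood view recovers least squares as a special case.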
Please believe that there is no universal truth, except for this sentence itself. The only thing I cannot doubt is the fact that I am doubting; everything else is open to doubt. I will also offer a piece of advice: no matter how powerful AI becomes, be cautious about trusting predictions based on statistics. This farmer's saying makes the point: all I want to know is where I am going to die, so that I never go there.

Author: Wang Qingfa, data expert, member of the Chief Data Officer Alliance expert group