Standing on the Shoulders of Shannon and Boltzmann: The Art and Philosophy of Deep Learning

In the article "Fascinating Data and Shannon's Perspective", I described my "epiphany": things express themselves through the information displayed by random variables at different levels. Random variables at different levels carry different information, and their combinations shape the information expressed by the random variables one level up. What a random variable expresses depends on its conditional probability distribution. Many friends agreed with this, and colleagues pointed out that it sounds a lot like deep learning. If you truly grasp what that sentence means, you have in fact grasped the essence of deep learning.

Take recognizing a person. Body shape, skin color, face shape, eyes, nose bridge, the corners of the mouth: these are all random variables that characterize the person. In machine-learning terms they are called features, and they can be explicit or implicit. Different contours, blood types, temperaments, personalities, IQs, EQs... information at different levels jointly expresses this person. The same hierarchical expression can describe the universe as a system, or something as small as a raindrop or a microorganism.

The information entropy that Shannon defined on a probability distribution, H = −Σ p log p, describes the uncertainty of the random variables of the thing being observed: the larger the entropy, the greater the uncertainty; the smaller the entropy, the less. A human face, for example, can take many possible forms; the size and position of the eyes, the length of the eyelashes, and their relative placement each have their own range of possible values. As we observe the values of these features one by one, we gradually recognize the face. In other words, if something walks like a duck, quacks like a duck, and looks like a duck, we conclude it is a duck.

Observation comes up here because it is the only way to obtain information about things and thereby understand them. To know a thing you need as many of its characteristics, at as many levels and in as much detail, as possible: detailed enough to distinguish it from things extremely similar to it, so that even after exhausting the possible values of these random variables, he, she, or it is still different from everyone else. That is a tall order. Still, this information is recorded as data, which partly explains why big data is so popular. Big data companies love to dazzle people with "360-degree customer portraits". Think about what 360 degrees would actually mean: how much do your relatives and friends know about you, and do you know yourself from 360 degrees? Could you even sketch 3.6 degrees?

Observation has errors, and in many cases it cannot be done directly. Suppose you want the average length of the fish in a lake, but the distribution of fish lengths is unknown; are you really going to haul every fish out and measure it? Markov said this is easy: construct a chain along which probability distributions migrate. Starting from an initial distribution P0 and applying the transition probability P for n steps, the chain (provided it is well behaved, i.e. irreducible and aperiodic) converges to the steady-state distribution Pn. Don't follow? Gibbs sampling, the method that bears Gibbs's name, builds exactly such a chain.
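A minimal sketch of this convergence idea, in Python. The three states, the fish lengths attached to them, and the transition matrix are all invented for illustration (they are not from the article), and the loop shows only the "iterate the transition matrix until the distribution stops changing" view of a Markov chain, not a full Gibbs sampler.

```python
import numpy as np

# Toy 3-state chain: states stand for "short / medium / long" fish.
# Both the transition matrix and the lengths below are hypothetical.
P = np.array([[0.6, 0.3, 0.1],   # row-stochastic transition matrix
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
lengths = np.array([10.0, 25.0, 40.0])   # assumed lengths (cm) per state

p = np.array([1.0, 0.0, 0.0])            # P0: start entirely in "short"
for _ in range(50):
    p = p @ P                            # one transfer step: P_{n+1} = P_n · P

print("steady-state distribution:", np.round(p, 4))
print("estimated average length :", round(float(p @ lengths), 2), "cm")
```

Whatever P0 you start from, after enough steps p settles to the same steady state, which is exactly the "definitely converges" claim above for a well-behaved chain.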
However, the transition matrix here has to satisfy the detailed balance condition, that is, the probability flow between any two states must be equal in the two directions (π_i P_ij = π_j P_ji); otherwise the chain will not settle into the desired steady state (eyes on the blackboard, please). Remember the premise: "stability".

In a paper titled "Why does deep and cheap learning work so well?", physicists from Harvard and MIT gave a theoretical treatment of exactly this kind of hierarchical, probabilistic expression. In my opinion it is a superb summary, the kind that makes you admire the deep craft of physicists. Many computer scientists have been busy with fancy multi-layer network architectures, stochastic gradient optimizers, and regularization tricks that can seem inexplicable. Those are all "technique"; here the authors supply the "Tao", a theoretical basis underneath them.

The authors use a figure to summarize the three most typical problems in deep learning: unsupervised learning, classification under supervised learning, and prediction. (There appears to be a typo in the paper, though an AI might not notice it.) All three problems ultimately amount to using a neural network to approximate a probability distribution: think of a joint distribution p(x, y), or the conditional distribution of x given that y has occurred, or the other way around. Training is the process of finding this approximating probability distribution function.

How do the popular deep learning algorithms actually find these distributions? Roughly: we take the Shannon information entropy contained in the observed (training) data as the most the system can possibly tell us, and then make the model account for as much of it as it can, maximizing the likelihood, which is the same as minimizing the part that remains unexplained; the maximizing and minimizing are done numerically. There are plenty of assumptions along the way, such as convexity and Lipschitz continuity (which, in a loose sense, can also be read as a stationarity assumption), and plenty of machinery, such as Lagrange multipliers and stochastic gradient descent, all of it rather pleasing calculus. Constrain this Shannon-entropy view appropriately and out drops the familiar least squares method, the one you used to fit a straight line in your college physics labs; a small sketch of that connection follows at the end of this section.

The basic assumption behind finding these probability distribution functions through training is that the system is in a relatively stable state. For an open system that is evolving rapidly, probabilistic methods should not be expected to work well. Take machine translation: AI translation handles descriptions of relatively stable things well, but for newly minted internet slang, or niche new things, say "Empresses in the Palace" from a few years back, statistical translation may not be so handy. Zhou Hongyi, summing up AI's performance in live broadcasts at the 360 Marketing Festival, put it as "Computers define all cone-shaped faces as beauties." Another example is the Boltzmann machine: its energy-based (Hamiltonian, free-energy) formulation carries an implicit assumption that the system is in relative equilibrium (the Boltzmann distribution is the energy distribution law of gas molecules at equilibrium). For unstable systems we would have to turn to Prigogine. So while we cheer for AI's ever-improving "intelligence", we should stay clear-headed about the situations in which it may not apply.
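Here is the least-squares connection mentioned above, as a minimal Python sketch. The data, the learning rate, and the iteration count are illustrative assumptions; the point is only that fitting a straight line by ordinary least squares and fitting it by maximizing a Gaussian likelihood (minimizing the negative log-likelihood with gradient descent) land on the same answer.

```python
import numpy as np

# Toy data: a noisy straight line, the kind fitted in a college physics lab.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# 1) Ordinary least squares via the normal equations.
A = np.vstack([x, np.ones_like(x)]).T
slope_ls, intercept_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# 2) Maximum likelihood under a Gaussian-noise assumption: the negative
#    log-likelihood is proportional to the sum of squared residuals, so
#    minimizing it by gradient descent should recover the same line.
w, b, lr = 0.0, 0.0, 0.02
for _ in range(5000):
    resid = (w * x + b) - y          # prediction error on every point
    w -= lr * (resid * x).mean()     # d(NLL)/dw, up to a constant factor
    b -= lr * resid.mean()           # d(NLL)/db, up to a constant factor

print(f"least squares      : slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
print(f"maximum likelihood : slope={w:.3f}, intercept={b:.3f}")
```

Swap the Gaussian noise model for another distribution and the same recipe yields a different loss; that is the sense in which "appropriately constraining" the entropy-and-likelihood view recovers least squares as a special case.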
Please believe that there is no universal truth, except for this sentence itself. The only thing I cannot doubt is the fact that I am doubting; everything else is open to doubt. I will also offer a piece of advice: no matter how powerful AI becomes, be cautious about trusting predictions based on statistics. This farmer's saying makes the point: all I want to know is where I am going to die, so that I never go there.

Author: Wang Qingfa, data expert, member of the Chief Data Officer Alliance expert group