Artificial intelligence is booming, and its principles can seem hard to grasp. In fact, its core is still mathematics, and the mathematics is less complicated than you might think. If you have ever climbed a mountain (better still, gotten lost on one), you can understand the strategy underneath it all.

By Jordan Ellenberg. Translated by Hu Xiaorui and Zhong Yi.

My friend Meredith Broussard, a professor at New York University who studies machine learning and its societal implications, was recently asked to explain to a national television audience, in about two minutes, what artificial intelligence is and how it works. She told the host who interviewed her that artificial intelligence is not a killer robot, nor a humanoid machine whose intelligence dwarfs ours but which has no emotions. "We just need to remember that its basic principle is mathematics," she said. "There is nothing to be afraid of!" The host's pained expression suggested he would rather have been talking about killer robots. But Meredith hit the nail on the head. Since I don't have to stick to a two-minute limit, let me take over the task and explain the math behind machine learning, because this "big idea" is simpler than you think.

Machine learning is like climbing a mountain

Imagine you are not a machine but a mountain climber trying to reach the top of a mountain. You have no map, there are only trees and bushes around you, and there is no vantage point from which to see the wider landscape. So how do you get to the top? One strategy is to assess the slope of the ground beneath your feet. Going north, the ground might rise slightly; going south, it might fall slightly. Turning northeast, you find a steeper uphill slope. You walk around in a little circle, surveying all the directions you could go, find the one with the steepest uphill slope, and take a few steps that way. Then you make another circle, again pick the steepest uphill direction among all your options, and so on.

Now you know how machine learning works! All right, there is a little more to it than that, but this idea, called gradient descent, is at the heart of machine learning. It is essentially trial and error: you try out a bunch of possible moves and pick the one that does the most to get you out of your predicament. The "gradient" with respect to a direction is a mathematical quantity: how much your elevation changes when you take a small step in that direction, that is, the slope of the ground along the way you are walking. Gradient descent is an algorithm, a rule expressed in mathematical language that tells you exactly what to do in every situation you might encounter. The rule is: consider all the directions you could go, find the one with the largest gradient, and take a few steps in that direction; repeat.

Plot your route to the top on a topographic map and it should look something like Figure 1.

Figure 1

Here is another nice piece of geometry: when you find your way by gradient descent, your path on the map is always perpendicular to the contour lines.
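To make the rule concrete, here is a minimal sketch of it in code. The terrain function, the step size, and the starting point are all invented for illustration; they are not from the article, just a stand-in landscape for the climber to explore.

```python
import math

def elevation(x, y):
    # A made-up, smooth "mountain" whose peak sits near (3, 2).
    return 5.0 - ((x - 3.0) ** 2 + (y - 2.0) ** 2) / 4.0

def climb(x, y, step=0.1, rounds=200):
    """Walk uphill: each round, survey a circle of directions and take
    the small step that raises the elevation the most."""
    for _ in range(rounds):
        best_gain, best_dir = 0.0, None
        for k in range(36):                      # survey 36 compass directions
            angle = 2 * math.pi * k / 36
            dx, dy = step * math.cos(angle), step * math.sin(angle)
            gain = elevation(x + dx, y + dy) - elevation(x, y)
            if gain > best_gain:
                best_gain, best_dir = gain, (dx, dy)
        if best_dir is None:                     # no direction goes uphill: stop
            break
        x, y = x + best_dir[0], y + best_dir[1]
    return x, y

print(climb(0.0, 0.0))   # drifts toward the peak near (3, 2)
```

Starting from (0, 0), the climber wanders uphill and stops only when no small step gains any elevation, which is exactly the stopping question the article returns to below.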
But what does this have to do with machine learning? Suppose I am not a mountain climber but a computer trying to learn something, like AlphaGo or GPT-3 (the AI language model that generates long, plausible, disturbingly human-sounding English text). To start, though, let's say I am a computer trying to learn what a cat is. How do I do that?

The answer: take an approach a bit like an infant's. Babies live in a world where adults often point at something in their field of view and say "cat." You can train a computer the same way: feed it 1,000 pictures of cats in different poses, lighting, and moods, and tell it, "All of these are cats." In fact, if you really want the method to work, you also need to feed it 1,000 pictures that are not cats and tell the computer which are which. The machine's task is to develop a strategy that lets it decide, on its own, what is a cat and what is not. It wanders among all the possible strategies, trying to find the best one, the one that identifies cats most accurately. It is a kind of mountain climber too, so it can use gradient descent to decide where to go next: start from some strategy and follow the gradient-descent rule. Think about the small changes you could make to your current strategy, find the one with the largest gradient, make that change; repeat.

Greed is a pretty good thing

This sounds reasonable until you realize you don't quite know what it means. What, for instance, is a strategy? It has to be something a computer can execute, which means it has to be expressed in mathematical language. To a computer, a picture is just a long string of numbers. If the picture is a 600 × 600 grid of pixels, each pixel has a brightness somewhere between 0 (pure black) and 1 (pure white). Knowing those 360,000 (600 × 600) numbers, the computer knows the picture, or at least what it looks like in black and white. A strategy is a way of turning the 360,000 numbers fed into the computer into "cat" or "not cat" (or "1" or "0," in the computer's language). In mathematical terms, a strategy is a function. In fact, to be closer to psychological reality, the output of a strategy might be any number between 0 and 1, representing the uncertainty the machine may want to express when the input is a blurry photo of a lynx or a picture of a Garfield pillow. An output of 0.8 should be read as "I am fairly sure this is a cat, but I have some doubts."

For example, your strategy might be the function that says, "Output the average of the 360,000 numbers you are given." If the image is completely white, this function returns 1; if it is completely black, it returns 0. In general, it measures the overall average brightness of the image. What does that have to do with whether the image is a cat? Nothing, and I am not saying it is a good strategy.

How do we measure the success of a strategy? The simplest way is to see how it performs on the 2,000 cat and non-cat images it was trained on. For each image, we can assign the strategy an "error score" (computer scientists usually call this the "loss"). If the picture is of a cat and the strategy outputs 1, the error score is 0: the answer is correct. If the picture is of a cat and the strategy outputs 0, the error score is 1, the worst possible answer. If the picture is of a cat and the strategy outputs 0.8, the answer is roughly right but hedged, and the error score is 0.2. (There are many ways to measure error; this one is not the most common in practice, but it is the easiest to describe.)
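Here is a rough sketch of what that looks like in code: the "average brightness" strategy as a function of the pixel values, together with the simple error score just described. The function names and the tiny 2 × 2 "images" and labels are invented for illustration.

```python
def average_brightness_policy(pixels):
    """A (bad) strategy: output the mean brightness of the image.
    `pixels` is a flat list of numbers between 0 (black) and 1 (white)."""
    return sum(pixels) / len(pixels)

def error_score(output, is_cat):
    """Error score as in the article: distance between the strategy's
    output and the right answer (1 for cat, 0 for not cat)."""
    target = 1.0 if is_cat else 0.0
    return abs(output - target)

# Toy training set: tiny 2x2 "images" with made-up labels.
training_set = [
    ([0.9, 0.8, 0.7, 0.9], True),    # a bright picture labeled "cat"
    ([0.1, 0.0, 0.2, 0.1], False),   # a dark picture labeled "not cat"
    ([0.2, 0.3, 0.1, 0.2], True),    # a dark picture that is also a cat
]

total = sum(error_score(average_brightness_policy(img), label)
            for img, label in training_set)
print(total)   # the strategy's total error score on this toy set
```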
Adding up the error scores for all 2,000 training images gives a total error score, which measures the success of your strategy. Your goal is to find a strategy with the lowest possible total error score, ideally one that makes no mistakes at all. This is where gradient descent comes in, because now you know what it means for a strategy to get better or worse as you tweak it. The gradient measures how much the error score changes when you make a small change to the strategy; among all the small changes you could make, pick the one that makes the error score drop the most. Gradient descent doesn't just work for cats; it works for anything where you want a machine to learn a strategy from experience.

I don't want to understate the computational challenge here. A computer learning to recognize cats is more likely to have trained on millions of images, not 2,000, so computing the total error score means adding up a million individual error scores. Even with a powerful processor, that takes a long time! So in practice we often use a variant called stochastic gradient descent. It comes in countless small variations, but the basic idea is this: at the first step, you pick one of the training images at random (say, a picture of an Angora cat, or of a fish tank) and take the step that reduces the error score for that one image, instead of adding up all the error scores. At the next step, you pick another image at random and repeat. Over time (and this process takes many steps), every image eventually gets its say.

What I like about stochastic gradient descent is how crazy it sounds. Imagine the president of the United States setting global strategy while a crowd of advisers surrounds him, each shouting suggestions that would bend policy toward their own special interests. Each day the president picks one of them at random, listens, and adjusts policy accordingly. That is a ridiculous way to run a large country, but it works well in machine learning!
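As a rough illustration, here is stochastic gradient descent on about the simplest "strategy" imaginable: a single knob `w`, where the strategy's guess for an input x is w times x. The data, the learning rate, and the model are all invented for illustration; the point is only the shape of the loop, one randomly chosen example per step.

```python
import random

random.seed(0)

# Invented training data: inputs x with targets y that roughly follow y = 2x.
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in range(1, 11)]

w = 0.0          # the single knob of our one-knob strategy
rate = 0.005     # how big a step to take each time

for step in range(5000):
    x, y = random.choice(data)            # listen to one adviser at random
    prediction = w * x
    error = prediction - y
    gradient = 2 * error * x              # slope of (prediction - y)^2 with respect to w
    w -= rate * gradient                  # step downhill for this one example only

print(w)   # settles near 2.0 without ever summing the error over all examples
```

Each step looks at a single example, yet over thousands of steps the knob drifts toward the value that would also minimize the total error score.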
So far, our description is missing one important element: how do you know when to stop? You might say that's easy: stop when no small change makes the error score go down. But there is a big problem: you may not actually have reached the top! If you are the happy climber in Figure 2 and you take a step to the left or a step to the right, you find that neither direction leads uphill. That is why you are happy: you think you have reached the summit!

Figure 2

But you haven't. The true peak is still far away, and gradient descent cannot get you there. You have fallen into what mathematicians call a "local optimum" (a local maximum or a local minimum, depending on whether your goal is the top or the bottom). From this position, no small change produces an improvement, yet you are far from the truly best position.

I like to think of local optima as a mathematical model of procrastination. Suppose you face a boring task, say sorting a pile of documents, most of them related to a goal you have been pursuing for years; throwing them away would mean finally giving up on it. Every day, gradient descent suggests the small action that maximizes your happiness for that day. Will sorting that pile of documents make you happy? No; on the contrary, it will make you feel bad. Putting the task off for one more day is exactly what gradient descent tells you to do, and the algorithm gives you the same advice the next day, and the next, and the next. You have fallen into the trap of a local optimum, a valley: to climb higher you must grit your teeth and cross it, which may be a long way, going down before you can come back up.

Gradient descent is also called a "greedy algorithm," because at every moment it chooses the step with the biggest short-term payoff. Greed is one of the deadly sins, but there is a popular line about capitalism that says "greed is good." In machine learning it would be more accurate to say that greed is a pretty good thing. Gradient descent can lead you into a local optimum, but this happens less often in practice than in theory. To get around local optima, all you need to do is stop being greedy for a moment; every good rule has exceptions. For example, after you reach a summit, you can pick a random location and restart gradient descent from there. If you keep ending up in the same place, you can be more confident that it really is the best spot. In Figure 2, a climber who starts gradient descent from a random location is more likely to end up on the big mountain than to get stuck on the small one. In real life it is hard to reset yourself to a completely random position; it is more practical to take one big random step from where you are, rather than a greedy small one. That is usually enough to push you somewhere genuinely new and set you moving toward the peak of your life.

Am I right or wrong?

There is still one big problem. We cheerfully decided to consider all possible small changes and see which one gives the best gradient. For a mountain climber this is a well-defined task: you are choosing your next move in a two-dimensional space, which amounts to picking a point on a compass circle and finding the direction with the best gradient. But the space of all possible strategies for rating cat pictures is infinite-dimensional. There is no way to consider all your options, which becomes obvious if you think of a human being rather than a machine. Suppose I were writing a self-help book about gradient descent and told you, "If you want to change your life, it's easy. Think carefully about all the ways you could change your life, and choose the one that works best." You would be dumbfounded, because the space of all possible ways to change your life is far too large to search exhaustively.

And what if, by some extraordinary feat of introspection, you could search that infinite-dimensional space? Then you would have another problem, because the following strategy definitely minimizes the error score of your past life experience.

Strategy: If the decision you are about to make is exactly the same as a decision you have made before, make the same choice you made then. Otherwise, flip a coin.

Translated to the computer learning to recognize cats, the strategy becomes:

Strategy: For images identified as cats during training, output "cat." For images identified as non-cats during training, output "not cat." For any other image, flip a coin.

This strategy has an error score of 0! The computer gets the right answer for every image it was trained on. But show it a picture of a cat it has never seen before and it flips a coin. Show it a training picture I have already told it is a cat, but rotated by 0.01 degrees, and it flips a coin. Show it a picture of a refrigerator and it still flips a coin. All it can do is accurately identify the finite set of cat and non-cat images I have already shown it. That is not learning; it is memorizing.
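The memorizing strategy is easy to write down, which is part of what makes it so useless. Here is a small sketch using a made-up three-pixel "image" format; the data, labels, and numbers are invented for illustration.

```python
import random

random.seed(1)

# Toy training set: each "image" is a tuple of pixel values with a cat / not-cat label.
training = {
    (0.9, 0.8, 0.7): 1,   # cat
    (0.1, 0.2, 0.1): 0,   # not cat
    (0.5, 0.5, 0.4): 1,   # cat
}

def memorizing_strategy(image):
    """Zero error on the training set, a coin flip everywhere else."""
    if image in training:
        return training[image]
    return random.randint(0, 1)

# Perfect on every training image...
print(all(memorizing_strategy(img) == label for img, label in training.items()))  # True

# ...but a familiar cat shifted by a hair is already "new" and gets a coin flip.
print(memorizing_strategy((0.9, 0.8, 0.70001)))
```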
We have now seen two ways a strategy can fail, and in a sense they are opposite extremes.

1. The strategy is wrong in many of the situations you will actually encounter.
2. The strategy works only for situations you have already encountered and is useless for new ones.

The first problem is called underfitting: you have not made enough use of your experience in forming the strategy. The second is called overfitting: you have leaned on your experience too much.

How do we find a middle ground between these two useless extremes? The answer is to make the problem more like mountain climbing. A climber searches a very limited space of options, and we can do the same, provided we restrict our choices. We know this instinctively: when we think about evaluating our life strategies, we reach for the metaphor of choosing a direction on the surface of the Earth rather than wandering through an infinite-dimensional space. The American poet Robert Frost pictured it as two roads diverging. The Talking Heads song "Once in a Lifetime" reads like a sequel to Frost's "The Road Not Taken," and if you listen carefully, it describes gradient descent:

And you may ask yourself, "Where does that highway go to?"
And you may ask yourself, "Am I right? Am I wrong?"
And you may say to yourself, "My God! What have I done?"

One way to restrict your choices is to allow only the strategies you can reach by turning a few adjustable "knobs," and you don't have to limit yourself to a single knob. Linear regression, one of the most common knob-based methods, is the statistician's tool of choice for finding a strategy that predicts one variable from the value of another. Suppose, for example, that a penny-pinching baseball team owner wants to know how much the team's winning percentage affects ticket sales. He doesn't want to pour money and effort into the roster unless it translates into attendance.

Figure 3: Home attendance versus team winning percentage in the 2019 MLB season

Each point in Figure 3 represents a team; the vertical axis shows the team's winning percentage in the 2019 season, and the horizontal axis shows its home attendance. Your goal is a strategy that predicts home attendance from winning percentage, and the space of options you allow yourself is very small: only linear strategies.

Home attendance = mystery number 1 × winning percentage + mystery number 2

Each strategy of this kind corresponds to a straight line on the graph, and you want the line that matches the data points as closely as possible. The two mystery numbers are your two knobs, and you perform gradient descent by turning them up and down until no further fine-tuning reduces the strategy's total error score. (The error score that works best here is the sum, over all the teams, of the squares of the differences between the linear strategy's predictions and the true values, which is why the method is often called "least squares." Least squares is an old and well-developed technique that can find the best line directly, much faster than gradient descent, but gradient descent gets you there too.)
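As a sketch of turning those two knobs, here is gradient descent on a tiny, made-up stand-in for the baseball data (the real 2019 figures are not reproduced here). The variables `a` and `b` play the roles of mystery numbers 1 and 2, and the error score is the sum of squared differences, as in least squares.

```python
# Made-up (winning percentage, home attendance in thousands) pairs, for illustration only.
teams = [(0.40, 18.0), (0.48, 22.0), (0.55, 28.0), (0.60, 31.0), (0.66, 36.0)]

def total_error(a, b):
    """Sum of squared differences between predictions a*win_pct + b and the truth."""
    return sum((a * w + b - att) ** 2 for w, att in teams)

a, b, rate = 0.0, 0.0, 0.05
for _ in range(50000):
    # Gradients of the squared-error score with respect to each knob.
    grad_a = sum(2 * (a * w + b - att) * w for w, att in teams)
    grad_b = sum(2 * (a * w + b - att) for w, att in teams)
    a -= rate * grad_a
    b -= rate * grad_b

# The knobs settle near the least-squares slope and intercept for this toy data.
print(round(a, 1), round(b, 1), round(total_error(a, b), 2))
```

For a problem this small, the closed-form least-squares solution would give the same line instantly; the loop is only there to show the knob-turning.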
In the end you get a straight line like the one in Figure 4.

Figure 4

You may notice that even the line with the lowest error score still has a fairly high error. That is because most relationships in the real world are not exactly linear. We could help matters by adding more variables as inputs (the size of the team's stadium, for instance, is surely relevant), but a linear strategy will always be limited in what it can do. It certainly cannot tell you which pictures are cats. For that, you have to venture into the wild world of nonlinearity.

Deep learning and neural networks

One of the most important techniques in machine learning today is called deep learning. It is sometimes presented as an oracle, automatically producing extraordinary insights at scale. It also goes by the name neural networks, as if the method somehow captured the workings of the human brain. But it doesn't. As Meredith Broussard says, it is just math, and not even especially new math. The basic idea was around by the late 1950s, and you could see something very like a neural network's architecture in the pile of gifts I received for my bar mitzvah in 1985. Along with the checks, a few goblets, and more than twenty Cross pens, I got the gift I wanted most, from my parents: a Yamaha DX21 synthesizer, which still sits in my home office. Back in 1985 I was very proud to own a synthesizer rather than a mere keyboard. Not only could the DX21 play piano, trumpet, and violin sounds; it could make any sound you wanted, provided you could master its cryptic 70-page manual, full of diagrams like the one in Figure 5.

Figure 5

Each "OP" box represents a synthesized wave, and by turning the knobs on a box you can make its sound louder or softer, fade it in or out over time, and so on. That part is fairly mundane. The real magic of the DX21 lies in the connections between the operators. Figure 5 shows a Rube Goldberg-like arrangement in which the wave coming out of OP1 depends not only on the knobs on that box but also on the output of OP2. A wave can even modulate itself, which is what the "feedback" arrow attached to OP4 represents. By turning the few knobs on each box you can get an enormous range of outputs, and that gave me the chance to experiment and make new sounds of my own.

A neural network is very much like my synthesizer: a network built out of small boxes, as in Figure 6.

Figure 6

All the boxes do the same thing: if the number fed into a box is at least 0.5, the box outputs 1; otherwise it outputs 0. The idea of using boxes like these as the building blocks of machine learning was proposed in 1957-1958 by the psychologist Frank Rosenblatt, who saw them as a simple model of how neurons work: a neuron sits there quietly until the stimulus it receives crosses a threshold, and then it fires. Rosenblatt called these machines "perceptrons," and in honor of that history we still call these networks of pseudo-neurons "neural networks," even though most people no longer think of them as simulating the hardware of the human brain.
When a number exits a box, it travels along each of the arrows leading out of the box's right side. Each arrow carries a number called a "weight," and as the output whizzes along the arrow it gets multiplied by that weight. Each box adds up all the numbers arriving from its left side and uses the total as its input. Each column of boxes is called a layer; the network in Figure 6 has two layers, with two boxes in the first layer and one box in the second.

You start by feeding two numbers into the network, one for each box in the first layer. Here are the possibilities:

1. Both inputs are at least 0.5. Both first-layer boxes output 1. As those two 1s travel along the arrows, each is multiplied by 1/3, so the second-layer box receives 2/3 as input and outputs 1.
2. One input is at least 0.5 and the other is less than 0.5. The two outputs are 1 and 0, so the second-layer box receives 1/3 and outputs 0.
3. Both inputs are less than 0.5. Both first-layer boxes output 0, and the second-layer box outputs 0 as well.

In other words, this neural network is a machine that takes two numbers and tells you whether both of them are at least 0.5.

Figure 7 shows a slightly more complicated network.

Figure 7

Its first layer has 51 boxes, all feeding into the single box in the second layer, but the weights on the arrows differ: the smallest is 3/538 and the largest is 55/538. What does this machine do? It takes in 51 numbers, activates each first-layer box whose input is at least 0.5, forms the weighted sum of those activations, and checks whether the sum exceeds 0.5; if so, it outputs 1, and if not, 0. We could call it a "two-layer Rosenblatt perceptron," but it has a more familiar name: the Electoral College. The 51 boxes are the 50 states plus Washington, D.C. If the Republican candidate wins a state, the corresponding box is activated. The electoral votes of the activated states are added up and divided by 538, and if the result is greater than 0.5, the Republican candidate wins.

Figure 8 is a more modern example, not as easy to describe in words as the Electoral College but closer to the neural networks driving machine learning's continued progress.

Figure 8

The boxes in Figure 8 are more sophisticated than Rosenblatt's. Each one takes a number as input and outputs whichever is larger, that number or 0. In other words, a positive input passes through unchanged, while a negative input becomes 0.

Let's try the device out (see Figure 9). Suppose I feed a 1 into each of the two boxes in the leftmost layer. Both numbers are positive, so both first-layer boxes output 1. In the second layer, the first box receives 1 × 1 = 1 and the second box receives −1 × 1 = −1; similarly, the third and fourth boxes receive 1 and −1. Since 1 is positive, the first box outputs 1, but the second box receives a negative input, is not triggered, and outputs 0; likewise the third box outputs 1 and the fourth outputs 0.

Figure 9

Now the third layer. The upper box receives 1 × 1 + 3 × 0 + 2 × 1 + 1 × 0 = 3, and the lower box receives 3 × 1 − 1 × 0 − 5 × 1 − 1 × 0 = −2. So the upper box outputs 3 while the lower box is not triggered and outputs 0. Finally, the box in the fourth layer receives 1 × 3 + 1 × 0 = 3.
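Here is a short sketch that reproduces this forward pass in code. The exact wiring of Figure 8 is not fully spelled out in the text, so the weights below are reconstructed from the arithmetic of the worked example (fourteen numbers in all); treat them as an inference, not a verbatim copy of the figure.

```python
def relu(t):
    """The Figure 8 box: pass a positive number through, turn a negative one into 0."""
    return max(t, 0.0)

# Reconstructed weights: 4 into the second layer, 8 into the third, 2 into the fourth.
LAYER2 = [(0, 1), (0, -1), (1, 1), (1, -1)]   # (which first-layer box it listens to, weight)
LAYER3 = [(1, 3, 2, 1), (3, -1, -5, -1)]      # two third-layer boxes, four weights each
LAYER4 = (1, 1)                               # final box, one weight per third-layer box

def network(x, y):
    first = [relu(x), relu(y)]                                        # first layer
    l2 = [relu(w * first[src]) for src, w in LAYER2]                  # second layer
    l3 = [relu(sum(w * v for w, v in zip(ws, l2))) for ws in LAYER3]  # third layer
    return sum(w * v for w, v in zip(LAYER4, l3))                     # fourth layer

print(network(1, 1))   # reproduces the 3 computed in the text
```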
Even if you don't follow every detail, it doesn't matter. What matters is that a neural network is a strategy: it takes two numbers as input and returns a number as output. If you change the weights on the arrows, that is, if you turn the 14 knobs, you change the strategy. Figure 9 hands you a 14-dimensional space of strategies, and your job is to find the point in it that best fits your data. If you find it hard to picture a 14-dimensional space, take the advice of Geoffrey Hinton, one of the founders of modern neural network theory: "Visualize a three-dimensional space and say to yourself, out loud, 'This is fourteen-dimensional space.' Everyone can do it." Hinton comes from a family of high-dimensional-space enthusiasts: his great-grandfather Charles wrote a book in 1904 on how to visualize four-dimensional cubes and coined the word "tesseract" for them. I don't know whether you have seen the Spanish painter Salvador Dali's Crucifixion (Corpus Hypercubus), which features just such a Hinton hypercube.

With the weights shown in Figure 10, the network assigns a value of 3 or less to a point (x, y) of the plane exactly when the point lies inside the gray shape. Note that the point (1, 1), which sits on the boundary of the gray shape, is assigned the value 3.

Figure 10

Different weights produce different shapes, though not arbitrary ones. The nature of the perceptron means the shape is always a polygon, a region whose boundary is made up of line segments. (Didn't I say earlier that we needed nonlinearity? Yes, but the perceptron is a piecewise linear construction: it satisfies different linear relationships in different regions of space. More general neural networks can produce curvier results.)

Now suppose, as in Figure 11, that I have marked some points in the plane with X's and others with O's. I want the machine to learn a strategy for assigning an X or an O to every other, unlabeled point in the plane, based on the points I marked. Perhaps (I hope) I can set those 14 knobs so that the strategy assigns large values to all the points labeled X and small values to all the points labeled O, and then I can make educated guesses about the unlabeled points. If such a strategy exists, I hope to learn it by gradient descent: turn each knob a little, see how much each tweak reduces the error score on the labeled examples, make the tweak that helps most, and repeat (a small sketch of this loop appears below).

Figure 11

The "deep" in deep learning just means that the network has many layers. The number of boxes in a layer is called its "width," and in practice that can be large too; but "deep learning" makes for better jargon than "wide learning" would.
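Here is a sketch of that knob-turning loop, using the same 14-weight architecture and a handful of made-up labeled points standing in for the X's and O's. The "turn each knob a little" rule is implemented literally, by trying a small nudge up and down on every knob and keeping the single nudge that lowers the total error most; with a fixed step and a toy dataset it may well settle in a local optimum rather than a perfect fit, which is exactly the failure mode described earlier.

```python
import random

def relu(t):
    return max(t, 0.0)

def network(w, x, y):
    """Same shape as the Figure 8 network: 4 + 8 + 2 = 14 weights."""
    a, b = relu(x), relu(y)
    l2 = [relu(w[0] * a), relu(w[1] * a), relu(w[2] * b), relu(w[3] * b)]
    l3 = [relu(sum(w[4 + i] * l2[i] for i in range(4))),
          relu(sum(w[8 + i] * l2[i] for i in range(4)))]
    return w[12] * l3[0] + w[13] * l3[1]

# Made-up labeled points: X's (label 1) in one corner of the plane, O's (label 0) in another.
points = [((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.8, 0.6), 1),
          ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0)]

def total_error(w):
    return sum((network(w, x, y) - label) ** 2 for (x, y), label in points)

random.seed(2)
weights = [random.uniform(-1, 1) for _ in range(14)]
delta = 0.05

for _ in range(1000):
    best = (total_error(weights), None, 0.0)
    for i in range(14):                      # try nudging each knob up and down
        for change in (+delta, -delta):
            trial = list(weights)
            trial[i] += change
            err = total_error(trial)
            if err < best[0]:
                best = (err, i, change)
    if best[1] is None:                      # no nudge helps: a (possibly local) optimum
        break
    weights[best[1]] += best[2]

print(total_error(weights))                  # error on the labeled points after training
print(network(weights, 0.85, 0.75), network(weights, 0.15, 0.25))  # guesses for new points
```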
To be sure, today's deep learning networks are far more complicated than these diagrams, and the functions inside the boxes are more complicated than the simple ones discussed here. Recurrent networks include feedback boxes, like the OP4 on my DX21 synthesizer, that take their own output as input. And the machines that run them are far faster. The concept of the neural network has been around for a long time, and I can remember when, not long ago, many people thought the approach would never work. It turned out to be a good idea after all; the hardware just had to catch up with the concept. GPU chips, designed to render game graphics quickly, later proved ideal for training large neural networks fast, letting experimenters push their networks deeper and wider. With modern processors you are no longer limited to 14 knobs; you can manipulate thousands, millions, or far more. GPT-3, which generates English text that reads as if a person wrote it, uses a neural network with 175 billion knobs.

A 175-billion-dimensional space sounds big, but it is tiny compared with infinity. Likewise, we are exploring only a minuscule corner of the space of all possible strategies. Yet in practice this seems to be enough to generate human-looking text, just as the DX21's small network was enough to simulate the timbres of trumpets, cellos, and space thunder. That is pretty amazing, but there is a deeper mystery still. Remember, the point of gradient descent is to keep turning the knobs until the network performs as well as possible on the data it was trained on. Today's networks have so many knobs that they can often perform perfectly on the training set, classifying every one of 1,000 cat pictures as "cat" and every one of 1,000 other pictures as "not cat." In fact, there are so many knobs that the space of strategies that get the training data 100% right is itself huge, and most of those strategies perform badly on images the network has never seen. But the dumb, greedy process of gradient descent lands on some of these strategies more often than others, and in practice the strategies gradient descent favors seem to generalize much better to new examples. Why? What makes this particular kind of network so good at such a wide variety of learning problems? Why does the tiny region of strategy space we actually search contain good strategies at all? As far as I know, it is a mystery. Frankly, there is even debate about whether it is a mystery. I have put the question to many well-known AI researchers, and all of them answered eloquently; some explained the reasons with great confidence, but each gave a different answer.

About the author

Jordan Stuart Ellenberg (born 1971) is an American mathematician. He received his Ph.D. from Harvard University in 1998 and is currently the John D. MacArthur Professor at the University of Wisconsin-Madison. His main research areas are algebraic geometry and number theory. He has won numerous science-communication awards and is the author of How Not to Be Wrong, Shape, and the novel The Grasshopper King. His writing appears in The Wall Street Journal, The New York Times, Slate, Wired, and elsewhere. This article is excerpted, with permission and with some cuts, from Chapter 7, "Machine Learning Is Like Climbing a Mountain," of The Power of Geometry (CITIC Press, Nautilus imprint, March 2023), the Chinese edition of his book Shape.