Almost everything in life unfolds over time and therefore forms a sequence. For sequence data (text, speech, video, and so on) we could feed an entire sequence into an ordinary neural network, but then the input size is fixed, and the limitation is obvious: if an important event in the time series happens to fall outside the input window, it causes serious problems. So what we need is a network that can take in a sequence item by item and remember what it has already seen: a recurrent neural network.
Figure 1: A long short-term memory (LSTM) cell. An LSTM has four sets of input weights and four sets of recurrent weights. Peepholes are additional connections between the memory cell and the gates, but they do little to improve performance and are often omitted.

Recurrent Neural Networks

If we want an ordinary neural network to add two numbers, we simply feed in the two numbers and train it to predict their sum. If we now have three numbers to add, we can:

(1) add another input to the network and retrain it, or
(2) feed the network's output (the sum of the first two numbers) back in as an input, together with the third number.

Solution (2) is clearly better, because we avoid retraining the entire network (it already "knows" how to add two numbers). But if the task becomes "first add two numbers, then subtract two different numbers", this solution no longer works: even with additional weights we cannot guarantee the correct output. Instead, we can try to "modify the program", switching the network from "addition" to "subtraction". This can be achieved with recurrent weights on the hidden layer (see Figure 2), so that the internal state of the network changes with every new input. The network can learn to change its program from "addition" to "subtraction" after the first two numbers have been added, and thus solve the task. We can even generalize this approach and pass in two numbers followed by a "special" number that represents the mathematical operation "addition", "subtraction" or "multiplication". In practice this does not work perfectly, but the results are roughly correct. The main point here, however, is not getting the exact result; it is that a recurrent neural network can be trained to produce specific outputs for arbitrary input sequences, which makes it very powerful. For example, we can teach a network to learn sequences of words. Soumith Chintala and Wojciech Zaremba wrote an excellent blog post about using RNNs for natural language processing. RNNs can also be used to generate sequences: Andrej Karpathy wrote a lively blog post showing how character-level RNNs can imitate a wide range of styles, from Shakespeare to Linux source code to baby names.
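To make the idea of a state that changes with every input concrete, here is a minimal sketch of the forward pass of a plain recurrent network in NumPy (the sizes and weights are made up for illustration; nothing here is trained): the same input and recurrent weights are applied at every time step, and the hidden state carries information from earlier steps forward.

    import numpy as np

    def rnn_forward(inputs, W_in, W_rec, b):
        """Run a plain RNN over a sequence and return the hidden state at each step."""
        hidden = np.zeros(W_rec.shape[0])      # the state starts out empty
        states = []
        for x in inputs:                       # one step per item in the sequence
            # the new state mixes the current input with the previous state
            hidden = np.tanh(W_in @ x + W_rec @ hidden + b)
            states.append(hidden)
        return states

    # Toy dimensions: 4-dimensional inputs, 8 hidden units (illustrative only).
    rng = np.random.default_rng(0)
    W_in, W_rec, b = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
    sequence = [rng.normal(size=4) for _ in range(5)]
    states = rnn_forward(sequence, W_in, W_rec, b)
    print(len(states), states[-1].shape)       # 5 (8,)

Because the recurrent weights are applied at every step, what the network does with the current input depends on everything it has seen before, which is exactly the "program that changes with each new input" described above.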
Long Short-Term Memory (LSTM)

Long short-term memory cells use a self-connected linear unit whose recurrent weight is fixed at 1.0. This means that values flowing into the self-loop during the forward pass, and gradients flowing through it during the backward pass, are preserved unchanged (multiplying an input or an error by 1.0 leaves it the same, so the value at one time step equals the value at the next), and they can be retrieved exactly when needed. This self-looping unit, the memory cell, provides a memory that can store information across many previous time steps. This is extremely useful for tasks such as text, where an LSTM can store information from an earlier paragraph and apply it while processing the current one. It also addresses a common problem in deep networks, the "vanishing gradient" problem, in which the gradient becomes smaller and smaller as it passes through more layers or time steps. Because the memory cell passes the error through unchanged, the gradient flow is preserved and the network can learn dependencies spanning hundreds of time steps.

Sometimes, however, we want to discard old information and replace it with newer, more relevant information, and we do not want the stale content to leak out and interfere with the rest of the network. For this, the LSTM cell has a forget gate, which can erase the contents of the self-loop unit without releasing them to the network (see Figure 1). The forget gate multiplies the value in the memory cell by a number between 0 (forget everything) and 1 (keep everything as it is); the exact value is determined by the current input and by the LSTM's output at the previous time step.

At other times the memory cell needs to stay unchanged over many time steps, so the LSTM adds another gate, the input gate (or write gate): while the input gate is closed, no new information flows in and the stored value is protected. A third gate, the output gate, multiplies the memory cell's output by a number between 0 (suppress the output) and 1 (let it pass unchanged). This is useful when several memories compete with one another: one memory cell may say, "My memory is very important, so I am releasing it now", while the network may decide, "Your memory matters, but other memory cells are more important right now, so I will give your output gate a small value and theirs a large one, and they will win."

The way LSTM cells are wired together can look complicated at first and takes some time to digest. But if you examine each component separately, you will find the structure is essentially the same as in an ordinary recurrent network: the input and recurrent weights flow into all the gates, which are connected to the self-recurrent memory cell. For a deeper understanding of LSTM and its overall architecture, I recommend reading LSTM: A Search Space Odyssey and the original LSTM paper.
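Putting the gates together, here is a minimal sketch of a single LSTM step in NumPy (illustrative variable names, untrained random weights, and no peephole connections): the forget gate scales the old cell state, the input gate decides how much new content is written, and the output gate decides how much of the cell is exposed to the rest of the network.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step. W, U and b hold the input weights, recurrent
        weights and biases for the forget (f), input (i) and output (o) gates
        and for the candidate content (g)."""
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: 0 = erase, 1 = keep
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input (write) gate
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
        g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate new content
        c = f * c_prev + i * g    # self-loop: old cell state carried over with weight 1.0, scaled only by the forget gate
        h = o * np.tanh(c)        # how much of the memory is released to the network
        return h, c

    # Toy sizes: 4-dimensional input, 8-unit cell (illustrative only).
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(8, 4)) for k in "fiog"}
    U = {k: rng.normal(size=(8, 8)) for k in "fiog"}
    b = {k: np.zeros(8) for k in "fiog"}
    h, c = np.zeros(8), np.zeros(8)
    for x in [rng.normal(size=4) for _ in range(3)]:   # run three time steps
        h, c = lstm_step(x, h, c, W, U, b)

With the forget gate near 1 and the input gate near 0, the cell state passes from one step to the next unchanged, which is the constant value and gradient flow described above; the four weighted inputs per step are the "four input weights and four recurrent weights" of Figure 1.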
Word Embedding

Figure 3: A two-dimensional word-embedding space of recipes, zoomed in on the "Southern Europe" cluster.

Think of "cat" and of all the other words related to "cat": you might come up with "kitten" or "feline". Now think of words that are less similar, but still much more similar to "cat" than "car" is, such as "lion", "tiger", "dog", "animal", or the verbs "purring", "mewing" and "sleeping". Imagine a three-dimensional space and put the word "cat" in the middle. Among the words above, those more similar to "cat" sit closer to it in the space: "kitten" and "feline" are very close to the center, "tiger" and "lion" a little further away, "dog" further still, and "car" is practically nowhere to be found. Figure 3 shows an example of such a word embedding in a two-dimensional space.

If we represent each word by a vector in this space, then each vector consists of three coordinates; for example, "cat" is (0, 0, 0), "kitten" might be (0.1, 0.2, -0.3) and "car" (10, 0, -15). This vector space is the word-embedding space, and the three coordinates of each word can be used as input data for an algorithm. A typical word-embedding space contains thousands of words and hundreds of dimensions, which is hard for humans to grasp intuitively, but the rule that similar words lie close together still holds. For machines this is a good representation of vocabulary, and it improves natural language processing. If you want to learn more about word embeddings and how they are used to build models that "understand" language, I recommend reading Understanding Natural Language with Deep Neural Networks Using Torch, by Soumith Chintala and Wojciech Zaremba.
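As a toy illustration of this geometry (the vectors below are invented to mirror the example, not taken from a trained model), we can represent each word as a small array and list a word's nearest neighbours by distance in the embedding space:

    import numpy as np

    # Invented 3-dimensional "embeddings" mirroring the example above.
    embeddings = {
        "cat":    np.array([0.0, 0.0, 0.0]),
        "kitten": np.array([0.1, 0.2, -0.3]),
        "tiger":  np.array([0.9, 0.4, -0.1]),
        "dog":    np.array([1.5, -0.8, 0.6]),
        "car":    np.array([10.0, 0.0, -15.0]),
    }

    def nearest(word, k=3):
        """Return the k words closest to `word` by Euclidean distance."""
        center = embeddings[word]
        others = [(np.linalg.norm(vec - center), w)
                  for w, vec in embeddings.items() if w != word]
        return [w for _, w in sorted(others)[:k]]

    print(nearest("cat"))   # ['kitten', 'tiger', 'dog']; "car" is far away

In a real embedding space the vectors are learned from text and have hundreds of dimensions, but the principle is the same: distance encodes similarity, and the coordinates can be handed to a downstream algorithm as input.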
Encoding-Decoding

Let's step away from natural language processing for a moment and imagine a tomato, together with the ingredients or dishes that go well with it. If your associations resemble the most common recipes on the Internet, you might think of cheese and salami; of Parmesan, basil and macaroni; or of olive oil, thyme and celery (if you grew up with Chinese cooking, tomato and scrambled eggs probably come to mind first). These combinations belong mostly to Italian and Mediterranean cuisine. It is the same tomato, but if you think of Mexican food instead, you might associate it with beans, corn, peppers, cilantro or avocado.

What you have just done is change the representation of the word "tomato" into a new representation: "tomato in Mexican cuisine". An encoder does the same thing: it transforms the input words, one by one, into new "thought vectors" by changing their representation, just as the context "Mexican food" was added to "tomato". This is the first step of the encoder-decoder architecture.

The second step rests on the fact that different languages have a similar geometric structure in their word-embedding spaces, even though the words used to describe the same thing are completely different. For example, in German "cat" is "Katze" and "dog" is "Hund"; the words differ entirely from their English counterparts, yet the relationship between them is the same. The relationship between "Katze" and "Hund" is essentially the relationship between "cat" and "dog". In other words, even when the words themselves are different, the "thought vectors" behind them are the same. There are some words that are hard to express in another language (such as the Chinese word "缘分"), but such cases are relatively rare, and the rule generally holds.

Based on this idea we can build a decoding network. We pass the thought vectors produced by an English encoder to a German decoder, which maps these thought vectors, or relational transformations, into the German word-embedding space and generates a sentence that preserves the relationships of the English sentence. In this way we obtain a network that can translate. This idea is still under development; the results are not yet perfect, but they are improving rapidly and will soon become the best method for translation.
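As a final, very small sketch of the "same geometry, different words" idea (this is not the recurrent encoder-decoder itself, and all vectors and word pairs below are invented for illustration), we can fit a linear map from a toy English embedding space into a toy German one using a few known word pairs, and then use that map to find the German neighbour of a word that was not among those pairs.

    import numpy as np

    # Invented 2-D embeddings; the German space is (by construction) a rotation of the English one.
    english = {"cat": [1.0, 0.2], "dog": [0.9, -0.4], "house": [-0.8, 0.7], "car": [-0.9, -0.6]}
    german  = {"Katze": [0.2, -1.0], "Hund": [-0.4, -0.9], "Haus": [0.7, 0.8], "Auto": [-0.6, 0.9]}

    pairs = [("cat", "Katze"), ("dog", "Hund"), ("house", "Haus")]   # known translations
    X = np.array([english[e] for e, _ in pairs])
    Y = np.array([german[g] for _, g in pairs])

    # Least-squares linear map W such that X @ W is approximately Y.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # "Translate" a word the map was never fitted on: project it into the German
    # space and take the nearest German vector.
    projected = np.array(english["car"]) @ W
    nearest = min(german, key=lambda g: np.linalg.norm(np.array(german[g]) - projected))
    print(nearest)   # Auto

Because the two spaces share their geometry, a map learned from a handful of word pairs carries the relationships over correctly; the encoder-decoder networks described above learn a far richer, nonlinear version of this mapping over whole sentences.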