Almost everything in life unfolds over time and therefore forms a sequence. For sequence data (text, speech, video, and so on) we could feed an entire sequence into an ordinary neural network, but then the input size is fixed, and the limitation is obvious: if an important event in the time series happens to fall outside the input window, it causes serious problems. So what we need is a network that can take in a sequence item by item and remember what it has already seen: a recurrent neural network.
Figure 1: A long short-term memory (LSTM) cell. An LSTM has four sets of input weights and four sets of recurrent weights. Peepholes are additional connections between the memory cell and the gates, but they do little to improve performance and are often omitted.

Recurrent Neural Networks

If we want an ordinary neural network to add two numbers, we simply feed in the two numbers and train it to predict their sum. If we now have three numbers to add, we can:

(1) add another input to the network and retrain it, or
(2) feed the network's output (the sum of the first two numbers) back in as an input, together with the third number.

Solution (2) is clearly better, because we avoid retraining the entire network (it already "knows" how to add two numbers). But if the task becomes "first add two numbers, then subtract two different numbers", this solution no longer works: even with additional weights we cannot guarantee the correct output. Instead, we can try to "modify the program", switching the network from "addition" to "subtraction". This can be achieved with recurrent weights on the hidden layer (see Figure 2), so that the internal state of the network changes with every new input. The network can learn to change its program from "addition" to "subtraction" after the first two numbers have been added, and thus solve the task. We can even generalize this approach and pass in two numbers followed by a "special" number that represents the mathematical operation "addition", "subtraction" or "multiplication". In practice this does not work perfectly, but the results are roughly correct. The main point here, however, is not getting the exact result; it is that a recurrent neural network can be trained to produce specific outputs for arbitrary input sequences, which makes it very powerful. For example, we can teach a network to learn sequences of words. Soumith Chintala and Wojciech Zaremba wrote an excellent blog post about using RNNs for natural language processing. RNNs can also be used to generate sequences: Andrej Karpathy wrote a lively blog post showing how character-level RNNs can imitate a wide range of styles, from Shakespeare to Linux source code to baby names.
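To make the idea of a state that changes with every input concrete, here is a minimal sketch of the forward pass of a plain recurrent network in NumPy (the sizes and weights are made up for illustration; nothing here is trained): the same input and recurrent weights are applied at every time step, and the hidden state carries information from earlier steps forward.

    import numpy as np

    def rnn_forward(inputs, W_in, W_rec, b):
        """Run a plain RNN over a sequence and return the hidden state at each step."""
        hidden = np.zeros(W_rec.shape[0])      # the state starts out empty
        states = []
        for x in inputs:                       # one step per item in the sequence
            # the new state mixes the current input with the previous state
            hidden = np.tanh(W_in @ x + W_rec @ hidden + b)
            states.append(hidden)
        return states

    # Toy dimensions: 4-dimensional inputs, 8 hidden units (illustrative only).
    rng = np.random.default_rng(0)
    W_in, W_rec, b = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
    sequence = [rng.normal(size=4) for _ in range(5)]
    states = rnn_forward(sequence, W_in, W_rec, b)
    print(len(states), states[-1].shape)       # 5 (8,)

Because the recurrent weights are applied at every step, what the network does with the current input depends on everything it has seen before, which is exactly the "program that changes with each new input" described above.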
Long Short-Term Memory (LSTM)

Long short-term memory cells use a self-connected linear unit whose recurrent weight is fixed at 1.0. This means that values flowing into the self-loop during the forward pass, and gradients flowing through it during the backward pass, are preserved unchanged (multiplying an input or an error by 1.0 leaves it the same, so the value at one time step equals the value at the next), and they can be retrieved exactly when needed. This self-looping unit, the memory cell, provides a memory that can store information across many previous time steps. This is extremely useful for tasks such as text, where an LSTM can store information from an earlier paragraph and apply it while processing the current one. It also addresses a common problem in deep networks, the "vanishing gradient" problem, in which the gradient becomes smaller and smaller as it passes through more layers or time steps. Because the memory cell passes the error through unchanged, the gradient flow is preserved and the network can learn dependencies spanning hundreds of time steps.

Sometimes, however, we want to discard old information and replace it with newer, more relevant information, and we do not want the stale content to leak out and interfere with the rest of the network. For this, the LSTM cell has a forget gate, which can erase the contents of the self-loop unit without releasing them to the network (see Figure 1). The forget gate multiplies the value in the memory cell by a number between 0 (forget everything) and 1 (keep everything as it is); the exact value is determined by the current input and by the LSTM's output at the previous time step.

At other times the memory cell needs to stay unchanged over many time steps, so the LSTM adds another gate, the input gate (or write gate): while the input gate is closed, no new information flows in and the stored value is protected. A third gate, the output gate, multiplies the memory cell's output by a number between 0 (suppress the output) and 1 (let it pass unchanged). This is useful when several memories compete with one another: one memory cell may say, "My memory is very important, so I am releasing it now", while the network may decide, "Your memory matters, but other memory cells are more important right now, so I will give your output gate a small value and theirs a large one, and they will win."

The way LSTM cells are wired together can look complicated at first and takes some time to digest. But if you examine each component separately, you will find the structure is essentially the same as in an ordinary recurrent network: the input and recurrent weights flow into all the gates, which are connected to the self-recurrent memory cell. For a deeper understanding of LSTM and its overall architecture, I recommend reading LSTM: A Search Space Odyssey and the original LSTM paper.
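Putting the gates together, here is a minimal sketch of a single LSTM step in NumPy (illustrative variable names, untrained random weights, and no peephole connections): the forget gate scales the old cell state, the input gate decides how much new content is written, and the output gate decides how much of the cell is exposed to the rest of the network.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step. W, U and b hold the input weights, recurrent
        weights and biases for the forget (f), input (i) and output (o) gates
        and for the candidate content (g)."""
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: 0 = erase, 1 = keep
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input (write) gate
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
        g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate new content
        c = f * c_prev + i * g    # self-loop: old cell state carried over with weight 1.0, scaled only by the forget gate
        h = o * np.tanh(c)        # how much of the memory is released to the network
        return h, c

    # Toy sizes: 4-dimensional input, 8-unit cell (illustrative only).
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(8, 4)) for k in "fiog"}
    U = {k: rng.normal(size=(8, 8)) for k in "fiog"}
    b = {k: np.zeros(8) for k in "fiog"}
    h, c = np.zeros(8), np.zeros(8)
    for x in [rng.normal(size=4) for _ in range(3)]:   # run three time steps
        h, c = lstm_step(x, h, c, W, U, b)

With the forget gate near 1 and the input gate near 0, the cell state passes from one step to the next unchanged, which is the constant value and gradient flow described above; the four weighted inputs per step are the "four input weights and four recurrent weights" of Figure 1.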
Word Embedding

Figure 3: A two-dimensional word-embedding space of recipes, zoomed in on the "Southern Europe" cluster.

Think of "cat" and of all the other words related to "cat": you might come up with "kitten" or "feline". Now think of words that are less similar, but still much more similar to "cat" than "car" is, such as "lion", "tiger", "dog", "animal", or the verbs "purring", "mewing" and "sleeping". Imagine a three-dimensional space and put the word "cat" in the middle. Among the words above, those more similar to "cat" sit closer to it in the space: "kitten" and "feline" are very close to the center, "tiger" and "lion" a little further away, "dog" further still, and "car" is practically nowhere to be found. Figure 3 shows an example of such a word embedding in a two-dimensional space.

If we represent each word by a vector in this space, then each vector consists of three coordinates; for example, "cat" is (0, 0, 0), "kitten" might be (0.1, 0.2, -0.3) and "car" (10, 0, -15). This vector space is the word-embedding space, and the three coordinates of each word can be used as input data for an algorithm. A typical word-embedding space contains thousands of words and hundreds of dimensions, which is hard for humans to grasp intuitively, but the rule that similar words lie close together still holds. For machines this is a good representation of vocabulary, and it improves natural language processing. If you want to learn more about word embeddings and how they are used to build models that "understand" language, I recommend reading Understanding Natural Language with Deep Neural Networks Using Torch, by Soumith Chintala and Wojciech Zaremba.
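As a toy illustration of this geometry (the vectors below are invented to mirror the example, not taken from a trained model), we can represent each word as a small array and list a word's nearest neighbours by distance in the embedding space:

    import numpy as np

    # Invented 3-dimensional "embeddings" mirroring the example above.
    embeddings = {
        "cat":    np.array([0.0, 0.0, 0.0]),
        "kitten": np.array([0.1, 0.2, -0.3]),
        "tiger":  np.array([0.9, 0.4, -0.1]),
        "dog":    np.array([1.5, -0.8, 0.6]),
        "car":    np.array([10.0, 0.0, -15.0]),
    }

    def nearest(word, k=3):
        """Return the k words closest to `word` by Euclidean distance."""
        center = embeddings[word]
        others = [(np.linalg.norm(vec - center), w)
                  for w, vec in embeddings.items() if w != word]
        return [w for _, w in sorted(others)[:k]]

    print(nearest("cat"))   # ['kitten', 'tiger', 'dog']; "car" is far away

In a real embedding space the vectors are learned from text and have hundreds of dimensions, but the principle is the same: distance encodes similarity, and the coordinates can be handed to a downstream algorithm as input.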
Encoding-Decoding

Let's step away from natural language processing for a moment and imagine a tomato, together with the ingredients or dishes that go well with it. If your associations resemble the most common recipes on the Internet, you might think of cheese and salami; of Parmesan, basil and macaroni; or of olive oil, thyme and celery (if you grew up with Chinese cooking, tomato and scrambled eggs probably come to mind first). These combinations belong mostly to Italian and Mediterranean cuisine. It is the same tomato, but if you think of Mexican food instead, you might associate it with beans, corn, peppers, cilantro or avocado.

What you have just done is change the representation of the word "tomato" into a new representation: "tomato in Mexican cuisine". An encoder does the same thing: it transforms the input words, one by one, into new "thought vectors" by changing their representation, just as the context "Mexican food" was added to "tomato". This is the first step of the encoder-decoder architecture.

The second step rests on the fact that different languages have a similar geometric structure in their word-embedding spaces, even though the words used to describe the same thing are completely different. For example, in German "cat" is "Katze" and "dog" is "Hund"; the words differ entirely from their English counterparts, yet the relationship between them is the same. The relationship between "Katze" and "Hund" is essentially the relationship between "cat" and "dog". In other words, even when the words themselves are different, the "thought vectors" behind them are the same. There are some words that are hard to express in another language (such as the Chinese word "缘分"), but such cases are relatively rare, and the rule generally holds.

Based on this idea we can build a decoding network. We pass the thought vectors produced by an English encoder to a German decoder, which maps these thought vectors, or relational transformations, into the German word-embedding space and generates a sentence that preserves the relationships of the English sentence. In this way we obtain a network that can translate. This idea is still under development; the results are not yet perfect, but they are improving rapidly and will soon become the best method for translation.
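As a final, very small sketch of the "same geometry, different words" idea (this is not the recurrent encoder-decoder itself, and all vectors and word pairs below are invented for illustration), we can fit a linear map from a toy English embedding space into a toy German one using a few known word pairs, and then use that map to find the German neighbour of a word that was not among those pairs.

    import numpy as np

    # Invented 2-D embeddings; the German space is (by construction) a rotation of the English one.
    english = {"cat": [1.0, 0.2], "dog": [0.9, -0.4], "house": [-0.8, 0.7], "car": [-0.9, -0.6]}
    german  = {"Katze": [0.2, -1.0], "Hund": [-0.4, -0.9], "Haus": [0.7, 0.8], "Auto": [-0.6, 0.9]}

    pairs = [("cat", "Katze"), ("dog", "Hund"), ("house", "Haus")]   # known translations
    X = np.array([english[e] for e, _ in pairs])
    Y = np.array([german[g] for _, g in pairs])

    # Least-squares linear map W such that X @ W is approximately Y.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # "Translate" a word the map was never fitted on: project it into the German
    # space and take the nearest German vector.
    projected = np.array(english["car"]) @ W
    nearest = min(german, key=lambda g: np.linalg.norm(np.array(german[g]) - projected))
    print(nearest)   # Auto

Because the two spaces share their geometry, a map learned from a handful of word pairs carries the relationships over correctly; the encoder-decoder networks described above learn a far richer, nonlinear version of this mapping over whole sentences.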