Introducing the attention mechanism into RNNs to solve sequence prediction problems in five major areas

The encoder-decoder architecture is popular because it has achieved state-of-the-art results in many fields. A limitation of this architecture is that it encodes the input sequence into a fixed-length internal representation. This constrains the length of input sequence that can be handled well and causes the model to perform poorly on very long input sequences.

In this blog post, we will discover that we can overcome this limitation by using an attention mechanism in recurrent neural networks.

After reading this blog, you will know:

  • The limitation of the encoder-decoder architecture and its fixed-length internal representation
  • How the attention mechanism lets the network learn where to attend in the input sequence for each item in the output sequence
  • Five applications of recurrent neural networks with attention, including text translation and speech recognition

The problem with long sequences

In an encoder-decoder recurrent neural network, one set of long short-term memory (LSTM) networks learns to encode the input sequence into a fixed-length internal representation, and a second set of LSTM networks reads that internal representation and decodes it into the output sequence. This architecture has demonstrated state-of-the-art results on difficult sequence prediction problems such as text translation and quickly became the dominant approach. For example, see these two papers:

  • Sequence to Sequence Learning with Neural Networks (2014)
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014)

The encoder-decoder architecture still achieves excellent results on many problems. However, it suffers from the constraint that all input sequences are forced into an internal vector of fixed length. This limits the performance of these networks, especially for long input sequences such as long sentences in text translation.

“A potential issue with this encoder-decoder approach is that the neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.”

—— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
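
To make the bottleneck concrete, here is a minimal toy sketch in NumPy (not the implementation from either paper, and with random untrained weights): however long the input sequence is, the decoder only ever sees the encoder's final hidden and cell state, a single fixed-length pair of vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W):
    """One toy LSTM step with a single stacked weight matrix W (no biases)."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    i, f, o = 1 / (1 + np.exp(-i)), 1 / (1 + np.exp(-f)), 1 / (1 + np.exp(-o))
    c = f * c + i * np.tanh(g)
    return o * np.tanh(c), c

hidden, in_dim, T_in, T_out = 8, 4, 6, 3
W_enc = rng.normal(size=(4 * hidden, in_dim + hidden)) * 0.1
W_dec = rng.normal(size=(4 * hidden, hidden + hidden)) * 0.1

# Encoder: fold the whole input sequence into one fixed-length (h, c).
inputs = rng.normal(size=(T_in, in_dim))
h = c = np.zeros(hidden)
for x in inputs:
    h, c = lstm_step(x, h, c, W_enc)

# Decoder: sees only the final (h, c), however long the input was;
# that single fixed-length state is the bottleneck discussed above.
y = np.zeros(hidden)
outputs = []
for _ in range(T_out):
    h, c = lstm_step(y, h, c, W_dec)
    y = h                     # simplified readout of the next output item
    outputs.append(y)
print(len(outputs), outputs[0].shape)   # 3 (8,)
```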

The attention mechanism in sequences

The attention mechanism is a method that frees the encoder-decoder architecture from fixed-length internal representations. It works by keeping the intermediate output of the LSTM encoder for each step of the input sequence, and then training the model to learn how to selectively pay attention to the inputs and associate them with items in the output sequence. In other words, each item in the output sequence depends on the selected items in the input sequence.

“Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the target word based on the context vectors associated with these source positions and all the previously generated target words.”

“…It encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees the neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.”

——Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate (https://arxiv.org/abs/1409.0473), 2015
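
The core computation is small. The sketch below (plain NumPy, toy sizes, dot-product scoring for brevity rather than the feed-forward alignment network used by Bahdanau et al.) shows one attention step: score every stored encoder output against the current decoder state, normalize the scores into weights, and take the weighted sum as a context vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(encoder_states, decoder_state):
    """Score every kept encoder state against the current decoder state,
    normalize the scores, and return the weighted sum as a context vector."""
    scores = encoder_states @ decoder_state      # one score per input step
    weights = softmax(scores)                    # attention distribution
    context = weights @ encoder_states           # weighted sum of states
    return context, weights

# Toy example: 5 input steps, hidden size 8, random stand-in values.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # kept outputs, one per input step
decoder_state = rng.normal(size=8)         # current decoder hidden state
context, weights = attend(encoder_states, decoder_state)
print(weights.round(3), context.shape)     # weights sum to 1, context is (8,)
```

In a full model the context vector would be combined with the decoder state to predict the next output item, and the scoring function would be learned jointly with the rest of the network.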

Although this increases the computational burden on the model, it yields a more focused and better-performing model. In addition, the model can show how it attends to the input sequence while predicting the output sequence, which helps us understand and analyze what the model is focusing on for each input-output pair, and to what extent.

“The proposed approach provides an intuitive way to inspect the (soft-)alignment between each word in the generated sequence and the words in the input sequence, which can be done by visualizing the annotation weights… Each row of the matrix in each plot indicates the weights associated with the annotations. From this we see which positions in the source sentence were emphasized when generating the target word.”

——Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate (https://arxiv.org/abs/1409.0473), 2015
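
Because the attention weights at each decoding step form a matrix of output positions by input positions, they can be plotted directly. A minimal sketch with matplotlib, using made-up words and random weights in place of a trained model's annotation weights:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up attention weights: rows are generated target words, columns are
# source words, and each row sums to 1 (as softmax outputs would).
source = ["the", "agreement", "on", "the", "economic", "area"]
target = ["l'", "accord", "sur", "la", "zone", "economique"]
weights = np.random.default_rng(0).dirichlet(np.ones(len(source)), size=len(target))

fig, ax = plt.subplots()
ax.imshow(weights, cmap="gray")        # lighter cell = larger weight
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source, rotation=45)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("source words")
ax.set_ylabel("generated words")
plt.tight_layout()
plt.show()
```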

Problems with using large images

Convolutional neural networks applied to computer vision face a similar problem: it can be difficult to train models on very large images. Instead, the model takes a sequence of glimpses over the image to build up an approximate impression before making a prediction.

“An important feature of human perception is that we tend not to process the entire scene at once, but selectively focus our attention on certain parts of the visual space to obtain the required information, and combine local information at different time points to construct an internal representation of the entire scene, thereby guiding subsequent eye movements and decisions.”

——Recurrent Models of Visual Attention (https://arxiv.org/abs/1406.6247), 2014

These glimpse-based modifications can also be considered attention mechanisms, but they are not the kind of attention mechanism discussed in this article.

Related papers:

  • Recurrent Models of Visual Attention, 2014
  • DRAW: A Recurrent Neural Network For Image Generation, 2015
  • Multiple Object Recognition with Visual Attention, 2014

5 Examples of Using Attention Mechanism for Sequence Prediction

This section gives some concrete examples of combining attention mechanisms with recurrent neural networks for sequence prediction.

1. Attention Mechanism in Text Translation

We have already mentioned the text translation example. Given a French sentence as the input sequence, translate it into an English sentence as the output sequence. The attention mechanism is used to associate each word in the output sequence with specific words in the input sequence.

“We extend the basic encoder-decoder architecture by having the model search for some input words or word annotations computed by the encoder when generating each target word. This frees the model from having to encode the entire source sentence into a fixed-length vector and allows the model to focus only on information relevant to the next target word.”

——Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate (https://arxiv.org/abs/1409.0473), 2015

Figure caption: Columns are input sequences, rows are output sequences, and the highlighted blocks represent the association between the two. The lighter the color, the stronger the association.

Image from the paper: Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
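
For reference, the alignment model in Bahdanau et al. scores each source annotation against the previous decoder state with a small feed-forward network (additive attention). A rough NumPy sketch of that scoring, with toy sizes and random untrained weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(annotations, s_prev, Wa, Ua, va):
    """Additive (Bahdanau-style) alignment: score each source annotation h_j
    against the previous decoder state s_{i-1} with a one-layer network."""
    scores = np.tanh(annotations @ Ua.T + s_prev @ Wa.T) @ va   # e_ij
    alphas = softmax(scores)                                    # alignment weights
    context = alphas @ annotations                              # context vector c_i
    return context, alphas

rng = np.random.default_rng(0)
T, n, p = 6, 16, 12                     # source length, annotation size, state size
annotations = rng.normal(size=(T, n))   # encoder outputs h_1..h_T
s_prev = rng.normal(size=p)             # previous decoder state
Wa = rng.normal(size=(p, p)) * 0.1      # applied to the decoder state
Ua = rng.normal(size=(p, n)) * 0.1      # applied to each annotation
va = rng.normal(size=p) * 0.1
context, alphas = additive_attention(annotations, s_prev, Wa, Ua, va)
print(alphas.round(3), context.shape)   # T weights summing to 1, context (16,)
```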

2. Attention Mechanism in Image Description

Unlike the glimpse approach, sequence-based attention mechanisms can be applied to computer vision problems to help a convolutional neural network focus on the relevant parts of an input image while generating an output sequence, as in the typical image captioning task. Given an input image, output an English description of that image. The attention mechanism is used to focus on a different local region of the image for each word in the output sequence.

“We propose an attention-based approach that achieves state-of-the-art performance on three benchmark datasets… We also show how to use the learned attention mechanism to provide more interpretability to the model generation process, demonstrating that the learned alignment is highly consistent with human intuition.”

—— Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016

Figure caption: Similar to the figure above, the underlined words in the output text correspond to the highlighted regions in the image on the right

Image from the paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016
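
In the image case, the items being attended over are spatial positions in a convolutional feature map rather than words. A rough sketch of one captioning step (toy sizes, random untrained weights, and a hypothetical 14x14x512 feature map):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical CNN feature map: a 14x14 grid of 512-d feature vectors,
# flattened into 196 image "locations" the captioner can attend over.
features = rng.normal(size=(14 * 14, 512))
decoder_state = rng.normal(size=256)          # caption LSTM hidden state
W = rng.normal(size=(512, 256)) * 0.02        # projects state into feature space

# One decoding step: score every location against the current state,
# then build a context vector that feeds the next word prediction.
scores = features @ (W @ decoder_state)
alphas = softmax(scores)                      # one weight per image location
context = alphas @ features                   # (512,) attended image summary
focus = alphas.reshape(14, 14)                # can be overlaid on the image
print(focus.shape, context.shape)
```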

3. Attention Mechanism in Textual Entailment

Given a premise scenario and a hypothesis about that scenario, both in English, output whether the premise contradicts the hypothesis, is unrelated to it (neutral), or entails it.

For example:

  • Premise: "Photos from the wedding"
  • Hypothesis: "Someone is getting married"

The attention mechanism is used to associate each word in the hypothesis with the words in the premise and vice versa.

“We present an LSTM-based neural model that reads the two sentences together to determine entailment, rather than encoding each sentence independently into a semantic vector. We then extend the model with a neural word-by-word attention mechanism to encourage reasoning over entailments of pairs of words and phrases… The extended model outperforms the LSTM benchmark by 2.6 percentage points, setting a new state-of-the-art accuracy…”

——Reasoning about Entailment with Neural Attention (https://arxiv.org/abs/1509.06664), 2016

Image from the paper: Reasoning about Entailment with Neural Attention, 2016
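
Word-by-word attention produces a full hypothesis-by-premise alignment matrix rather than a single distribution. A minimal NumPy sketch with random stand-in word representations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32                                   # toy hidden size
premise = rng.normal(size=(5, d))        # e.g. states for 5 premise words
hypothesis = rng.normal(size=(4, d))     # e.g. states for 4 hypothesis words

# Word-by-word attention: each hypothesis word gets a distribution over
# the premise words, giving a (hypothesis x premise) alignment matrix.
scores = hypothesis @ premise.T          # (4, 5) similarity scores
alignment = softmax(scores, axis=1)      # rows sum to 1
contexts = alignment @ premise           # one premise summary per hypothesis word
print(alignment.round(2))
print(contexts.shape)                    # (4, 32)
```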

4. Attention Mechanism in Speech Recognition

Given an English speech segment as the input sequence, output a sequence of phonemes. The attention mechanism is used to associate each phoneme in the output sequence with specific speech frames in the input sequence.

“…we propose a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism that combines both content and location information in order to select the next position in the input sequence during decoding. A promising property of the model is that it can recognize utterances longer than the ones it was trained on.”

——Attention-Based Models for Speech Recognition (https://arxiv.org/abs/1506.07503), 2015.

Image from the paper: Attention-Based Models for Speech Recognition, 2015.
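
A rough sketch of the hybrid idea, loosely following Chorowski et al.: the score for each speech frame combines a content term (frame versus decoder state) with location features obtained by convolving the previous step's attention weights, so the scorer knows where attention was focused last time. Toy sizes and random untrained weights throughout:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d, k = 50, 64, 8                      # frames, hidden size, location features
enc = rng.normal(size=(T, d))            # encoded speech frames
s = rng.normal(size=d)                   # decoder state
alpha_prev = np.full(T, 1.0 / T)         # previous attention distribution

W = rng.normal(size=(d, d)) * 0.05       # applied to the decoder state
V = rng.normal(size=(d, d)) * 0.05       # applied to each frame
U = rng.normal(size=(d, k)) * 0.05       # applied to the location features
w = rng.normal(size=d) * 0.05
conv_filters = rng.normal(size=(k, 11)) * 0.1   # 1-D filters over alpha_prev

# Location features: convolve the previous alignment.
loc = np.stack([np.convolve(alpha_prev, f, mode="same") for f in conv_filters],
               axis=1)                   # (T, k)

# Hybrid score: content term plus location term.
scores = np.tanh(enc @ V.T + s @ W.T + loc @ U.T) @ w
alpha = softmax(scores)                  # new alignment over speech frames
context = alpha @ enc                    # attended acoustic summary
print(alpha.shape, context.shape)        # (50,) (64,)
```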

5. Attention Mechanism in Text Summarization

Given an English article as an input sequence, output an English text to summarize the input sequence. The attention mechanism is used to associate each word in the summary text with the corresponding word in the source text.

“…we propose a model for abstractive sentence summarization based on a neural attention mechanism, building on recent advances in neural machine translation. We combine this probabilistic model with a generation algorithm that produces accurate abstractive summaries.”

——A Neural Attention Model for Abstractive Sentence Summarization (https://arxiv.org/abs/1509.00685), 2015

Image from the paper: A Neural Attention Model for Abstractive Sentence Summarization, 2015.
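
In the summarization model of Rush et al., the attention over source words is conditioned on the few most recently generated summary words. A loose NumPy sketch of that idea, with made-up sizes and random untrained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, src_len, C = 50, 30, 3                 # embedding size, source length, context window
src = rng.normal(size=(src_len, d))       # embeddings of the source words
recent = rng.normal(size=(C, d))          # embeddings of the last C summary words
P = rng.normal(size=(d, C * d)) * 0.02    # maps the summary context to a query

# Attention-based encoder: the distribution over source words is
# conditioned on the recently generated summary words.
query = P @ recent.reshape(-1)            # (d,) query from the summary context
weights = softmax(src @ query)            # one weight per source word
context = weights @ src                   # source summary used to pick the next word
print(weights.argmax(), context.shape)
```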

Further reading

If you are interested in adding an attention mechanism to LSTMs, you can read the following:

  • Attention and memory in deep learning and NLP
  • Attention Mechanism
  • Survey on Attention-based Models Applied in NLP (http://yanran.li/peppypapers/2015/10/07/survey-attention-model-1.html)
  • [Quora Q&A] What is exactly the attention mechanism introduced to RNN? (https://www.quora.com/What-is-exactly-the-attention-mechanism-introduced-to-RNN-recurrent-neural-network-It-would-be-nice-if-you-could-make-it-easy-to-understand)
  • [Quora Q&A] What is Attention Mechanism in Neural Networks? (https://www.quora.com/What-is-Attention-Mechanism-in-Neural-Networks)

Summary

This blog post introduced the use of the attention mechanism in LSTM recurrent neural networks for sequence prediction.

Specifically:

  • The encoder-decoder structure in recurrent neural networks uses a fixed-length internal representation, which imposes limitations on the learning of very long sequences.
  • The attention mechanism overcomes this limitation of the encoder-decoder architecture by letting the network learn which items in the input sequence to attend to for each item in the output sequence.
  • This method has been applied to a variety of sequence prediction problems, including text translation, speech recognition, etc.
