The encoder-decoder architecture is popular because it has produced state-of-the-art results on a wide range of problems. Its limitation is that it encodes the input sequence into an internal representation of fixed length. This caps the length of input sequence the model can handle well and causes it to perform poorly on particularly long inputs. In this blog post, we will discover how an attention mechanism in recurrent neural networks overcomes this limitation. After reading this blog, you will know:

- The limitation of the encoder-decoder architecture and its fixed-length internal representation.
- How the attention mechanism overcomes this limitation by letting the network learn where to attend in the input sequence for each item in the output sequence.
- Five applications of attention mechanisms with recurrent neural networks: text translation, image description, semantic entailment, speech recognition, and text summarization.
The Problem With Long Sequences

In an encoder-decoder recurrent neural network, one long short-term memory (LSTM) network learns to encode the input sequence into a fixed-length internal representation, and a second LSTM network reads that internal representation and decodes it into the output sequence. This structure has demonstrated state-of-the-art results on difficult sequence prediction problems such as text translation and has quickly become the dominant approach. For example, the approach was introduced in Sutskever et al., "Sequence to Sequence Learning with Neural Networks" (2014), and Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (2014).
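To make this bottleneck concrete, here is a minimal sketch with made-up sizes, using a plain tanh RNN rather than an LSTM to keep it short; the parameter names and dimensions are illustrative assumptions, not any particular paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 16, 10                      # hidden size and toy vocabulary size (assumed)

# randomly initialised parameters, just to show the data flow
W_xh, W_hh = rng.normal(size=(H, V)), rng.normal(size=(H, H))
W_hy = rng.normal(size=(V, H))

def one_hot(i):
    x = np.zeros(V)
    x[i] = 1.0
    return x

def encode(input_ids):
    """Encoder RNN: the only thing kept is the final hidden state,
    a fixed-length vector no matter how long the input is."""
    h = np.zeros(H)
    for i in input_ids:
        h = np.tanh(W_xh @ one_hot(i) + W_hh @ h)
    return h                        # <- the fixed-length bottleneck

def decode(h, steps):
    """Decoder RNN: must produce the whole output from that one vector."""
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)
        outputs.append(int(np.argmax(W_hy @ h)))
    return outputs

short_seq, long_seq = [1, 2, 3], list(range(10)) * 5
# both a 3-token and a 50-token input are squeezed into the same 16 numbers
print(encode(short_seq).shape, encode(long_seq).shape)   # (16,) (16,)
print(decode(encode(short_seq), steps=4))
```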
The encoder-decoder architecture is still able to achieve excellent results on many problems. However, it suffers from the constraint that all input sequences must be encoded into an internal vector of fixed length. This constraint limits the performance of these networks, especially on longer input sequences such as long sentences in text translation. "One potential problem with this encoder-decoder approach is that the neural network needs to compress all the necessary information in the source sentence into a fixed-length vector. This makes it difficult for the neural network to cope with long sentences, especially those longer than the sentences in the training corpus."
Attention Mechanism in Sequences

The attention mechanism is a method that frees the encoder-decoder architecture from a fixed-length internal representation. It works by keeping the LSTM encoder's intermediate output at every step of the input sequence, and then training the model to learn to pay selective attention to these inputs and relate them to items in the output sequence. In other words, each item in the output sequence is conditioned on selectively chosen items from the input sequence. "Each time the proposed model generates a word in the translation, it searches the source sentence for the set of positions where the most relevant information is concentrated. It then predicts the next target word based on the context vectors associated with these source positions and on the previously generated target words." "…the model encodes the input sentence into a sequence of vectors and adaptively chooses a subset of these vectors while decoding the translation. This frees the neural translation model from having to compress all the information of a source sentence, regardless of its length, into a fixed-length vector."
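As a rough sketch of how this works (the additive scoring function, names, and shapes below are assumptions in the spirit of Bahdanau-style attention, not a definitive implementation): the decoder scores every retained encoder state against its own current state, normalizes the scores with a softmax, and takes the weighted sum as the context for the next output item.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """One decoder step of additive attention (a sketch).

    encoder_states: (T, H) hidden states kept for every input position
    decoder_state:  (H,)   previous decoder hidden state
    W_enc, W_dec:   (A, H) learned projections; v: (A,) scoring vector
    Returns the attention weights over the T inputs and the context vector.
    """
    # score each retained encoder state against the current decoder state
    scores = np.array([
        v @ np.tanh(W_enc @ h + W_dec @ decoder_state)
        for h in encoder_states
    ])
    weights = softmax(scores)              # how much to "attend" to each input
    context = weights @ encoder_states     # weighted sum of encoder states
    return weights, context

# toy example: 5 input steps, hidden size 8, attention size 6 (all assumed)
rng = np.random.default_rng(0)
T, H, A = 5, 8, 6
enc = rng.normal(size=(T, H))
dec = rng.normal(size=(H,))
weights, context = additive_attention(enc, dec,
                                      rng.normal(size=(A, H)),
                                      rng.normal(size=(A, H)),
                                      rng.normal(size=(A,)))
print(weights.round(3), weights.sum())     # weights are non-negative and sum to 1
```

Because the context vector is recomputed at every output step, the decoder is no longer tied to a single fixed-length summary of the whole input.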
Although this increases the computational burden of the model, it results in a more targeted and better-performing model. In addition, the model can show how it attends to the input sequence while predicting the output sequence. This helps us understand and analyze what the model is focusing on, and to what degree, for specific input-output pairs. "The proposed approach allows us to visually observe the (soft) alignment of each word in the generated sequence with some words in the input sequence, which can be achieved by visualizing the annotation weights… Each row of the matrix in each figure represents the weights associated with the annotations. This allows us to see which positions in the source sentence were emphasized when generating each target word."
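For illustration, the following tiny example plots a hypothetical annotation-weight matrix as a heatmap; the words and weights are invented, and only the visualization pattern is the point:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical 3x3 annotation-weight matrix:
# rows = generated target words, columns = source words
attn = np.array([[0.80, 0.15, 0.05],   # "the"   -> mostly "la"
                 [0.05, 0.15, 0.80],   # "blue"  -> mostly "bleue"
                 [0.10, 0.80, 0.10]])  # "house" -> mostly "maison"
source = ["la", "maison", "bleue"]
target = ["the", "blue", "house"]

fig, ax = plt.subplots()
ax.imshow(attn, cmap="gray", vmin=0.0, vmax=1.0)   # lighter cell = stronger attention
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("source (input) words")
ax.set_ylabel("generated (output) words")
plt.show()
```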
The Problem With Large Images

Convolutional neural networks used in computer vision face a similar problem: it is difficult to train models on very large images. Instead, the model takes a series of glimpses of the image to form an approximate impression before making a prediction. "An important feature of human perception is that we tend not to process the entire scene at once. Instead, we selectively focus our attention on parts of the visual space to obtain the required information, and combine local information from different fixations over time to build an internal representation of the whole scene, which guides subsequent eye movements and decisions."
These glimpse-based modifications can also be considered a form of attention, but they are not the attention mechanism discussed in this post. Related papers:
5 Examples of Attention Mechanisms for Sequence Prediction

This section gives some concrete examples of combining attention mechanisms with recurrent neural networks for sequence prediction.

1. Attention Mechanism in Text Translation

We have already mentioned the text translation example: given an input sequence of French sentences, translate it and output an English sentence. The attention mechanism is used to observe which specific words in the input sequence correspond to each word in the output sequence. "We extend the basic encoder-decoder architecture by letting the model search through a set of input words, or word annotations computed by the encoder, when generating each target word. This frees the model from having to encode the entire source sentence into a fixed-length vector and allows the model to focus only on the information relevant to the next target word."
Figure caption: Columns are the input sequence, rows are the output sequence, and the highlighted blocks show the association between the two; the lighter the color, the stronger the association. Image from the paper: Dzmitry Bahdanau, et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2015.

2. Attention Mechanism in Image Description

Unlike the glimpse approach, sequence-based attention mechanisms can be applied to computer vision problems to help a convolutional neural network focus on the relevant parts of an input image while producing an output sequence, as in the typical image description task: given an input image, output an English description of the image. The attention mechanism is used to focus on the local image region associated with each word in the output sequence. "We propose an attention-based approach that achieves state-of-the-art performance on three benchmark datasets… We also show how the learned attention mechanism can make the model's generation process more interpretable, demonstrating that the learned alignments are highly consistent with human intuition."
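A minimal sketch of soft attention over image regions in the spirit of this approach; the grid size, feature dimensions, and randomly initialized parameters are assumptions for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical CNN output: a 7x7 grid of 512-dim feature vectors (assumed shapes)
rng = np.random.default_rng(0)
features = rng.normal(size=(7, 7, 512)).reshape(-1, 512)   # 49 image regions
decoder_state = rng.normal(size=256)                        # caption LSTM state

# learned projections (randomly initialised here, purely for illustration)
W_img, W_dec, v = (rng.normal(size=(128, 512)),
                   rng.normal(size=(128, 256)),
                   rng.normal(size=128))

# score every image region against the current decoder state
scores = np.array([v @ np.tanh(W_img @ a + W_dec @ decoder_state)
                   for a in features])
weights = softmax(scores)            # one weight per image region
context = weights @ features         # weighted average region fed to the decoder

print(weights.reshape(7, 7).round(2))  # where the model "looks" for this word
```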
Figure caption: Similar to the figure above, the underlined words in the output text correspond to the highlighted regions in the image on the right. Image from the paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016.

3. Attention Mechanism in Semantic Entailment

Given a premise scenario and a hypothesis about that scenario, both in English, output whether the premise contradicts the hypothesis, whether the two are unrelated, or whether the premise entails the hypothesis. For example, the premise "A man is playing a guitar on stage" entails the hypothesis "A person is playing a musical instrument."
The attention mechanism is used to associate each word in the hypothesis with words in the premise, and vice versa. "We propose an LSTM-based neural model that reads the two sentences in one go to determine entailment, rather than encoding each sentence independently into a semantic vector. We then extend the model with a word-by-word neural attention mechanism that encourages reasoning about entailment relationships between pairs of words and phrases… The extended model outperforms the LSTM benchmark by 2.6 percentage points, setting a new accuracy record…"
Image from the paper: Reasoning about Entailment with Neural Attention, 2016.

4. Attention Mechanism in Speech Recognition

Given an English speech segment as input, output a sequence of phonemes. The attention mechanism is used to associate each phoneme in the output sequence with specific speech frames in the input sequence. "…we propose a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism that combines both content and location information in order to select the next position in the input sequence during decoding. A promising property of the model is that it can recognize speech longer than the utterances it was trained on."
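A rough sketch of the content-plus-location idea; the shapes, parameter names, and smoothing filter below are illustrative assumptions rather than the paper's exact model. The attention weights from the previous output step are convolved into location features, which enter the score alongside the usual content terms:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(enc, dec_state, prev_weights, W, V, U, w, loc_filter):
    """One decoding step of content + location ("hybrid") attention (a sketch).

    enc:          (T, H) encoder states, one per speech frame
    dec_state:    (H,)   previous decoder state (content information)
    prev_weights: (T,)   attention weights from the previous output step
                         (location information)
    """
    # location features: smooth the previous alignment with a small filter
    loc = np.convolve(prev_weights, loc_filter, mode="same")          # (T,)
    scores = np.array([w @ np.tanh(W @ dec_state + V @ h + U * f)
                       for h, f in zip(enc, loc)])
    weights = softmax(scores)
    context = weights @ enc            # summary of the frames being attended to
    return weights, context

# toy usage: 6 frames, hidden size 8, attention size 5 (all assumed)
rng = np.random.default_rng(1)
T, H, A = 6, 8, 5
enc = rng.normal(size=(T, H))
weights, context = hybrid_attention(
    enc, rng.normal(size=H), np.full(T, 1.0 / T),   # start from a uniform alignment
    rng.normal(size=(A, H)), rng.normal(size=(A, H)),
    rng.normal(size=A), rng.normal(size=A),
    np.array([0.25, 0.5, 0.25]))
print(weights.round(3), context.shape)
```

Using the previous alignment as an input biases the model toward moving monotonically through the speech frames, which is what lets it handle inputs longer than those seen in training.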
Image from the paper: Attention-Based Models for Speech Recognition, 2015.

5. Attention Mechanism in Text Summarization

Given an English article as the input sequence, output an English text that summarizes it. The attention mechanism is used to associate each word in the output summary with the corresponding words in the source text. "…we propose a model for abstractive summarization based on a neural attention mechanism, building on recent advances in neural machine translation. We combine this probabilistic model with a generation algorithm that produces accurate abstractive summaries."
Image from the paper: A Neural Attention Model for Abstractive Sentence Summarization, 2015.

Further Reading

If you are interested in adding attention mechanisms to LSTMs, you may want to read the following:
Summary

This blog post introduced the use of attention mechanisms in LSTM recurrent neural networks for sequence prediction. Specifically:

- The encoder-decoder architecture compresses the input sequence into a fixed-length internal representation, which limits performance on long input sequences.
- The attention mechanism overcomes this limitation by keeping the encoder's output for every input step and learning to attend selectively to those outputs when generating each item of the output sequence.
- Attention mechanisms have been combined with recurrent neural networks for text translation, image description, semantic entailment, speech recognition, and text summarization.