Over the past few decades, the way we deal with information has changed fundamentally. The bottleneck is no longer acquiring information; the real challenge is digesting it. We all know the feeling: there is simply too much to read to stay on top of work, news, and social media. To address this challenge, we have been studying how AI can help people cope with the flood of information, and one promising solution is to use algorithms to automatically summarize long documents. However, training a model that can generate long, coherent, and meaningful summaries is still an open research problem. In fact, generating any kind of long text is difficult even for the most advanced deep learning algorithms. To make summarization work, we introduce two separate, important improvements: a more contextual word-generation model and a new way of training summarization models with reinforcement learning (RL). Combining these two training methods lets the overall system turn long texts such as news articles into relevant, highly readable multi-sentence summaries, with results far better than previous approaches. Our algorithm can be trained on different kinds of text and summary lengths. In this blog post, we present the main contributions of our model and explain the challenges specific to natural language summarization.
Figure 1 (see the animation in the original post): How our model generates a multi-sentence summary from a news article. For each generated word, the model attends to specific words of the input and to the output it has already produced.

Extraction and abstraction

Automatic summarization models can work in one of two ways: by extraction or by abstraction. Extractive models perform a "copy and paste" operation: they select relevant phrases from the input document and concatenate them to form a summary. They are quite robust because they reuse natural language expressions taken directly from the document, but they lack flexibility because they cannot introduce new words or connectives, and their output can read unlike anything a human would write.

Abstractive models, on the other hand, generate a summary from an "abstracted" representation of the content: they can use words that never appear in the original input document. This lets them produce more fluent and natural summaries, but it also makes them harder to build, because we must ensure that they generate coherent phrases and connectives.

Although abstractive models are more powerful in theory, they often make mistakes in practice. Typical mistakes include incoherent, irrelevant, or repeated phrases in the generated summaries, especially when producing long outputs. Their summaries also frequently lack overall coherence, flow, and readability. To address these problems, we clearly need a more robust and coherent abstractive summarization model. To understand our new abstractive model, let us first define its basic building blocks and then explain our new training approach.

Reading and generating text with an encoder-decoder model

Recurrent neural networks (RNNs) are a class of deep learning models that can process sequences of variable length (such as text) and compute useful representations (hidden states) for them, one step at a time. The network processes each element of the sequence (in this case, each word) one by one; for each new input, it computes a new hidden state as a function of that input and the previous hidden state. In this way, the hidden state computed at each word is a function of all the words read up to that point.

Figure 2: A recurrent neural network reads an input sentence by applying the same function (green box) to each word.

Recurrent neural networks can also be used to generate output sequences. At each step, the RNN's hidden state is used to generate a new word, which is added to the output and fed back as the next input.

Figure 3: A recurrent neural network can generate an output sequence by reusing each output word as the input to the next step.

The input (reading) and output (generating) RNNs can be combined into a joint model in which the final hidden state of the input RNN is used as the initial hidden state of the output RNN. Combined this way, the joint model can read any text and generate different text from it. This framework is called an encoder-decoder RNN (also known as Seq2Seq) and serves as the basis of our summarization model.
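To make this concrete, here is a minimal encoder-decoder sketch in PyTorch. The layer sizes, the choice of GRU cells, and the greedy decoding loop are illustrative assumptions for this post, not the exact architecture of our summarization model.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder (Seq2Seq) RNN sketch. Sizes and cell types are
# illustrative assumptions, not the configuration used in our model.
VOCAB_SIZE, EMB_DIM, HIDDEN_DIM = 10000, 128, 256

embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
encoder_rnn = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
decoder_cell = nn.GRUCell(EMB_DIM, HIDDEN_DIM)
output_proj = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)   # hidden state -> word scores

def summarize(input_ids, start_id=1, max_len=20):
    """Read the input sequence, then generate an output sequence word by word."""
    # Encoder: each step computes a new hidden state from the current word and
    # the previous hidden state; the final state summarizes the whole input.
    _, h = encoder_rnn(embed(input_ids))           # h: (1, batch, HIDDEN_DIM)
    h = h.squeeze(0)                               # (batch, HIDDEN_DIM)

    # Decoder: start from the encoder's final hidden state and feed each
    # generated word back in as the next input (greedy decoding).
    prev_word = torch.full((input_ids.size(0),), start_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        h = decoder_cell(embed(prev_word), h)      # new decoder hidden state
        prev_word = output_proj(h).argmax(dim=-1)  # pick the most likely word
        outputs.append(prev_word)
    return torch.stack(outputs, dim=1)             # (batch, max_len)

# Toy run: "summarize" one 12-word document of random word ids.
print(summarize(torch.randint(0, VOCAB_SIZE, (1, 12))).shape)  # torch.Size([1, 20])
```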
On top of this encoder-decoder foundation, we replace the traditional encoder RNN with a bidirectional encoder, which uses two different RNNs to read the input sequence: one reads the text from left to right (as shown in Figure 4) and the other reads it from right to left. This helps our model represent the input using context from both directions.

Figure 4: An encoder-decoder recurrent neural network model can be used for sequence-to-sequence tasks in natural language, such as summarization.

A new attention and decoding mechanism

To make our model produce more coherent output, we let the decoder look back at parts of the input document when generating new words, using a technique called temporal attention. Instead of relying entirely on its own hidden state, the decoder uses an attention function to incorporate contextual information from different parts of the input text. This attention is then modulated to ensure that the model attends to different parts of the input as it generates the output, increasing the information coverage of the summary.

In addition, to ensure that the model does not repeat itself, we also let it look back at what it has already produced. We define an intra-decoder attention function that looks back at the previous hidden states of the decoder RNN. Finally, the decoder combines the context vector from temporal attention with the context vector from intra-decoder attention to generate the next word of the output. Figure 5 illustrates how these two attention functions are combined at a given decoding step (a simplified code sketch of this step appears at the end of this section).

Figure 5: Two context vectors (labeled "C"), computed from the encoder hidden states and from the decoder hidden states. These context vectors are combined with the current decoder hidden state (labeled "H") to generate a new word (right side) and append it to the output sequence.

How to train this model? Supervised learning vs. reinforcement learning

The most common way to train such a model on real data, such as news articles, is teacher forcing: the model is given a reference summary and receives a word-by-word error signal (or "local supervision", as shown in Figure 6) each time it generates a word that differs from the reference.

Figure 6: Model training under supervised learning. Each generated word receives a training supervision signal, computed by comparing it with the word at the same position in the reference summary.

This approach can be used to train any RNN-based sequence generation model and gives very reasonable results. However, for the task at hand, a summary does not need to match the reference sequence word for word in order to be correct. Two editors could write completely different summaries of the same news article, with different styles, word choices, and even sentence orders, and both could still do the job well. The problem with teacher forcing is that as soon as a few words have been generated, the training signal becomes misleading: by forcing the model to follow the reference summary exactly, it cannot adjust to a beginning that is different from the reference but equally valid. With this in mind, we looked for something better than teacher forcing and turned to a very different kind of training: reinforcement learning (RL).
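As promised above, here is a simplified sketch of the decoding step from Figure 5, showing how the two context vectors are combined with the decoder hidden state. The dot-product scoring and the plain concatenation are simplifying assumptions; the actual model uses learned attention functions and a temporal normalization of the encoder attention scores.

```python
import torch
import torch.nn.functional as F

def decode_step(enc_states, past_dec_states, dec_state, out_layer):
    """One decoding step combining temporal and intra-decoder attention (sketch)."""
    # Temporal attention over the input (encoder hidden states).
    enc_scores = enc_states @ dec_state                 # (src_len,)
    c_enc = F.softmax(enc_scores, dim=0) @ enc_states   # input context vector

    # Intra-decoder attention over previously generated decoder states,
    # which helps the model avoid repeating itself.
    if past_dec_states is not None:
        dec_scores = past_dec_states @ dec_state
        c_dec = F.softmax(dec_scores, dim=0) @ past_dec_states
    else:
        c_dec = torch.zeros_like(dec_state)

    # Combine both context vectors with the current hidden state to score
    # the next output word, then pick the most likely one.
    word_logits = out_layer(torch.cat([dec_state, c_enc, c_dec]))
    return word_logits.argmax()

# Example wiring with illustrative sizes:
# hidden_dim, vocab_size = 256, 10000
# out_layer = torch.nn.Linear(3 * hidden_dim, vocab_size)
# next_word = decode_step(torch.randn(30, hidden_dim), None,
#                         torch.randn(hidden_dim), out_layer)
```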
With reinforcement learning, the algorithm first asks the model to generate a summary entirely on its own; an external scorer then compares the generated summary against the reference text. The score tells the model how good its summary was. If the score is high, the model updates itself so that such summaries become more likely in the future. If the score is low, the model adjusts its generation process to make similar summaries less likely. This reinforcement scheme is very good at evaluating the entire generated sequence globally, rather than judging the summary word by word.

Figure 7: In the reinforcement learning training scheme, the model receives no local, per-word supervision; guidance comes from comparing the overall output against the reference answer.

How do we evaluate summary quality?

So what exactly is the scorer mentioned above, and how does it judge the actual quality of a summary? Since having humans manually evaluate millions of summaries is impractical, we use a metric called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE compares sub-phrases of the generated summary against sub-phrases of the reference summary without requiring them to match exactly. Its variants (including ROUGE-1, ROUGE-2, and ROUGE-L) all work on the same principle but use different sub-sequence lengths.

Although ROUGE scores correlate well with human judgment on the whole, the summaries with the highest ROUGE scores are not necessarily the most readable or fluent. Training with reinforcement learning alone means the model is optimized purely to maximize ROUGE, which creates new problems of its own: when we inspected the summaries with the highest ROUGE scores, we found that some of them were barely readable.

To get the best of both worlds, our model is trained with both teacher forcing and reinforcement learning, using word-level supervision together with whole-summary guidance to maximize the coherence and readability of the output. In particular, we found that ROUGE-optimized reinforcement learning greatly improves recall (i.e., all the important information that should be summarized is actually included), while word-level supervision improves language fluency, making the output more coherent and readable.

Figure 8: Combining supervised learning (red arrows) and reinforcement learning (purple arrows), our model uses both local and global feedback to optimize readability as well as the overall ROUGE score.

Until recently, the highest ROUGE-1 score for abstractive summarization on the CNN/Daily Mail dataset was 35.46. Our combined supervised and reinforcement learning scheme, together with the intra-decoder attention RNN, raises this score to 39.87, and the RL-only model reaches 41.16. Figure 9 compares our models' scores with those of other existing models. Although the RL-only model has a higher ROUGE score, the combined supervised + RL model produces more relevant, more readable summaries and is therefore better suited to this task. Note that See et al. used a slightly different data format, so their results cannot be compared directly with ours or with the other models' scores; they are included for reference only.

Figure 9: Summarization results on the CNN/Daily Mail dataset for our models and several existing extractive and abstractive approaches.
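To make the combined objective more concrete, here is a minimal sketch of a mixed training loss. The unigram-overlap reward below is only a rough stand-in for ROUGE, and the self-critical baseline and the gamma weighting are illustrative assumptions; the paper describes the exact reward and mixing used in our model.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between token lists; a rough stand-in for ROUGE."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(candidate), overlap / len(reference)
    return 2 * prec * rec / (prec + rec)

def mixed_loss(log_probs_sampled, sampled, greedy, reference,
               teacher_forcing_loss, gamma=0.98):
    # Self-critical reward: how much better the sampled summary scores than
    # the model's own greedy (baseline) summary on the reference.
    reward = rouge1_f1(sampled, reference) - rouge1_f1(greedy, reference)
    # Policy-gradient (RL) loss: increase the likelihood of sampled summaries
    # that beat the baseline, decrease it otherwise.
    rl_loss = -reward * log_probs_sampled.sum()
    # Blend global (RL) guidance with word-level (teacher forcing) supervision;
    # gamma here is an illustrative weighting, not the value used in the paper.
    return gamma * rl_loss + (1 - gamma) * teacher_forcing_loss

# Example reward computation on toy token lists:
# rouge1_f1("the cat sat".split(), "the cat sat down".split())  # ~0.857
```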
Example outputs

So how do these improvements translate into actual summaries? Here we show several multi-sentence summaries generated from articles in the dataset. Below are the results from our model and from a simpler baseline, both trained on the CNN/Daily Mail dataset. As you can see, although the summaries have improved substantially, they are still far from perfect.
Figure 10: More examples of summaries generated by our model, compared with human-written summaries of the same articles.

To illustrate the improvement our approach brings to text summarization, Figure 11 also shows the output generated after removing the intra-attention mechanism and reinforcement learning training.

Article: Tony Blair has said he does not want to retire until he is 91 – as he unveiled plans to set up a 'cadre' of ex-leaders to advise governments around the world. The defiant 61-year-old former Prime Minister said he had 'decades' still in him and joked that he would 'turn to drink' if he ever stepped down from his multitude of global roles. He told Newsweek magazine that his latest ambition was to recruit former heads of government to go round the world to advise presidents and prime ministers on how to run their countries. In an interview with the magazine Newsweek Mr Blair said he did not want to retire until he was 91 years old. Mr Blair said his latest ambition is to recruit former heads of government to advise presidents and prime ministers on how to run their countries. Mr Blair said he himself had been 'mentored' by US president Bill Clinton when he took office in 1997. And he said he wanted to build up his organizations, such as his Faith Foundation, so they are 'capable of global changing policy'. Last night, Tory MPs expressed horror at the prospect of Mr Blair remaining in public life for another 30 years. Andrew Bridgen said: 'We all know weak Ed Miliband's called on Tony to give his flailing campaign a boost, but the attention's clearly gone to his head.' (...)

Summary (reference, written by a human): The former Prime Minister claimed he has 'decades' of work left in him. Joked he would 'turn to drink' if he ever stepped down from global roles. Wants to recruit former government heads to advise current leaders. He was 'mentored' by US president Bill Clinton when he started in 1997.

Summary (our model): Mr Blair said he did not want to retire until he was 91 years old. 61-year-old former prime minister said he would 'turn to drink' if he ever stepped down from his own. He said he wanted to build up his charity to advise presidents and prime ministers on how to run their countries. Mr Blair says he is to recruit former heads of government to go round the world to advise ministers. He says he wants to emulate ex-Israeli president Shimon Peres.

Summary (without intra-attention and reinforcement learning training): 61-year-old former prime minister said he did not want to retire until he was 91 years old. He said he wanted to build up his organizations, such as his Faith Foundation. He said he wanted to emulate ex-Israeli president Shimon Peres. Mr Blair said he wanted to emulate ex-Israeli President Shimon Peres. 1997. Mr Blair said he wanted to

Figure 11: Comparison of an example summary generated by our full model with one generated after removing these improvements. New words that do not appear in the original document are shown in green. Repeated phrases in the summary are shown in red.

Conclusion

Our model significantly improves on the state of the art in multi-sentence text summarization, outperforming existing abstractive and extractive baselines. We believe that our contributions, the intra-decoder attention module and the combined training objective, could also improve other sequence generation tasks, especially those with long text outputs.
Our work also highlights the limitations of automatic evaluation metrics such as ROUGE, and our results suggest that better metrics are needed to evaluate and optimize summarization models. An ideal metric would correlate closely with human judgment, both on the coherence and on the readability of a summary. When such a metric is used to train a summarization model, the quality of the resulting summaries should improve further.

Citation

If you would like to cite this blog post in a publication, please cite: Romain Paulus, Caiming Xiong and Richard Socher. 2017. A Deep Reinforced Model for Abstractive Summarization.

Acknowledgements

Special thanks to Melvin Gruesbeck for providing the images and statistics in this article.

Original link: https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization