This article is reprinted with permission from the AI new media Quantum Bit (public account ID: QbitAI). Please contact the source for reprinting.

General language models such as BERT, GPT-2, and XLNet have demonstrated their power on a variety of tasks, such as text generation and question answering. When fine-tuned for specific language tasks, they can achieve SOTA performance.

All of the above NLP models are "generalists". Although they are versatile, they must be fine-tuned for each specific task, and their training datasets are very large, putting them beyond the reach of ordinary users. If we instead develop a non-general NLP model specialized for a single task, can performance improve while training costs fall?

This is PEGASUS, a model released by Google that is designed specifically for machine-generated summaries. It has set new SOTA results in this field and was accepted at ICML 2020. PEGASUS (known as "Tianma" in Chinese) can approach the level of human summarization with only 1,000 training samples, greatly reducing the need for supervised data and making low-cost use possible.

From filling in the blanks to generating summaries

The full name of PEGASUS is Pre-training with Extracted Gap-sentences for Abstractive Summarization. The idea is to design a self-supervised pre-training objective, gap-sentence generation, to improve fine-tuning performance on abstractive summarization. In previous NLP research, self-supervised pre-training had no specific downstream goal in mind; it might serve text generation or summary extraction, so models tended to be general-purpose. The Google researchers hypothesized that the closer the self-supervised pre-training objective is to the final downstream task, the better the fine-tuning performance will be. So what do the "gap sentences" in the paper's title mean?
In pre-training PEGASUS, the researchers deleted some sentences from a document and asked the model to restore them. These deleted sentences are the "gap sentences". Such a challenging task forces the model to learn to discover general facts about a document and to extract information from the entire text.

Google found that masking "important" sentences works best, because it makes the self-supervised targets more similar to a real summary. These sentences are identified automatically by finding the ones most similar to the rest of the document, with similarity judged by the ROUGE criterion. ROUGE uses n-gram overlap to compute the similarity of two texts, with scores ranging from 0 to 100.

The authors selected 12 datasets with rich and diverse content, including news, scientific papers, patent documents, short stories, emails, legal documents, and how-to guides, demonstrating that the framework applies to a wide range of topics. Compared with Google's previously proposed T5, PEGASUS uses only 5% as many parameters.

1,000 training samples are enough to surpass humans

Although PEGASUS shows excellent performance on large datasets, surprisingly, the model does not require a large number of fine-tuning samples to achieve near-SOTA performance. The figure below shows the relationship between ROUGE scores and the number of supervised samples on four selected summarization datasets. The dotted line shows the performance of a fully supervised Transformer encoder-decoder without pre-training. Compared to this baseline, PEGASUS performs better on most tasks with only 1,000 fine-tuning samples, even though in practice the full supervised datasets are several orders of magnitude larger.
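The gap-sentence selection described above can be sketched roughly as follows. This is a simplified illustration, not Google's actual implementation: the real PEGASUS pipeline uses more elaborate ROUGE variants and selection strategies, and the tokenization here is deliberately naive. All function names and the `mask_ratio` parameter are illustrative assumptions.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, mask_ratio=0.3):
    """Score each sentence by ROUGE-1 against the rest of the document
    and return the indices of the top-scoring ("important") sentences,
    which would then be masked as the pre-training targets."""
    n_mask = max(1, int(len(sentences) * mask_ratio))
    scores = []
    for i, sent in enumerate(sentences):
        rest = [tok for j, s in enumerate(sentences) if j != i
                for tok in s.split()]
        scores.append((rouge1_f1(sent.split(), rest), i))
    top = sorted(scores, reverse=True)[:n_mask]
    return sorted(i for _, i in top)

doc = [
    "The frigates sailed from Portsmouth on Tuesday.",
    "Four frigates joined the exercise in the North Sea.",
    "Weather conditions delayed the exercise by a day.",
]
print(select_gap_sentences(doc, mask_ratio=0.34))
```

In this toy document the second sentence shares the most vocabulary with the rest of the text, so it is the one selected for masking; during pre-training the model would have to regenerate it from the remaining sentences.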
This "sample efficiency" greatly improves the practicality of text summarization models, because it sharply reduces the scale and cost of supervised data collection.

In addition to machine-computed ROUGE scores, Google also ran a "Turing test" on the summaries: model-generated summaries and human-written summaries were shown side by side for raters to evaluate. Experiments on three different datasets show that human raters sometimes prefer the machine-generated summaries.

Of course, the PEGASUS model is not without shortcomings, and Google found a failure case. The authors examined a passage in the XSum dataset that mentioned four British frigates by name. Although the number "4" never appeared in the passage, PEGASUS still correctly inferred the number of frigates. The model remained correct as the number of ships varied from 2 to 5, but when the count increased to 6, PEGASUS mistakenly reported 7. This shows that the model's "symbolic reasoning" ability is limited.

Finally, to support ongoing research and ensure reproducibility, Google has released PEGASUS's code, model checkpoints, and summarization datasets on GitHub.

Portal
Blog address:
Paper address:
Code address: