This article is reprinted with permission from the AI new media Quantum Bit (public account ID: QbitAI). Please contact the source for reprinting.

General language models such as BERT, GPT-2, and XLNet have demonstrated their power on a variety of tasks, such as text generation and question answering. When fine-tuned for specific language tasks, they can achieve SOTA performance.

All of the above NLP models are "generalists". Although they are versatile, they must be fine-tuned for each specific task, and their training datasets are very large, putting them beyond the reach of ordinary users. If we instead develop a non-general NLP model specialized for a single task, can performance improve while training costs fall?

This is PEGASUS, a model released by Google that is designed specifically for machine-generated summaries. It has set new SOTA results in this field and was accepted at ICML 2020. PEGASUS (known as "Tianma" in Chinese) can approach the level of human summarization with only 1,000 training samples, greatly reducing the need for supervised data and making low-cost use possible.

From filling in the blanks to generating summaries

The full name of PEGASUS is Pre-training with Extracted Gap-sentences for Abstractive Summarization. The idea is to design a self-supervised pre-training objective, gap-sentence generation, to improve fine-tuning performance on abstractive summarization. In previous NLP research, self-supervised pre-training had no specific downstream goal in mind; it might serve text generation or summary extraction, so models tended to be general-purpose. The Google researchers hypothesized that the closer the self-supervised pre-training objective is to the final downstream task, the better the fine-tuning performance will be. So what do the "gap sentences" in the paper's title mean?
In pre-training PEGASUS, the researchers deleted some sentences from a document and asked the model to restore them. These deleted sentences are the "gap sentences". Such a challenging task forces the model to learn to discover general facts about a document and to extract information from the entire text.

Google found that masking "important" sentences works best, because it makes the self-supervised targets more similar to a real summary. These sentences are identified automatically by finding the ones most similar to the rest of the document, with similarity judged by the ROUGE criterion. ROUGE uses n-gram overlap to compute the similarity of two texts, with scores ranging from 0 to 100.

The authors selected 12 datasets with rich and diverse content, including news, scientific papers, patent documents, short stories, emails, legal documents, and how-to guides, demonstrating that the framework applies to a wide range of topics. Compared with Google's previously proposed T5, PEGASUS uses only 5% as many parameters.

1,000 training samples are enough to surpass humans

Although PEGASUS shows excellent performance on large datasets, surprisingly, the model does not require a large number of fine-tuning samples to achieve near-SOTA performance. The figure below shows the relationship between ROUGE scores and the number of supervised samples on four selected summarization datasets. The dotted line shows the performance of a fully supervised Transformer encoder-decoder without pre-training. Compared to this baseline, PEGASUS performs better on most tasks with only 1,000 fine-tuning samples, even though in practice the full supervised datasets are several orders of magnitude larger.
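The gap-sentence selection described above can be sketched roughly as follows. This is a simplified illustration, not Google's actual implementation: the real PEGASUS pipeline uses more elaborate ROUGE variants and selection strategies, and the tokenization here is deliberately naive. All function names and the `mask_ratio` parameter are illustrative assumptions.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap between two token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, mask_ratio=0.3):
    """Score each sentence by ROUGE-1 against the rest of the document
    and return the indices of the top-scoring ("important") sentences,
    which would then be masked as the pre-training targets."""
    n_mask = max(1, int(len(sentences) * mask_ratio))
    scores = []
    for i, sent in enumerate(sentences):
        rest = [tok for j, s in enumerate(sentences) if j != i
                for tok in s.split()]
        scores.append((rouge1_f1(sent.split(), rest), i))
    top = sorted(scores, reverse=True)[:n_mask]
    return sorted(i for _, i in top)

doc = [
    "The frigates sailed from Portsmouth on Tuesday.",
    "Four frigates joined the exercise in the North Sea.",
    "Weather conditions delayed the exercise by a day.",
]
print(select_gap_sentences(doc, mask_ratio=0.34))
```

In this toy document the second sentence shares the most vocabulary with the rest of the text, so it is the one selected for masking; during pre-training the model would have to regenerate it from the remaining sentences.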
This "sample efficiency" greatly improves the practicality of text summarization models, because it sharply reduces the scale and cost of supervised data collection.

In addition to machine-computed ROUGE scores, Google also ran a "Turing test" on the summaries: model-generated summaries and human-written summaries were shown side by side for raters to evaluate. Experiments on three different datasets show that human raters sometimes prefer the machine-generated summaries.

Of course, the PEGASUS model is not without shortcomings, and Google found a failure case. The authors examined a passage in the XSum dataset that mentioned four British frigates by name. Although the number "4" never appeared in the passage, PEGASUS still correctly inferred the number of frigates. The model remained correct as the number of ships varied from 2 to 5, but when the count increased to 6, PEGASUS mistakenly reported 7. This shows that the model's "symbolic reasoning" ability is limited.

Finally, to support ongoing research and ensure reproducibility, Google has released PEGASUS's code, model checkpoints, and summarization datasets on GitHub.

Portal
Blog address:
Paper address:
Code address: