1. Introduction

In September 2016, Google released a neural-network-based translation system (GNMT), claiming that GNMT reduces translation errors by 55%–85% on several major language pairs, and published the technical details in the paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", from which I benefited a lot.

2. Overview

Generally speaking, an NMT (neural machine translation) system contains two RNNs (recurrent neural networks), one for receiving the input text and the other for generating the target sentence. The currently popular attention mechanism is also introduced to make the system more accurate and efficient when processing long sentences. However, Google believes that such neural network systems usually have three weaknesses: training and inference are slow, rare words are handled poorly, and the system sometimes fails to translate the entire source sentence (words are dropped or left untranslated).
GNMT is committed to solving these three problems. In GNMT, each RNN is an 8-layer network (counting its two directions separately, the bidirectional bottom layer brings the encoder to 9 LSTMs). Residual connections help information such as gradients and position information flow through the deep stack. Meanwhile, the attention layer connects the bottom layer of the decoder to the top layer of the encoder, as shown in the figure:

[Figure: GNMT structure diagram]
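To make this wiring concrete, here is a minimal NumPy sketch of how one attention context vector is computed from a decoder state and the encoder's top-layer outputs (the formulas appear in section 3 below). The function names are mine, and score_fn stands in for the small feed-forward network the paper calls AttentionFunction:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, score_fn):
    """One attention step: scalar scores -> softmax weights -> weighted sum.

    decoder_state:  previous decoder output y_{i-1}, shape (d,)
    encoder_states: encoder top-layer outputs x_1..x_M, shape (M, d)
    score_fn:       callable returning a scalar score for a (y, x) pair
    """
    # s_t = AttentionFunction(y_{i-1}, x_t): one scalar per source position
    scores = np.array([score_fn(decoder_state, x_t) for x_t in encoder_states])
    # p_t: softmax over the M source positions (M scalars summing to 1)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # a_i = sum_t p_t * x_t: a weighted sum of vectors, hence itself a vector
    return weights @ encoder_states

# Example with a toy dot-product score function:
# context = attention_context(y_prev, X, lambda y, x: float(y @ x))
```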
After making so many improvements, Google claims that the error rate on multiple language pairs such as English-French, English-Chinese, and English-Spanish has been reduced by about 60% compared with the previous PBMT system, approaching the average level of human translators. Next, let's take a closer look at the details of this remarkable GNMT model.

3. Model Structure

As shown in the figure above, GNMT has the same three components as the usual models: an encoder, a decoder, and an attention network. The encoder converts the input sentence into a sequence of vectors, one vector per word of the source sentence, and the decoder uses these vectors together with the words it has already generated to produce the next word. The encoder and decoder are connected through the attention network, which allows the decoder to focus on different parts of the source sentence while generating target words. In addition, as expected, for the translation system to reach good accuracy, the RNNs of the encoder and decoder must be deep enough to capture details of the source and target sentences that are not easily noticed. In Google's experiments, each additional layer reduces perplexity (PPL) by about 10%.

Regarding the attention mechanism in the model, the following formulas are used:

s_t = AttentionFunction(y_{i-1}, x_t),  1 <= t <= M
p_t = exp(s_t) / Σ_{t'=1}^{M} exp(s_{t'})
a_i = Σ_{t=1}^{M} p_t · x_t

Actually, I have a question here: is the a_i above a vector or a scalar (a single value)? It seems to be a scalar, but I previously understood attention to be a vector whose length equals the number of words in the input sentence.

3.1 Residual Connections

As mentioned above, multi-layer stacked LSTM networks usually perform better than networks with fewer layers. However, naive stacking makes training slow and susceptible to exploding or vanishing gradients. In experiments, naive stacking works well with 4 layers, rarely works well with 6, and almost never with 8. To solve this problem, residual connections are introduced into the model, as shown in the figures: the input of the i-th LSTM layer is added to that layer's output, and the sum is used as the input of the (i+1)-th LSTM layer.

[Figure: LSTM stacks without and with residual connections]

3.2 A Bidirectional LSTM as the Encoder's First Layer

The keywords needed to translate a word may appear anywhere in the source text; the relevant information may flow from right to left, or be scattered across distant positions. To capture more comprehensive information from the source text, a bidirectional RNN is a good choice. In this model, a bidirectional RNN is used only in the first layer of the encoder; the remaining layers are unidirectional.

[Figure: bidirectional RNN in the encoder's first layer]

As the figure shows, the pink LSTM processes the sentence from left to right while the green LSTM processes it from right to left; their outputs are concatenated and then passed to the LSTM of the next layer.
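To illustrate sections 3.1 and 3.2 together, here is a minimal PyTorch-style sketch (my own illustration, not code from the paper) of an encoder whose first layer is bidirectional and whose remaining layers are unidirectional with residual connections:

```python
import torch
import torch.nn as nn

class GNMTStyleEncoder(nn.Module):
    """Sketch: bidirectional bottom layer, then unidirectional layers
    whose inputs and outputs are summed (residual connections)."""

    def __init__(self, hidden_size, num_layers=8):
        super().__init__()
        # Each direction outputs hidden_size // 2 units so that the
        # concatenated bidirectional output matches hidden_size.
        self.bottom = nn.LSTM(hidden_size, hidden_size // 2,
                              batch_first=True, bidirectional=True)
        self.layers = nn.ModuleList(
            nn.LSTM(hidden_size, hidden_size, batch_first=True)
            for _ in range(num_layers - 1)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size) word embeddings
        x, _ = self.bottom(x)   # forward/backward outputs, concatenated
        for lstm in self.layers:
            out, _ = lstm(x)
            x = out + x         # residual: layer input + layer output
        return x
```

The element-wise sum requires all layers to share the same width, which is why each direction of the bottom layer outputs hidden_size // 2 units here.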
3.3 Parallel Training of the Model

This section mainly introduces methods used to accelerate training. Google uses both data parallelism and model parallelism. Data parallelism is very direct: n replicas of the model are deployed, all sharing the same parameters, and each replica trains on batches of batch-size sentences at a time. In Google's experiments, n is usually 10 and batch-size is usually 128, and the parameters are updated using a combination of Adam and SGD.

In addition to data parallelism, model parallelism is also used: each layer of the network is deployed on its own GPU, as shown in the structure diagram above. This way, once the encoder's first bidirectional layer has finished a time step, the next layer can start computing without waiting for the whole time step to complete everywhere, and this parallel computation speeds up training. The reason a bidirectional RNN is not used in every layer is that doing so would greatly reduce training speed: such a layer can use only two GPUs, one processing the forward direction and one the backward direction, which reduces the efficiency of parallel computation for the other layers. In the attention part, the top layer of the encoder is aligned with the bottom layer of the decoder (my understanding is that the dimensions of the output tensors are kept consistent) to maximize the efficiency of parallel computation. (I don't yet understand the specific principle.)

4. Data Preprocessing

In operation, a neural machine translation system has a dictionary with a limited number of words but may encounter an unlimited number of words, which causes the OOV (out-of-vocabulary) problem. Since these unknown words are usually dates, person names, place names, and so on, a simple method is to copy these words directly into the output. Obviously, this is not the best solution for words that are not names. Google's model uses the better wordpiece model (WPM), also called sub-word units. For example, the sentence "Turing's major is NLP ." might be segmented by the wordpiece model into something like "_Turing _'s _major _is _N LP _.", where rare words are split into smaller units from a learned sub-word vocabulary. In addition, so that words such as names can be copied directly, the source and target languages share the same wordpiece model. WPM strikes a good balance between the flexibility of characters and the efficiency of words, and also achieves better accuracy (BLEU) and faster translation speed.

5. Training Criteria

Generally speaking, given N sentence pairs, the training goal is to maximize the log probabilities of the ground-truth outputs given the corresponding inputs:

O_ML(θ) = Σ_{i=1}^{N} log P_θ(Y^(i) | X^(i))

But there is a problem here: this objective does not reflect any reward or penalty for the translation quality of a single sentence. Furthermore, because the model never sees incorrect translations during training, incorrect sentences can still receive high probabilities even when the model reaches a high BLEU score, so the formula above cannot clearly penalize incorrect translations. (I don't fully understand this part of the original paper and have doubts.) Therefore, the model needs to be further refined. However, BLEU is an evaluation criterion defined over two corpora, and it does not work well for evaluating single sentences. Google therefore proposed the GLEU score. Roughly, GLEU collects the n-grams (n = 1, 2, 3, 4) of the target sentence and of the translated sentence, computes the ratio of the size of the intersection of the two multisets to the size of each original multiset, and takes the smaller of the two values. I used Python to compute the GLEU score; the code is as follows.
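A straightforward sentence-level implementation of this definition, treating the collected n-grams as multisets and taking the smaller of precision and recall (a minimal sketch of the computation described above):

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Multiset of all n-grams of the token list for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def gleu(target, translation, max_n=4):
    """Sentence-level GLEU: min(precision, recall) over matching n-grams."""
    target_counts = ngram_counts(target, max_n)
    trans_counts = ngram_counts(translation, max_n)
    if not target_counts or not trans_counts:
        return 0.0
    # Multiset intersection: n-grams occurring in both sentences.
    overlap = sum((target_counts & trans_counts).values())
    precision = overlap / sum(trans_counts.values())
    recall = overlap / sum(target_counts.values())
    return min(precision, recall)

# Example:
# gleu("the cat sat on the mat".split(), "the cat sat on mat".split())
```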
Therefore, after refinement, the training criterion becomes the following:

O_RL(θ) = Σ_{i=1}^{N} Σ_{Y} P_θ(Y | X^(i)) · r(Y, Y^(i))

where r(Y, Y^(i)) is the per-sentence score, computed with GLEU. GLEU overcomes BLEU's shortcomings in single-sentence evaluation, and in this experiment it works well alongside the BLEU score. To further stabilize training, Google uses a linear combination of the two criteria:

O_Mixed(θ) = α · O_ML(θ) + O_RL(θ)

with α set to 0.017 during training. In the actual training process, the model is first trained to convergence with the O_ML criterion, and then the O_Mixed criterion is used to further improve its performance.

6. Quantizable Model and Quantized Inference

(To be honest, I don't know how to translate the original "Quantizable Model and Quantized Inference" better.) This part mainly explains that, because the model is deep and the amount of computation is large, problems arise during translation, so Google took a series of optimization measures that affect neither model convergence nor translation quality. In the LSTM network with residual connections, two kinds of values are continuously passed along and accumulated: c_t^i in the time direction and x_t^i in the depth direction. Experiments show that these values are very small, and to reduce the accumulation of errors they are explicitly restricted to [-δ, δ] during translation. The original LSTM equations are therefore adjusted as follows (equation 6.1):

c'_t^i = max(-δ, min(δ, c_t^i))
x'_t^i = max(-δ, min(δ, x_t^i))

with the complete LSTM computation (equation 6.2) carried out on these clipped values.

During translation, Google replaces all floating-point operations in equations 6.1 and 6.2 with 8-bit or 16-bit fixed-point integer operations, and the weight matrices W are represented as 8-bit integers. All c_t^i and x_t^i are restricted to [-δ, δ] and represented as 16-bit integers. The matrix multiplications in 6.2 (e.g., W_1 · x_t) are replaced with 8-bit fixed-point integer multiplications, while all other operations, such as sigmoid, tanh, element-wise products, and additions, are replaced with 16-bit integer operations. Let y_t be the output of the decoder RNN; in the softmax layer, the probability vector p_t is then computed as:

v'_t = max(-γ, min(γ, W_s · y_t))
p_t = softmax(v'_t)

The logits v'_t are restricted to [-γ, γ], and the weight W_s, like the weights in equation 6.2, is represented by 8-bit integers and multiplied with 8-bit matrix multiplication. Beyond this, no quantization measures are taken in the softmax layer and the attention layer. It is worth mentioning that, apart from restricting c_t^i and x_t^i to [-δ, δ] and the logits v'_t to [-γ, γ], full-precision floating point is used throughout training. γ is 25.0, and δ starts at 8.0 and is gradually annealed to 1.0 during training (during translation, δ is 1.0).

[Figure: log perplexity vs. training steps]

In the figure, the red line represents training with the clipping constraints and the blue line represents normal training. It can be seen that restricting some values to a limited range acts as additional regularization and can even improve model quality.
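Before moving on, here is a rough illustration of the quantization idea in this section (my own sketch of a common scale-based scheme, not necessarily Google's exact encoding):

```python
import numpy as np

GAMMA = 25.0  # clipping range for the softmax logits
DELTA = 1.0   # clipping range for c_t / x_t at translation time

def clip(x, bound):
    """Explicitly restrict values to [-bound, bound] (equation 6.1)."""
    return np.clip(x, -bound, bound)

def quantize_weight(W):
    """Represent a float weight matrix by int8 values plus one float scale.
    A common scale-based scheme; the paper's exact encoding may differ."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def int8_matmul(W_q, scale, x):
    """Apply the quantized weights to an activation vector x: the int8
    weights are widened before the product, and the result is rescaled."""
    return (W_q.astype(np.int32) @ x) * scale
```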
7. Decoder

During translation, the conventional beam search algorithm is used, but two important refinements are introduced, governed by the α and β values in GNMT:

s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y)
lp(Y) = (5 + |Y|)^α / (5 + 1)^α
cp(X; Y) = β · Σ_{i=1}^{|X|} log( min( Σ_{j=1}^{|Y|} p_{i,j}, 1.0 ) )

Here s(Y, X) is the final score of the candidate translation, lp(Y) is the length normalization, cp(X; Y) is the coverage penalty, and p_{i,j} is the attention value of the i-th source word when the j-th target word is generated. Google also mentions two pruning optimizations in the article: "Firstly, at each step, we only consider tokens that have local scores that are not more than beamsize below the best token for this step. Secondly, after a normalized best score has been found according to equation 14, we prune all hypotheses that are more than beamsize below the best." Um, actually I don't understand how this differs from the regular beam search algorithm; I hope someone can give me some advice. A small Python transcription of the scoring formula appears at the end of this post.

[Table: effect of different α and β values on BLEU scores on the En-Fr corpus]

When α and β are 0, no length normalization or coverage penalty is applied, and the algorithm falls back to the original beam search. It is worth mentioning that the model that obtained the BLEU scores above was trained only with ML, without RL refinement; this is because RL refinement already teaches the model not to omit or over-translate words, so an RL-refined model benefits less from these penalties.

[Table: comparison of the BLEU values obtained by optimizing with ML and then with RL on the En-Fr corpus]

In Google's experiments, α = 0.2 and β = 0.2, but in our experiments α = 0.6~1 and β = 0.2~0.4 achieve better results for Chinese-English translation.

8. Experimental Process and Results

[Table: comparison of experimental results]

The model part is now basically covered. For the remaining eighth part I will post a picture of the experiment and the experimental results, and I will continue to add to it from time to time. The next article should introduce the FairSeq model released and open-sourced by Facebook, which is claimed to be better and faster than GNMT...

Methods mentioned in the paper but not used: another way to deal with the OOV problem is to mark rare words. For example, suppose the word Miki does not appear in the dictionary; after marking, it becomes <B> M <M> i <M> k <E> i, so rare words are replaced with sequences of special symbols during translation.
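As referenced in section 7 above, here is a direct Python transcription of the beam-search scoring formula; the layout of the attention matrix is my assumption:

```python
import math

def beam_score(log_prob, hyp_len, attn, alpha=0.2, beta=0.2):
    """Re-score a beam-search hypothesis with GNMT's length normalization
    lp(Y) and coverage penalty cp(X; Y).

    log_prob: accumulated log P(Y|X) of the hypothesis
    hyp_len:  |Y|, the number of target tokens
    attn:     attn[i][j] = attention value p_{i,j} of source word i when
              target word j was generated (softmax weights, all positive)
    """
    lp = (5.0 + hyp_len) ** alpha / (5.0 + 1.0) ** alpha
    cp = beta * sum(math.log(min(sum(row), 1.0)) for row in attn)
    return log_prob / lp + cp
```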