1. Introduction

In September 2016, Google released a neural-network-based translation system (GNMT), claiming that GNMT reduces translation errors by 55%–85% on several major language pairs, and published the technical details in the paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", from which I benefited a lot.

2. Overview

Generally speaking, an NMT (neural machine translation) system contains two RNNs (recurrent neural networks), one for receiving the input text and the other for generating the target sentence. The currently popular attention mechanism is also introduced to make the system more accurate and efficient when processing long sentences. However, Google believes that such neural network systems usually have three weaknesses: training and inference are slow, rare words are handled poorly, and the system sometimes fails to translate the entire source sentence (words are dropped or left untranslated).
GNMT is committed to solving these three problems. In GNMT, each RNN is an 8-layer network (counting its two directions separately, the bidirectional bottom layer brings the encoder to 9 LSTMs). Residual connections help information such as gradients and position information flow through the deep stack. Meanwhile, the attention layer connects the bottom layer of the decoder to the top layer of the encoder, as shown in the figure:

[Figure: GNMT structure diagram]
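To make this wiring concrete, here is a minimal NumPy sketch of how one attention context vector is computed from a decoder state and the encoder's top-layer outputs (the formulas appear in section 3 below). The function names are mine, and score_fn stands in for the small feed-forward network the paper calls AttentionFunction:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, score_fn):
    """One attention step: scalar scores -> softmax weights -> weighted sum.

    decoder_state:  previous decoder output y_{i-1}, shape (d,)
    encoder_states: encoder top-layer outputs x_1..x_M, shape (M, d)
    score_fn:       callable returning a scalar score for a (y, x) pair
    """
    # s_t = AttentionFunction(y_{i-1}, x_t): one scalar per source position
    scores = np.array([score_fn(decoder_state, x_t) for x_t in encoder_states])
    # p_t: softmax over the M source positions (M scalars summing to 1)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # a_i = sum_t p_t * x_t: a weighted sum of vectors, hence itself a vector
    return weights @ encoder_states

# Example with a toy dot-product score function:
# context = attention_context(y_prev, X, lambda y, x: float(y @ x))
```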
After making so many improvements, Google claims that the error rate on multiple language pairs such as English-French, English-Chinese, and English-Spanish has been reduced by about 60% compared with the previous PBMT system, approaching the average level of human translators. Next, let's take a closer look at the details of this remarkable GNMT model.

3. Model Structure

As shown in the figure above, GNMT has the same three components as the usual models: an encoder, a decoder, and an attention network. The encoder converts the input sentence into a sequence of vectors, one vector per word of the source sentence, and the decoder uses these vectors together with the words it has already generated to produce the next word. The encoder and decoder are connected through the attention network, which allows the decoder to focus on different parts of the source sentence while generating target words. In addition, as expected, for the translation system to reach good accuracy, the RNNs of the encoder and decoder must be deep enough to capture details of the source and target sentences that are not easily noticed. In Google's experiments, each additional layer reduces perplexity (PPL) by about 10%.

Regarding the attention mechanism in the model, the following formulas are used:

s_t = AttentionFunction(y_{i-1}, x_t),  1 <= t <= M
p_t = exp(s_t) / Σ_{t'=1}^{M} exp(s_{t'})
a_i = Σ_{t=1}^{M} p_t · x_t

Actually, I have a question here: is the a_i above a vector or a scalar (a single value)? It seems to be a scalar, but I previously understood attention to be a vector whose length equals the number of words in the input sentence.

3.1 Residual Connections

As mentioned above, multi-layer stacked LSTM networks usually perform better than networks with fewer layers. However, naive stacking makes training slow and susceptible to exploding or vanishing gradients. In experiments, naive stacking works well with 4 layers, rarely works well with 6, and almost never with 8. To solve this problem, residual connections are introduced into the model, as shown in the figures: the input of the i-th LSTM layer is added to that layer's output, and the sum is used as the input of the (i+1)-th LSTM layer.

[Figure: LSTM stacks without and with residual connections]

3.2 A Bidirectional LSTM as the Encoder's First Layer

The keywords needed to translate a word may appear anywhere in the source text; the relevant information may flow from right to left, or be scattered across distant positions. To capture more comprehensive information from the source text, a bidirectional RNN is a good choice. In this model, a bidirectional RNN is used only in the first layer of the encoder; the remaining layers are unidirectional.

[Figure: bidirectional RNN in the encoder's first layer]

As the figure shows, the pink LSTM processes the sentence from left to right while the green LSTM processes it from right to left; their outputs are concatenated and then passed to the LSTM of the next layer.
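To illustrate sections 3.1 and 3.2 together, here is a minimal PyTorch-style sketch (my own illustration, not code from the paper) of an encoder whose first layer is bidirectional and whose remaining layers are unidirectional with residual connections:

```python
import torch
import torch.nn as nn

class GNMTStyleEncoder(nn.Module):
    """Sketch: bidirectional bottom layer, then unidirectional layers
    whose inputs and outputs are summed (residual connections)."""

    def __init__(self, hidden_size, num_layers=8):
        super().__init__()
        # Each direction outputs hidden_size // 2 units so that the
        # concatenated bidirectional output matches hidden_size.
        self.bottom = nn.LSTM(hidden_size, hidden_size // 2,
                              batch_first=True, bidirectional=True)
        self.layers = nn.ModuleList(
            nn.LSTM(hidden_size, hidden_size, batch_first=True)
            for _ in range(num_layers - 1)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size) word embeddings
        x, _ = self.bottom(x)   # forward/backward outputs, concatenated
        for lstm in self.layers:
            out, _ = lstm(x)
            x = out + x         # residual: layer input + layer output
        return x
```

The element-wise sum requires all layers to share the same width, which is why each direction of the bottom layer outputs hidden_size // 2 units here.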
3.3 Parallel Training of the Model

This section mainly introduces methods used to accelerate training. Google uses both data parallelism and model parallelism. Data parallelism is very direct: n replicas of the model are deployed, all sharing the same parameters, and each replica trains on batches of batch-size sentences at a time. In Google's experiments, n is usually 10 and batch-size is usually 128, and the parameters are updated using a combination of Adam and SGD.

In addition to data parallelism, model parallelism is also used: each layer of the network is deployed on its own GPU, as shown in the structure diagram above. This way, once the encoder's first bidirectional layer has finished a time step, the next layer can start computing without waiting for the whole time step to complete everywhere, and this parallel computation speeds up training. The reason a bidirectional RNN is not used in every layer is that doing so would greatly reduce training speed: such a layer can use only two GPUs, one processing the forward direction and one the backward direction, which reduces the efficiency of parallel computation for the other layers. In the attention part, the top layer of the encoder is aligned with the bottom layer of the decoder (my understanding is that the dimensions of the output tensors are kept consistent) to maximize the efficiency of parallel computation. (I don't yet understand the specific principle.)

4. Data Preprocessing

In operation, a neural machine translation system has a dictionary with a limited number of words but may encounter an unlimited number of words, which causes the OOV (out-of-vocabulary) problem. Since these unknown words are usually dates, person names, place names, and so on, a simple method is to copy these words directly into the output. Obviously, this is not the best solution for words that are not names. Google's model uses the better wordpiece model (WPM), also called sub-word units. For example, the sentence "Turing's major is NLP ." might be segmented by the wordpiece model into something like "_Turing _'s _major _is _N LP _.", where rare words are split into smaller units from a learned sub-word vocabulary. In addition, so that words such as names can be copied directly, the source and target languages share the same wordpiece model. WPM strikes a good balance between the flexibility of characters and the efficiency of words, and also achieves better accuracy (BLEU) and faster translation speed.

5. Training Criteria

Generally speaking, given N sentence pairs, the training goal is to maximize the log probabilities of the ground-truth outputs given the corresponding inputs:

O_ML(θ) = Σ_{i=1}^{N} log P_θ(Y^(i) | X^(i))

But there is a problem here: this objective does not reflect any reward or penalty for the translation quality of a single sentence. Furthermore, because the model never sees incorrect translations during training, incorrect sentences can still receive high probabilities even when the model reaches a high BLEU score, so the formula above cannot clearly penalize incorrect translations. (I don't fully understand this part of the original paper and have doubts.) Therefore, the model needs to be further refined. However, BLEU is an evaluation criterion defined over two corpora, and it does not work well for evaluating single sentences. Google therefore proposed the GLEU score. Roughly, GLEU collects the n-grams (n = 1, 2, 3, 4) of the target sentence and of the translated sentence, computes the ratio of the size of the intersection of the two multisets to the size of each original multiset, and takes the smaller of the two values. I used Python to compute the GLEU score; the code is as follows.
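A straightforward sentence-level implementation of this definition, treating the collected n-grams as multisets and taking the smaller of precision and recall (a minimal sketch of the computation described above):

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Multiset of all n-grams of the token list for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def gleu(target, translation, max_n=4):
    """Sentence-level GLEU: min(precision, recall) over matching n-grams."""
    target_counts = ngram_counts(target, max_n)
    trans_counts = ngram_counts(translation, max_n)
    if not target_counts or not trans_counts:
        return 0.0
    # Multiset intersection: n-grams occurring in both sentences.
    overlap = sum((target_counts & trans_counts).values())
    precision = overlap / sum(trans_counts.values())
    recall = overlap / sum(target_counts.values())
    return min(precision, recall)

# Example:
# gleu("the cat sat on the mat".split(), "the cat sat on mat".split())
```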
Therefore, after refinement, the training criterion becomes the following:

O_RL(θ) = Σ_{i=1}^{N} Σ_{Y} P_θ(Y | X^(i)) · r(Y, Y^(i))

where r(Y, Y^(i)) is the per-sentence score, computed with GLEU. GLEU overcomes BLEU's shortcomings in single-sentence evaluation, and in this experiment it works well alongside the BLEU score. To further stabilize training, Google uses a linear combination of the two criteria:

O_Mixed(θ) = α · O_ML(θ) + O_RL(θ)

with α set to 0.017 during training. In the actual training process, the model is first trained to convergence with the O_ML criterion, and then the O_Mixed criterion is used to further improve its performance.

6. Quantizable Model and Quantized Inference

(To be honest, I don't know how to translate the original "Quantizable Model and Quantized Inference" better.) This part mainly explains that, because the model is deep and the amount of computation is large, problems arise during translation, so Google took a series of optimization measures that affect neither model convergence nor translation quality. In the LSTM network with residual connections, two kinds of values are continuously passed along and accumulated: c_t^i in the time direction and x_t^i in the depth direction. Experiments show that these values are very small, and to reduce the accumulation of errors they are explicitly restricted to [-δ, δ] during translation. The original LSTM equations are therefore adjusted as follows (equation 6.1):

c'_t^i = max(-δ, min(δ, c_t^i))
x'_t^i = max(-δ, min(δ, x_t^i))

with the complete LSTM computation (equation 6.2) carried out on these clipped values.

During translation, Google replaces all floating-point operations in equations 6.1 and 6.2 with 8-bit or 16-bit fixed-point integer operations, and the weight matrices W are represented as 8-bit integers. All c_t^i and x_t^i are restricted to [-δ, δ] and represented as 16-bit integers. The matrix multiplications in 6.2 (e.g., W_1 · x_t) are replaced with 8-bit fixed-point integer multiplications, while all other operations, such as sigmoid, tanh, element-wise products, and additions, are replaced with 16-bit integer operations. Let y_t be the output of the decoder RNN; in the softmax layer, the probability vector p_t is then computed as:

v'_t = max(-γ, min(γ, W_s · y_t))
p_t = softmax(v'_t)

The logits v'_t are restricted to [-γ, γ], and the weight W_s, like the weights in equation 6.2, is represented by 8-bit integers and multiplied with 8-bit matrix multiplication. Beyond this, no quantization measures are taken in the softmax layer and the attention layer. It is worth mentioning that, apart from restricting c_t^i and x_t^i to [-δ, δ] and the logits v'_t to [-γ, γ], full-precision floating point is used throughout training. γ is 25.0, and δ starts at 8.0 and is gradually annealed to 1.0 during training (during translation, δ is 1.0).

[Figure: log perplexity vs. training steps]

In the figure, the red line represents training with the clipping constraints and the blue line represents normal training. It can be seen that restricting some values to a limited range acts as additional regularization and can even improve model quality.
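Before moving on, here is a rough illustration of the quantization idea in this section (my own sketch of a common scale-based scheme, not necessarily Google's exact encoding):

```python
import numpy as np

GAMMA = 25.0  # clipping range for the softmax logits
DELTA = 1.0   # clipping range for c_t / x_t at translation time

def clip(x, bound):
    """Explicitly restrict values to [-bound, bound] (equation 6.1)."""
    return np.clip(x, -bound, bound)

def quantize_weight(W):
    """Represent a float weight matrix by int8 values plus one float scale.
    A common scale-based scheme; the paper's exact encoding may differ."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def int8_matmul(W_q, scale, x):
    """Apply the quantized weights to an activation vector x: the int8
    weights are widened before the product, and the result is rescaled."""
    return (W_q.astype(np.int32) @ x) * scale
```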
7. Decoder

During translation, the conventional beam search algorithm is used, but two important refinements are introduced, governed by the α and β values in GNMT:

s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y)
lp(Y) = (5 + |Y|)^α / (5 + 1)^α
cp(X; Y) = β · Σ_{i=1}^{|X|} log( min( Σ_{j=1}^{|Y|} p_{i,j}, 1.0 ) )

Here s(Y, X) is the final score of the candidate translation, lp(Y) is the length normalization, cp(X; Y) is the coverage penalty, and p_{i,j} is the attention value of the i-th source word when the j-th target word is generated. Google also mentions two pruning optimizations in the article: "Firstly, at each step, we only consider tokens that have local scores that are not more than beamsize below the best token for this step. Secondly, after a normalized best score has been found according to equation 14, we prune all hypotheses that are more than beamsize below the best." Um, actually I don't understand how this differs from the regular beam search algorithm; I hope someone can give me some advice. A small Python transcription of the scoring formula appears at the end of this post.

[Table: effect of different α and β values on BLEU scores on the En-Fr corpus]

When α and β are 0, no length normalization or coverage penalty is applied, and the algorithm falls back to the original beam search. It is worth mentioning that the model that obtained the BLEU scores above was trained only with ML, without RL refinement; this is because RL refinement already teaches the model not to omit or over-translate words, so an RL-refined model benefits less from these penalties.

[Table: comparison of the BLEU values obtained by optimizing with ML and then with RL on the En-Fr corpus]

In Google's experiments, α = 0.2 and β = 0.2, but in our experiments α = 0.6~1 and β = 0.2~0.4 achieve better results for Chinese-English translation.

8. Experimental Process and Results

[Table: comparison of experimental results]

The model part is now basically covered. For the remaining eighth part I will post a picture of the experiment and the experimental results, and I will continue to add to it from time to time. The next article should introduce the FairSeq model released and open-sourced by Facebook, which is claimed to be better and faster than GNMT...

Methods mentioned in the paper but not used: another way to deal with the OOV problem is to mark rare words. For example, suppose the word Miki does not appear in the dictionary; after marking, it becomes <B> M <M> i <M> k <E> i, so rare words are replaced with sequences of special symbols during translation.
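As referenced in section 7 above, here is a direct Python transcription of the beam-search scoring formula; the layout of the attention matrix is my assumption:

```python
import math

def beam_score(log_prob, hyp_len, attn, alpha=0.2, beta=0.2):
    """Re-score a beam-search hypothesis with GNMT's length normalization
    lp(Y) and coverage penalty cp(X; Y).

    log_prob: accumulated log P(Y|X) of the hypothesis
    hyp_len:  |Y|, the number of target tokens
    attn:     attn[i][j] = attention value p_{i,j} of source word i when
              target word j was generated (softmax weights, all positive)
    """
    lp = (5.0 + hyp_len) ** alpha / (5.0 + 1.0) ** alpha
    cp = beta * sum(math.log(min(sum(row), 1.0)) for row in attn)
    return log_prob / lp + cp
```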