Some of our thoughts and attempts on end-to-end speech translation

Introduction: As of 2019, there are more than 200 countries and regions in the world and as many as 7,000 languages in use, thousands of which are endangered or have no written form. Language barriers are often a major obstacle to political, economic, and cultural exchanges between different regions.

Fortunately, machine translation technology has developed rapidly in recent years. Especially since the Transformer model was proposed in 2017, neural machine translation has received growing attention and has been adopted by major commercial translation systems, greatly reducing the inconvenience caused by language barriers and promoting communication between people. At the same time, with the development of the Internet, the information people consume every day is no longer limited to text; audio and video have also become major carriers of information. How to translate spoken content into text in different languages has therefore become a pressing problem.

Speech Translation Overview

Speech Translation (ST) is the translation of speech in one language into text in another language. It has many important and interesting applications, such as:

Automatic video subtitling

Conference interpretation

Smart translation hardware

Today, practical commercial speech translation systems are built by chaining an Automatic Speech Recognition (ASR) system and a Machine Translation (MT) system. Speech translation quality has improved to some extent as speech recognition and machine translation technology have advanced. However, such cascaded systems suffer from error accumulation: speech recognition errors propagate directly into the machine translation output. To address this problem, and following the success of sequence-to-sequence modeling in machine translation and speech recognition in recent years [1-4], researchers have begun to explore end-to-end speech translation, which translates audio directly into text.

Speech Translation Modeling Method

Cascade Speech Translation

A cascaded speech translation system uses a speech recognition module to transcribe audio into text, and then a machine translation module to translate that text into another language. The advantage of this approach is that each module can be optimized to the fullest using large-scale speech recognition and machine translation data. However, recognized text has the following characteristics and common errors:

  • No capitalization or punctuation information
  • Spoken-language phenomena, such as filler words and repetition
  • Recognition errors, such as homophone confusions, dropped words, etc.

Therefore, in practical applications, the recognized text still needs some post-processing before it can be fed into the translation module; for example (a toy sketch of such a chain follows this list):

  • Spoken-language smoothing: identify and remove repetition, redundancy, and other disfluencies in the recognized text;
  • Inverse Text Normalization (ITN): convert recognized text into written form (e.g., converting spelled-out numbers to digits);
  • Rewriting/correction: rewrite or correct the text according to contextual semantics to improve translation accuracy;
  • Punctuation restoration and capitalization normalization
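As a toy illustration of such a post-processing chain, here is a minimal Python sketch; the filler list, the number map, and the final capitalization step are illustrative placeholders rather than the modules of a production system:

```python
import re

FILLERS = {"uh", "um", "er"}

def remove_disfluencies(text: str) -> str:
    """Toy spoken-language smoothing: drop filler words and immediate repetitions."""
    words = [w for w in text.split() if w.lower() not in FILLERS]
    deduped = [w for i, w in enumerate(words)
               if i == 0 or w.lower() != words[i - 1].lower()]
    return " ".join(deduped)

def inverse_text_normalize(text: str) -> str:
    """Toy ITN: rewrite a couple of spelled-out numbers into digits."""
    number_map = {"twenty one": "21", "one hundred": "100"}
    for spoken, written in number_map.items():
        text = re.sub(spoken, written, text, flags=re.IGNORECASE)
    return text

def postprocess_asr(text: str) -> str:
    """Chain the toy modules; a real system would also restore punctuation and casing."""
    text = remove_disfluencies(text)
    text = inverse_text_normalize(text)
    return text.capitalize()  # stand-in for real punctuation/casing restoration

print(postprocess_asr("um i i bought twenty one apples"))  # -> "I bought 21 apples"
```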

As you can see, a cascaded system lets us insert various optimization and processing modules, and each module can be optimized or replaced independently. However, every module introduced for normalization or error correction may itself introduce new errors. The biggest challenges for cascaded speech translation are therefore error propagation and high computational complexity.

End-to-end speech translation

End-to-end speech translation uses a single, unified model that translates speech directly into target-language text. It has been made possible by the development of the "encoder-decoder" framework, in particular its application to machine translation [1] and speech recognition [2-4]. Compared with the cascaded approach, the end-to-end model can alleviate error propagation and simplify model deployment.

At present, the most commonly used end-to-end speech translation model is still based on the Transformer, described below.

Its "encoder-decoder" backbone structure is the standard Transformer, and the only difference from Transformer-based neural machine translation is that the word vector at the input end becomes an audio representation. As we know, after the audio file is read into the computer program, it is represented as a series of discrete sampling points (i.e., the amplitude of the sound vibration in the medium). Assuming that the audio sampling rate is 16,000 (i.e., 16,000 sampling points per second), even if it is only a few seconds of audio, the sequence read into the program will be very long. Therefore, audio feature extraction is required before it is officially input into the Transformer. The following figure shows the two most commonly used audio feature extraction methods in the end-to-end speech translation model:

  1. Acoustic features: Traditional acoustic features such as Mel-frequency cepstral coefficients (MFCC) or log-Mel filterbank features (FBank) are extracted first. The result is a matrix of "number of audio frames × number of acoustic features", where the frame axis can be regarded as the temporal dimension; the number of frames is still considerably larger than the number of words in the corresponding text, so several convolutional layers are usually stacked after FBank/MFCC extraction to further extract and downsample the features (see the sketch after this list).
  2. Unsupervised pre-trained models: Unsupervised speech pre-training has been a hot research direction over the past two years. Instead of extracting hand-crafted acoustic features, a deep neural network is trained directly on large-scale audio data to produce speech representations. Experiments on multiple downstream speech tasks have shown that pre-trained speech representations outperform traditional acoustic features [5]. In the classic wav2vec 2.0 [6], for example, the audio signal is first downsampled by a 7-layer convolutional network and then passed through several Transformer blocks to obtain audio representations with contextual information.
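As a concrete illustration of both approaches, here is a small sketch that extracts FBank features with torchaudio and contextual representations with a wav2vec 2.0 checkpoint from HuggingFace; the file path and the `facebook/wav2vec2-base` model name are placeholder choices, and real ST systems may preprocess differently:

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path; assumed 16 kHz mono

# (1) Traditional acoustic features: 80-dim log-Mel filterbank (FBank),
#     giving a matrix of shape [num_frames, 80]; a few strided conv layers
#     would then downsample the frame axis before the Transformer encoder.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sample_rate,
)
print("FBank:", fbank.shape)

# (2) Unsupervised pre-trained representations: wav2vec 2.0 downsamples the raw
#     waveform with a convolutional feature extractor and contextualizes it with
#     Transformer blocks, yielding [1, num_frames', hidden_size].
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=sample_rate,
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print("wav2vec 2.0:", hidden.shape)
```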

The potential and challenges of end-to-end speech translation

The end-to-end modeling approach has greater potential than traditional cascaded speech translation, as shown by the following derivation, where X denotes the audio input and S and T denote the speech recognition result and the translation result, respectively [7]:
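Written out, the chain of approximations that Formulas (1)-(4) refer to is roughly the following (a reconstruction based on the descriptions below, with $\hat{S}$ denoting the top-1 ASR hypothesis):

```latex
\begin{align}
& P(T \mid X) \tag{1}\\
&= \sum_{S} P(T \mid S, X)\, P(S \mid X) \tag{2}\\
&\approx \sum_{S} P(T \mid S)\, P(S \mid X) \tag{3}\\
&\approx P\bigl(T \mid \hat{S}\bigr), \qquad \hat{S} = \operatorname*{arg\,max}_{S} P(S \mid X) \tag{4}
\end{align}
```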

  • Formula (1): the end-to-end speech translation model, which generates the translation T directly from the audio X;
  • Formula (2): a new variable S is introduced; this is the exact expansion of (1) by conditional probability;
  • Formula (3): a text translation model is used to approximate P(T | S, X). This step clearly loses information, because the original audio input is ignored: the translation model can no longer capture the speaker's tone, emotion, or attitude, which may lead to ambiguity;
  • Formula (4): the cascaded speech translation model, which simply takes the top-1 output of the speech recognition model and passes it to the machine translation model. This brings us back to the shortcomings of the cascade discussed above: first, the output of speech recognition does not match what machine translation expects (colloquial style, missing punctuation, even domain mismatch); second, error propagation, which is aggravated in commercial systems by additional modules such as spoken-language smoothing and punctuation restoration that accumulate further prediction errors and increase system complexity.

Therefore, in terms of both model complexity and translation quality, end-to-end modeling has greater potential.

In practice, the cascaded model can also be improved, for example by making the machine translation model more robust, by feeding it the top-K ASR outputs (e.g., a lattice), or by incorporating more audio information into the punctuation and translation modules. We do not discuss these options in detail here, because a fair comparison would require large-scale, well-matched training data, and the data currently available for end-to-end speech translation is not sufficient for that.

If end-to-end speech translation has great potential, why are current commercial speech translation systems still cascaded?

This is related to the biggest shortcoming of the end-to-end approach: scarce data resources.

Take the open-source data used in academia as an example. Machine translation research commonly uses WMT data; WMT21 En-De alone contains more than 40 million parallel sentence pairs, and together with OpenSubtitles (film and TV subtitles), CCMatrix (mined from CommonCrawl) [8], and others, hundreds of millions of parallel sentence pairs can be accumulated for En-De. For speech recognition, the GigaSpeech dataset released in 2021 contains 10,000 hours of annotated English audio. For end-to-end speech translation, the most commonly used dataset is MuST-C [9], whose En-De portion contains 400 hours of audio with the corresponding 250,000 sentences of transcription and translation. The data scale is far smaller than that of machine translation and speech recognition.

The main reason is that building a speech translation dataset is complicated and costly. One must first find a data source that satisfies several requirements: audio whose content is public or properly licensed, together with the corresponding transcription and translation. The audio, transcription, and translation then need to be segmented, aligned, and filtered. After this series of steps, the amount of valid data obtained is usually not large. For industry, annotating thousands or tens of thousands of hours of speech translation data also consumes substantial manpower, money, and time.

To address this, researchers have proposed many methods to improve end-to-end speech translation, such as making more effective use of large-scale speech recognition and machine translation data, introducing pre-trained models, and redesigning encoders and decoders. We have also accumulated a series of works in this direction.

Some exploration and experimentation

We try to leverage data from speech recognition and machine translation to enhance end-to-end speech translation, focusing on three aspects: more efficient encoders and decoders, training techniques and strategies, and data augmentation.

LUT (AAAI 2021): Listen, Understand, and Translate

Paper address: https://ojs.aaai.org/index.php/AAAI/article/view/17509

Further reading: https://mp.weixin.qq.com/s/D0BnXHh1w0AuCBBhv0nFBQ

The article argues that existing Transformer-based end-to-end speech translation models have two shortcomings:

  • a single encoder struggles to handle both audio signal analysis and semantic understanding;
  • the transcriptions that come with ASR data cannot be exploited.

To address these, the article introduces two encoders: an acoustic encoder and a semantic encoder. The acoustic encoder is responsible for parsing the audio signal, and its output is matched against the representation of the transcribed text, which is where ASR supervision comes in. The semantic encoder then takes the acoustic encoder's output and performs semantic understanding.
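A highly simplified sketch of this two-encoder idea is given below (not the authors' implementation: the layer counts, mean pooling, and the MSE matching loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """Acoustic encoder parses the audio; its output is matched to a transcript
    representation (where ASR supervision is used); a semantic encoder then
    models the meaning and feeds the decoder."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        acoustic_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        semantic_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(acoustic_layer, num_layers=4)
        self.semantic_encoder = nn.TransformerEncoder(semantic_layer, num_layers=4)

    def forward(self, audio_feats, text_repr=None):
        acoustic_out = self.acoustic_encoder(audio_feats)    # [B, T_audio, d]
        semantic_out = self.semantic_encoder(acoustic_out)   # fed to the decoder
        match_loss = None
        if text_repr is not None:
            # Match sequence-level summaries of the acoustic output and the
            # transcript representation (a simplification of the paper's loss).
            match_loss = nn.functional.mse_loss(
                acoustic_out.mean(dim=1), text_repr.mean(dim=1))
        return semantic_out, match_loss
```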

COSTT (AAAI 2021): Simultaneous speech recognition and translation

Paper address: https://ojs.aaai.org/index.php/AAAI/article/view/17508

Further reading: https://mp.weixin.qq.com/s/Af6p1jVlkePrIZmUrjIaNw

Although end-to-end speech translation takes audio directly as input, cross-modal translation from speech to text is harder than text-to-text translation. Observing that human interpreters doing consecutive or simultaneous interpretation usually jot down key source-language words to help with translation, the article proposes a "consecutive prediction" scheme for the decoder of the sequence-to-sequence model: the decoder first predicts the transcription of the original audio and then continues to predict the translation. In this way, the decoder-side self-attention can "reference" the transcribed content when generating the translation. Moreover, the decoder on its own is effectively a bilingual language model, so it can be pre-trained on text translation parallel corpora, which helps alleviate the scarcity of speech translation training data.
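The decoder-side target in this scheme can be pictured as the transcript concatenated with the translation, as in this hypothetical sketch (token names and the separator symbol are illustrative; the actual vocabulary and segmentation follow the paper):

```python
def build_costt_target(transcript_tokens, translation_tokens,
                       sep="<sep>", eos="</s>"):
    """The decoder first predicts the source transcript, then continues with the
    translation, so self-attention over the generated prefix lets the translation
    'reference' the transcript."""
    return transcript_tokens + [sep] + translation_tokens + [eos]

target = build_costt_target(
    ["ich", "liebe", "musik"],   # ASR transcript of the audio
    ["i", "love", "music"],      # translation
)
print(target)  # ['ich', 'liebe', 'musik', '<sep>', 'i', 'love', 'music', '</s>']
```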

Chimera (ACL 2021): Unifying speech and text

Paper address: https://aclanthology.org/2021.findings-acl.195/

Further reading: https://mp.weixin.qq.com/s/G_sqv9kAebm-PvIcu1hGHQ

Many of us have had the experience that listening to songs with a strong beat and lyrics while working sharply reduces our efficiency. Cognitive neuroscience offers a relevant explanation: after sound and text signals reach the brain, they share part of their processing pathways. The Chimera model proposed in the article builds on this idea. Speech or text input is first encoded by its own acoustic or text encoder, and the genuinely useful semantic information is then extracted through a set of shared "memory" elements. The model does not distinguish at this stage whether the semantic information comes from audio or text, so it learns a shared semantic space that models audio and text simultaneously. In addition, the text-to-translation path can be trained with extra text translation data, further alleviating the shortage of speech translation corpora.
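A rough sketch of the shared-memory idea, in which a fixed set of learned memory vectors queries either encoder's output through cross-attention (the number of memory elements, the dimensions, and the single attention layer are simplifying assumptions):

```python
import torch
import torch.nn as nn

class SharedSemanticMemory(nn.Module):
    """A fixed set of learned 'memory' vectors queries the modality-specific encoder
    output via cross-attention, producing a modality-agnostic semantic representation."""

    def __init__(self, num_memory=64, d_model=256, nhead=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_memory, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, encoder_out):                               # [B, T, d] from speech OR text encoder
        batch = encoder_out.size(0)
        queries = self.memory.unsqueeze(0).expand(batch, -1, -1)  # [B, M, d]
        semantic, _ = self.cross_attn(queries, encoder_out, encoder_out)
        return semantic                                           # [B, M, d], same shape for both modalities

# The same module is applied to the speech encoder output and the text encoder
# output, so downstream layers see a shared semantic space.
```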

XSTNet (InterSpeech 2021): A Progressive Multi-Task Learning Framework

Paper address: https://www.isca-speech.org/archive/interspeech_2021/ye21_interspeech.html

To make better use of speech recognition, text translation, and speech translation data, the article designs a single model that handles all three tasks. The encoder accepts both text and audio input, with all encoder parameters shared. During decoding, the language of the sentence to be generated serves as the start-of-sentence tag: if it matches the language of the audio, the task is recognition; otherwise it is translation. The article also proposes progressive training: first pre-train the whole network on text translation data, then gradually add the speech recognition and speech translation tasks and fine-tune them together. Experiments show that this training scheme outperforms fine-tuning on the speech translation task alone.
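The task switch can be pictured as the choice of the language tag that starts the target sequence, as in this hypothetical sketch (the tag format is illustrative; the actual tag vocabulary and the progressive training schedule follow the paper):

```python
def build_target(tokens, target_lang):
    """Prefix the target with a language tag: if the tag matches the audio's language
    the task is speech recognition, otherwise it is speech translation."""
    return [f"<lang:{target_lang}>"] + tokens + ["</s>"]

# English audio, English target -> speech recognition
print(build_target(["i", "love", "music"], "en"))
# English audio, German target -> speech translation
print(build_target(["ich", "liebe", "musik"], "de"))
```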

IWSLT 2021 Offline Speech Translation Evaluation System

Paper address: https://aclanthology.org/2021.iwslt-1.6

The article attempts to explore the upper limit of the end-to-end system's capability. By introducing more speech recognition and machine translation data and combining multi-task learning, pseudo-labeling, model ensembling, and other techniques, it improves end-to-end speech translation by nearly 8 BLEU and further narrows the gap with the cascaded system.

NeurST (ACL 2021 Demo): End-to-end speech translation toolkit and experimental benchmark

Paper address: https://aclanthology.org/2021.acl-demo.7/

Project address: https://github.com/bytedance/neurst

The article introduces an end-to-end speech translation toolkit whose modular design makes it easy to plug in and modify data preprocessing components, encoders, decoder structures, and so on. It also provides data preprocessing, training, and inference scripts for standard speech translation datasets such as libri-trans and MuST-C, along with benchmark results.

STEMM (ACL 2022): Cross-modal hybrid training alleviates the modality gap

Paper address: https://aclanthology.org/2022.acl-long.486/

Recent work has tried to introduce more text translation data to alleviate the scarcity of end-to-end speech translation data. However, speech and text representations are inconsistent, a problem the article calls the modality gap, which makes it hard for the model to learn knowledge useful for speech translation from text translation data. To use text translation data more effectively, the article proposes randomly replacing part of the speech representation with the corresponding text representation during training, producing a sequence that mixes speech and text representations so that the model can learn a semantic space shared across the two modalities. Multi-task learning is then used to make the translation generated from the original audio closer to the translation generated from the mixed representation, improving the final speech translation quality.
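A toy sketch of the mixing idea; it assumes a word-to-frame alignment is already available and mixes at the word level, whereas the mixing granularity and probability schedule in the paper may differ:

```python
import random
import torch

def mix_speech_and_text(speech_frames, text_embeds, word_spans, p_replace=0.3):
    """Randomly replace the speech frames of some words with the corresponding
    text embedding, yielding a mixed speech/text representation sequence.

    speech_frames: [T, d] frame-level speech representations
    text_embeds:   [N, d] word-level text embeddings
    word_spans:    list of (start_frame, end_frame) for each of the N words
    """
    mixed = []
    for i, (start, end) in enumerate(word_spans):
        if random.random() < p_replace:
            mixed.append(text_embeds[i].unsqueeze(0))   # text replaces this word's frames
        else:
            mixed.append(speech_frames[start:end])      # keep the original speech frames
    return torch.cat(mixed, dim=0)
```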

ConST (NAACL 2022): Contrastive Learning to Solve the Modality Gap

Paper address: https://arxiv.org/abs/2205.02444

Building on XSTNet, this article investigates the modality gap. It argues that, under a multi-task learning framework, the speech representation and the text representation of the same sentence should lie close together in the semantic space. To this end, the article introduces a contrastive learning loss that pulls the speech and text representations of the same sentence closer, so that additional text translation data can be used more effectively to improve speech translation.
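The contrastive term can be sketched as an InfoNCE-style loss that pulls together the pooled speech and text representations of the same sentence and pushes apart other sentences in the batch (mean pooling and the temperature value here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_repr, text_repr, temperature=0.05):
    """InfoNCE-style loss: for each sentence, its own text representation is the
    positive and the other sentences in the batch are negatives.

    speech_repr, text_repr: [B, d] pooled (e.g. mean over time) representations.
    """
    speech = F.normalize(speech_repr, dim=-1)
    text = F.normalize(text_repr, dim=-1)
    logits = speech @ text.t() / temperature          # [B, B] cosine similarities
    targets = torch.arange(speech.size(0), device=speech.device)
    return F.cross_entropy(logits, targets)
```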

MOSST (ACL 2022): End-to-end simultaneous interpretation based on word segmentation

Paper address: https://aclanthology.org/2022.acl-long.50/

Streaming speech translation requires translating real-time speech input into text on the fly. Traditional end-to-end systems generally use a fixed input duration to decide when to read and when to write. This has two problems: for long utterances, it cannot guarantee that each input segment is complete, which hurts translation quality; for short utterances, it cannot stop reading early, which increases latency. To better decide the read/write timing, the article introduces a monotonic segmentation module that detects boundaries in the audio stream and enables dynamic reading and writing. Experiments show that the new module surpasses previous streaming speech translation models in both latency and quality, while also allowing the model to perform well in non-streaming scenarios.
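The dynamic read/write policy can be pictured as the following simplified loop, where `detect_boundary` stands in for the monotonic segmentation module and `translate_segment` for the translation model; both names are hypothetical placeholders:

```python
def streaming_translate(audio_stream, detect_boundary, translate_segment):
    """Read audio chunks until the segmentation module signals a boundary,
    then write out the translation of the buffered segment (dynamic read/write)."""
    buffer, outputs = [], []
    for chunk in audio_stream:                        # READ: consume incoming audio
        buffer.append(chunk)
        if detect_boundary(buffer):                   # boundary found by the segmenter
            outputs.append(translate_segment(buffer)) # WRITE: emit the translation
            buffer = []
    if buffer:                                        # flush any trailing audio
        outputs.append(translate_segment(buffer))
    return outputs
```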

Conclusion

We believe that end-to-end speech translation is not only less complex than the cascaded system but also has greater potential in terms of quality. Building on the explorations above, we combined data augmentation, multi-task learning, pre-training, and other methods to build Chinese-English and English-Chinese end-to-end speech translation systems, which perform well on everyday conversational translation. We also used LightSeq [10] to speed up model inference; the service response time is more than 70% faster than the cascaded system. The system can currently be experienced in the "Volcano Translation Mini Program - Audio Translation".
