Introduction

As of 2019, there are more than 200 countries and regions in the world, and some 7,000 languages in use, including thousands of endangered or unwritten languages. Language barriers are often a major obstacle to political, economic, and cultural exchange between regions. Fortunately, machine translation technology has developed rapidly in recent years. Especially since the Transformer model was proposed in 2017, neural machine translation has received growing attention and has been adopted by major commercial translation systems, greatly reducing the inconvenience caused by language barriers and promoting communication between people. At the same time, with the development of the Internet, the information people consume is no longer limited to text; audio and video have also become major carriers of information. How to translate spoken content into text in another language has therefore become a pressing problem.

Speech Translation Overview

Speech Translation (ST) is the task of translating speech in one language into text in another language. It has many important and interesting applications, such as automatic video subtitling, conference interpretation, and smart translation hardware.

Today, practical commercial speech translation systems are built by cascading an Automatic Speech Recognition (ASR) system and a Machine Translation (MT) system. The quality of speech translation has improved along with advances in speech recognition and machine translation technology. However, such cascaded systems suffer from error accumulation: speech recognition errors propagate directly into the machine translation output. To address this, and following the application of sequence-to-sequence modeling to machine translation and speech recognition in recent years [1-4], researchers have begun to explore end-to-end speech translation, which translates audio directly into text.

Speech Translation Modeling Methods

Cascade Speech Translation

A cascade speech translation system uses a speech recognition module to transcribe audio into text, and then uses a machine translation module to translate that text into another language. The advantage of this approach is that both modules can be optimized to the extreme with large-scale speech recognition and machine translation data. However, speech recognition output has its own characteristics and common error patterns, such as missing punctuation and casing, spoken disfluencies, and recognition mistakes.
Therefore, in practical applications, the recognized text still needs some post-processing, such as punctuation restoration and text normalization, before it can be fed into the translation module. A minimal sketch of such a cascade pipeline is shown below.
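The following is a toy sketch of a cascade pipeline under the assumptions above; the three stages are stubbed-out stand-ins for a real ASR model, a punctuation/normalization module, and an MT model, not APIs of any particular library.

```python
# Toy cascade speech translation pipeline: ASR -> post-processing -> MT.
# Each stage is a stub so the structure is clear and the snippet runs as-is.

def recognize(audio_path: str) -> str:
    """Stub ASR: a real system returns a lowercase, unpunctuated transcript."""
    return "hello world this is a test"

def restore_punctuation(text: str) -> str:
    """Stub post-processing: restore casing and sentence-final punctuation."""
    return text.capitalize() + "."

def translate(text: str, src: str, tgt: str) -> str:
    """Stub MT: a real system returns the target-language translation."""
    return f"[{src}->{tgt}] {text}"

def speech_to_text_cascade(audio_path: str) -> str:
    transcript = recognize(audio_path)             # 1. speech recognition
    transcript = restore_punctuation(transcript)   # 2. post-processing
    return translate(transcript, src="en", tgt="de")  # 3. machine translation

print(speech_to_text_cascade("example.wav"))
```

Each stage here could be swapped out or extended independently, which is exactly the flexibility (and the source of extra complexity) discussed next.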
As you can see, the cascade architecture allows us to insert various optimization modules, and each module can be optimized and replaced independently. However, the modules introduced for normalization and error correction may themselves introduce new errors. The biggest challenges for cascade speech translation are therefore error propagation and high computational complexity.

End-to-End Speech Translation

End-to-end speech translation uses a single unified model to translate speech directly into text. It builds on the development of the encoder-decoder framework, especially its application to machine translation [1] and speech recognition [2-4]. Compared with the cascade approach, the end-to-end model alleviates error propagation and simplifies deployment. The most commonly used end-to-end speech translation models are still based on the Transformer: the encoder-decoder backbone is a standard Transformer, and the only difference from Transformer-based neural machine translation is that the word embeddings at the input are replaced by audio representations.

After an audio file is read into a program, it is represented as a sequence of discrete sample points (the amplitude of the sound wave in the medium). With a sampling rate of 16,000 Hz (16,000 samples per second), even a few seconds of audio yields a very long sequence, so audio feature extraction is required before the signal is fed into the Transformer. The two most commonly used approaches in end-to-end speech translation models are hand-crafted spectral features such as log-Mel filterbanks, and representations produced by a self-supervised pre-trained acoustic model such as wav2vec 2.0; a small example of the former is sketched below.
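As an illustration of the first approach, here is a small sketch that uses torchaudio to turn a waveform into 80-dimensional log-Mel filterbank features. The file name, feature dimension, and window settings are typical choices made for illustration, not values taken from the systems discussed here.

```python
import torchaudio

# Load a mono waveform; shape (channels, num_samples).
waveform, sample_rate = torchaudio.load("example.wav")

# A few seconds of raw audio is already tens of thousands of samples,
# so we compress it into frame-level spectral features first.
# 80-dim log-Mel filterbanks with 25 ms windows and a 10 ms shift are a
# common choice for end-to-end ST encoders.
features = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=sample_rate,
    frame_length=25.0,
    frame_shift=10.0,
)
print(features.shape)  # (num_frames, 80): roughly 100 frames per second
```

The second approach replaces this hand-crafted step with a pre-trained acoustic model (for example wav2vec 2.0) that consumes the raw waveform directly and outputs learned frame-level representations.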
The Potential and Challenges of End-to-End Speech Translation

The end-to-end modeling approach has greater potential than traditional cascade speech translation, as suggested by the following derivation, where X denotes the audio input, and S and T denote the speech recognition result and the translation result, respectively [7]:
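One common way to write this comparison (the exact formulation in [7] may differ in detail) is:

```latex
% Marginalizing over all possible transcripts S:
P(T \mid X) \;=\; \sum_{S} P(T \mid S, X)\, P(S \mid X)

% A cascade system approximates this with two simplifications:
% (1) it keeps only the single best ASR hypothesis
%     \hat{S} = \arg\max_{S} P(S \mid X),
% (2) it assumes the translation depends on X only through S,
%     i.e. P(T \mid S, X) \approx P(T \mid S).
P(T \mid X) \;\approx\; P\bigl(T \mid \hat{S}\bigr),
\qquad \hat{S} = \arg\max_{S} P(S \mid X)

% An end-to-end model parameterizes P(T \mid X) directly,
% avoiding both approximations.
```

In other words, the cascade commits to a single transcript and then discards the audio when translating, while the end-to-end model keeps the full acoustic signal available throughout.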
Therefore, whether in terms of model complexity or translation quality, the end-to-end approach has greater potential.
If end-to-end speech translation has such great potential, why are current commercial speech translation systems still cascaded? The reason lies in the biggest shortcoming of the end-to-end approach: scarce data. Taking open-source academic data as an example, machine translation research commonly uses WMT data; WMT21 En-De alone contains more than 40 million parallel sentence pairs, and together with OpenSubtitles (film and TV subtitles), CCMatrix (mined from CommonCrawl) [8], and other sources, hundreds of millions of parallel sentence pairs can be accumulated for English-German alone. For speech recognition, the GigaSpeech dataset released in 2021 contains 10,000 hours of annotated English audio. For end-to-end speech translation, the most commonly used dataset is MuST-C [9], whose English-German portion contains about 400 hours of audio with roughly 250,000 segments of transcription and translation, far smaller than what is available for machine translation and speech recognition.

The main reason is that building a speech translation dataset is complicated and costly. We first need a data source that provides audio with public or licensed content together with the corresponding transcription and translation; the audio, transcription, and translation must then be segmented, aligned, and filtered. After this series of operations, the amount of valid data obtained is not large. For industry, annotating thousands or tens of thousands of hours of speech translation data also consumes considerable manpower, money, and time.

To this end, researchers have proposed many methods to improve end-to-end speech translation, such as making more effective use of large-scale speech recognition and machine translation data, introducing pre-trained models, and redesigning encoders and decoders. We have also accumulated a series of work in this direction.

Some Exploration and Experimentation

We try to leverage speech recognition and machine translation data to enhance end-to-end speech translation, focusing on three aspects: more efficient encoders and decoders, training techniques and strategies, and data augmentation.

LUT (AAAI 2021): Listen, Understand, and Translate

Paper address: https://ojs.aaai.org/index.php/AAAI/article/view/17509
Further reading: https://mp.weixin.qq.com/s/D0BnXHh1w0AuCBBhv0nFBQ

The article argues that the existing Transformer-based end-to-end speech translation model has two shortcomings: a single encoder carries the double burden of acoustic modeling and semantic understanding, and the source-language transcription is not fully exploited as supervision. LUT therefore decouples the encoder into an acoustic encoder and a semantic encoder: the acoustic encoder is supervised with the transcription, the semantic encoder is trained to match the representations that a pre-trained text model such as BERT produces for the transcription, and the decoder is trained with the usual translation loss. A simplified sketch of this decoupled encoder is given below.
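The following is a highly simplified sketch of the decoupled-encoder idea, not the authors' released code. Layer counts, dimensions, and the random `text_embeddings` stand-in for transcription embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledSTEncoder(nn.Module):
    """Toy LUT-style encoder: an acoustic encoder followed by a semantic encoder."""

    def __init__(self, feat_dim=80, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(enc_layer, layers)
        self.semantic_encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, fbank):                        # fbank: (batch, frames, feat_dim)
        h = self.input_proj(fbank)
        acoustic = self.acoustic_encoder(h)          # "listen"
        semantic = self.semantic_encoder(acoustic)   # "understand"
        return acoustic, semantic

encoder = DecoupledSTEncoder()
fbank = torch.randn(2, 100, 80)                      # dummy batch of filterbank features
acoustic, semantic = encoder(fbank)

# Auxiliary supervision: pull the pooled semantic states toward embeddings of
# the transcription produced by a pre-trained text model (e.g., BERT).
# Random tensors stand in for those embeddings here; in practice both sides
# would be projected to a shared dimension.
text_embeddings = torch.randn(2, 256)
semantic_loss = nn.functional.mse_loss(semantic.mean(dim=1), text_embeddings)
print(semantic_loss.item())
```

The decoder then attends over the semantic encoder's output and is trained with the standard translation loss, so each of the three stages receives its own supervision signal.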
COSTT (AAAI 2021): Joint Speech Recognition and Translation

Paper address: https://ojs.aaai.org/index.php/AAAI/article/view/17508
Further reading: https://mp.weixin.qq.com/s/Af6p1jVlkePrIZmUrjIaNw

Although end-to-end speech translation takes audio directly as input, cross-modal translation from speech to text is harder than text-to-text translation. Observing that human interpreters usually jot down key source-language words to help themselves during consecutive or simultaneous interpretation, the article proposes a "consecutive prediction" scheme for the decoder of the sequence-to-sequence model: the decoder first predicts the transcription of the audio and then continues to predict the translation. In this way, the decoder-side self-attention can "refer to" the transcribed content while generating the translation. In addition, the decoder on its own is effectively a bilingual language model, so it can be pre-trained on parallel text translation corpora, which effectively alleviates the scarcity of speech translation training data.

Chimera (ACL 2021): Unifying Speech and Text

Paper address: https://aclanthology.org/2021.findings-acl.195/
Further reading: https://mp.weixin.qq.com/s/G_sqv9kAebm-PvIcu1hGHQ

Many of us have had the experience that listening to songs with a strong beat and lyrics noticeably hurts our work efficiency. Cognitive neuroscience offers an explanation: after sound and text signals reach the brain, they share part of the same processing pathway. The Chimera model builds on this idea. A speech or text input is first encoded by its own acoustic or text encoder, and a set of shared "memory" elements then extracts the truly useful semantic information; the model does not initially distinguish whether this semantic information came from audio or from text. As a result, Chimera learns a shared semantic space that models audio and text at the same time. Moreover, the path from text input to translation output can be trained with additional text translation data, further alleviating the shortage of speech translation corpora.

XSTNet (InterSpeech 2021): A Progressive Multi-Task Learning Framework

Paper address: https://www.isca-speech.org/archive/interspeech_2021/ye21_interspeech.html

To make better use of speech recognition, text translation, and speech translation data, the article designs a single model that handles all three tasks. The encoder accepts both text and audio input, with the two modalities sharing all encoder parameters. During decoding, the language of the sentence to be generated is used as the beginning-of-sentence tag: if it matches the audio language, the task is recognition; if not, it is translation (a small sketch of this tagging scheme is given below). The article also proposes progressive training: first pre-train the whole network on text translation data, then gradually add the speech recognition and speech translation tasks for joint fine-tuning. Experiments show that this training schedule works better than fine-tuning on the speech translation task alone.
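Here is a minimal sketch of how such multi-task training examples might be constructed; the language tags and field names are chosen for illustration rather than taken from the XSTNet implementation.

```python
# Build training examples for a single model covering ASR, MT, and ST.
# The target sequence starts with a language tag; the model learns that
# <en> after English audio means "transcribe", while <de> means "translate".

def make_example(source, source_type, target_text, target_lang):
    """source is audio features (ASR/ST) or source-language text (MT)."""
    return {
        "source": source,
        "source_type": source_type,                     # "audio" or "text"
        "target": [f"<{target_lang}>"] + target_text.split(),
    }

# ASR: English audio -> English transcript (tag matches the audio language).
asr = make_example("audio_features_0001", "audio", "hello world", "en")
# ST: English audio -> German translation (tag differs from the audio language).
st = make_example("audio_features_0001", "audio", "hallo welt", "de")
# MT: English text -> German translation (text goes through the shared encoder).
mt = make_example("hello world", "text", "hallo welt", "de")

for ex in (asr, st, mt):
    print(ex["source_type"], ex["target"])
```

Under progressive training, batches would initially come from the MT examples only, with ASR and ST batches mixed in gradually.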
IWSLT 2021 Offline Speech Translation Evaluation System

Paper address: https://aclanthology.org/2021.iwslt-1.6

The article explores the upper limit of what an end-to-end system can achieve. By introducing more speech recognition and machine translation data and combining multi-task learning, pseudo-labeling, model ensembling, and other techniques, it improves end-to-end speech translation by nearly 8 BLEU and further narrows the gap with the cascade system.

NeurST (ACL 2021 Demo): An End-to-End Speech Translation Toolkit and Benchmark

Paper address: https://aclanthology.org/2021.acl-demo.7/
Project address: https://github.com/bytedance/neurst

The article introduces an end-to-end speech translation toolkit whose modular design makes it easy to plug in and modify data preprocessing modules, encoder and decoder structures, and so on. It also provides data preprocessing, training, and inference scripts for standard speech translation datasets such as libri-trans and MuST-C, together with benchmark results.

STEMM (ACL 2022): Cross-Modal Mixup Training to Alleviate the Modality Gap

Paper address: https://aclanthology.org/2022.acl-long.486/

Recent work has tried to introduce more text translation data to alleviate the scarcity of end-to-end speech translation data. However, speech and text representations are inconsistent, a problem the article calls the modality gap, which makes it hard for the model to learn knowledge useful for speech translation from text translation data. To use text translation data more effectively, the article proposes to randomly replace part of the speech representation with the corresponding text representation during training, producing a mixed speech-text representation sequence from which the model can learn a semantic space shared across the two modalities. At the same time, multi-task learning pushes the translation generated from the original audio to be closer to the translation generated from the mixed representation, improving translation quality at decoding time.

ConST (NAACL 2022): Contrastive Learning for the Modality Gap

Paper address: https://arxiv.org/abs/2205.02444

Building on XSTNet, this article also tackles the modality gap. It argues that, under a multi-task learning framework, the speech representation and the text representation of the same sentence should be close in the semantic space, and therefore adds a contrastive learning loss that pulls the speech and text representations of the same sentence together, making better use of additional text translation data to improve speech translation.

MOSST (ACL 2022): End-to-End Simultaneous Interpretation Based on Word Segmentation

Paper address: https://aclanthology.org/2022.acl-long.50/

Streaming speech translation must translate real-time speech input into text. Traditional end-to-end systems usually decide when to read and when to write based on a fixed input duration. This causes two problems: for long speech segments, it cannot guarantee that each chunk of input is complete, which hurts translation quality; for short segments, it cannot stop reading early, which increases latency. A sketch of such a fixed-duration read/write loop is shown below.
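Below is a toy sketch of the fixed-duration policy criticized above, not code from MOSST; `translate_prefix` is a hypothetical stand-in for incremental decoding with a streaming ST model.

```python
def translate_prefix(audio_prefix):
    """Hypothetical incremental decoder: translate the audio received so far."""
    return f"<translation of {len(audio_prefix)} samples>"

def fixed_duration_streaming(audio_stream, chunk_samples=16000):
    """Read a fixed-size chunk (~1 s at 16 kHz), then write a new hypothesis."""
    received = []
    for start in range(0, len(audio_stream), chunk_samples):
        received.extend(audio_stream[start:start + chunk_samples])  # READ
        print(translate_prefix(received))                           # WRITE
    # Problem 1: a chunk boundary may cut a word in half, hurting quality.
    # Problem 2: even if the speaker pauses early, the system still waits out
    # the full chunk, adding unnecessary latency. A monotonic segmentation
    # module instead detects boundaries and decides read/write dynamically.

fixed_duration_streaming(list(range(48000)))  # dummy 3-second "waveform"
```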
To better decide when to read and when to write in streaming speech translation, the article introduces a monotonic segmentation module that detects boundaries in the audio stream and enables dynamic read/write decisions. Experiments show that the resulting model surpasses previous streaming speech translation models in both latency and quality, while also performing well in non-streaming scenarios.

Conclusion

We believe that end-to-end speech translation is not only less complex than a cascade system but also has greater potential in terms of quality. Building on the explorations above, we combined data augmentation, multi-task learning, pre-training, and other techniques to build Chinese-English and English-Chinese end-to-end speech translation systems that perform well on everyday conversational translation. We also used LightSeq [10] to speed up model inference; the service response time is more than 70% faster than that of the cascade system. The system can currently be experienced in the "Volcano Translation Mini Program - Audio Translation".