Why doesn’t your voice assistant speak like a human? About the principles and challenges of TTS technology

Every tough guy dreams of coming home from work to Samantha, the AI girlfriend in the movie "Her". Although you can only hear her voice and never see her, you can feel every shade of emotion just by listening to her speak.


The real voice behind Samantha comes from Scarlett Johansson. Some people say, "Just listening to her voice has satisfied all my fantasies about her."

It can be said that sound is crucial to eliminating the barriers between people and machines and narrowing the distance between them.

But in real life, the voices spoken by AI voice assistants are still far from our ideal voices.

Why doesn't your robot girlfriend speak like Scarlett Johansson? Today, Zheng Jiewen, a speech synthesis algorithm engineer at Rokid A-Lab, will talk about speech synthesis technology and analyze the reasons.


The technical principles behind TTS - front-end and back-end systems

The technology that lets voice assistants speak is called TTS (text-to-speech), also known as speech synthesis.

Creating natural, realistic, and pleasant-sounding TTS is what scientists and engineers in the field of AI have long been working towards. However, there are always obstacles along the way. What are they? Let's start with the basic principles of TTS.

TTS technology essentially solves the problem of "converting text into speech", allowing machines to speak.

Figure 1: Speech synthesis, the problem of converting text into speech

But this process is not easy. In order to reduce the difficulty of machine understanding, scientists split the conversion process into two parts - the front-end system and the back-end system.

Figure 2: TTS composed of front-end and back-end

The front end is responsible for converting the input text into an intermediate result, and then sending this intermediate result to the back end, which generates sound.
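To make this division of labor concrete, here is a minimal Python skeleton of the two-stage pipeline. The function names and the dict-based "specification" are illustrative assumptions, not any real TTS library's API; the two stages are fleshed out in the rest of the article.

```python
def frontend(text: str) -> dict:
    """Linguistic analysis: turn plain text into an intermediate 'specification'.

    In a real system this covers text normalization, pinyin conversion,
    and prosody prediction, as described in the sections below.
    """
    ...


def backend(spec: dict) -> bytes:
    """Waveform generation: produce audio matching the specification,
    either by splicing recorded clips or by parametric generation."""
    ...


def tts(text: str) -> bytes:
    return backend(frontend(text))
```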

Next, let's take a look at how the front-end and back-end systems work together.

Front-end system for generating "linguistic specifications"

When we were young, we had to learn pinyin before we could recognize characters. With pinyin, we can sound out characters we do not know. For TTS, the intermediate result the front-end system produces from the text plays a similar role to pinyin.

However, pinyin alone is not enough, because we read aloud not one character at a time but one sentence at a time. If a person cannot control the rhythm of his speech with proper intonation, listeners will feel uncomfortable and may even misunderstand what he wants to convey. Therefore, the front end needs to add this intonation information to tell the back end how to "speak" properly.

We call this rhythmic information prosody. Prosody is a broad notion, so to simplify the problem it is broken down into information such as pauses and stress. Pauses tell the back end where to stop while reading a sentence, and stress tells it which parts to emphasize. All of this information combined can be called a "linguistic specification."

Figure 3: The front end generates a "linguistic specification" to tell the back end what kind of content we want to synthesize.

The front end is like a linguist. It performs various analyses on the plain text given to it, and then provides a specification sheet to the back end, telling it what kind of sound should be synthesized.

In an actual system, for the machine to speak correctly, this "specification" is far more complicated than what we describe here.
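As a purely illustrative example of what such a specification might contain for a short sentence, here is a toy version. The field names, tone-numbered pinyin, and the 0-3 pause scale are all invented for this sketch and do not follow any standard format.

```python
# Toy "linguistic specification" for a sentence like 你真漂亮 ("You are so beautiful").
spec = {
    "text": "你真漂亮",
    "pinyin": ["ni3", "zhen1", "piao4", "liang5"],  # tone 5 = neutral tone
    "pause_after": [0, 1, 0, 3],  # 0 = no pause, 1 = slight pause, 3 = sentence end
    "stress": [0, 1, 1, 0],       # 1 = units to emphasize when speaking
}
```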

The back-end system that plays the role of "pronouncer"

When the back-end system receives the "linguistic specification", the goal is to generate sounds that are as consistent as possible with the description in this specification.

Of course, a machine cannot generate sound out of thin air. Before that, we still need to record anywhere from a few hours to dozens of hours of audio in a studio (the amount depends on the technology used), and then use this data to build the back-end system.

Currently, there are two mainstream back-end systems: one is based on waveform splicing, and the other is based on parameter generation.

The idea behind waveform splicing is simple: store the pre-recorded audio on the computer, and when we want to synthesize speech, look up the audio clips that best fit the "specification sheet" issued by the front end, then splice those clips together one by one to form the final synthesized speech.

For example, if we want to synthesize the sentence "You are so beautiful", we look up audio clips in the database for each of the four characters of the Chinese sentence (roughly "you / really / good / looking") and then splice these four clips together.

Figure 4: Synthesizing "You are so beautiful" with the splicing method

Of course, actual splicing is not that simple. First we have to choose the granularity of the splicing unit; after that, we also need to design a splicing cost function, as sketched below.
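As a rough sketch of the idea (the costs below are made-up placeholders; real systems use spectral and prosodic distances and far richer features), a unit-selection back end scores each candidate clip with a target cost (how well it matches the specification) and a concatenation cost (how smoothly it joins its neighbor), then searches for the cheapest overall sequence:

```python
# Toy unit selection: pick one recorded clip per target unit so that the sum of
# "target cost" + "concatenation cost" is minimal.  Each clip/target is a dict
# with made-up fields ("pitch", "start_pitch", "end_pitch") for illustration.

def target_cost(clip, target):
    # How well this clip matches what the front end asked for.
    return abs(clip["pitch"] - target["pitch"])


def concat_cost(prev_clip, clip):
    # How audible the join between two adjacent clips would be.
    return abs(prev_clip["end_pitch"] - clip["start_pitch"])


def select_units(targets, candidates):
    """Viterbi-style search: candidates[i] is the list of clips for target i."""
    # best[i][j] = (lowest total cost ending in clip j at position i, backpointer)
    best = [[(target_cost(c, targets[0]), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(c, targets[i])
            cost, back = min(
                (best[i - 1][j][0] + concat_cost(p, c) + tc, j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path of clip indices.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```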

The parameter generation method works on a very different principle from waveform splicing. A parametric system first uses mathematical methods to extract the most salient features from the recorded audio, and then uses a learning algorithm to train a converter that maps the front end's linguistic specification to these audio features.

Once we have this converter from linguistic specification to audio features, synthesizing the same sentence "You are so beautiful" works as follows: we first use the converter to produce the audio features, and then use another component to turn those audio features back into audible sound. In the field, the converter is called an "acoustic model", and the component that turns acoustic features into sound is called a "vocoder".
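Here is a deliberately crude sketch of that two-step back end. The "acoustic model" below just fabricates an F0 contour instead of predicting real spectral features, and the "vocoder" only produces a sine wave or noise per frame, so this illustrates the structure rather than being a working synthesizer:

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_SHIFT = 80  # 5 ms per frame at 16 kHz (illustrative choice)


def acoustic_model(spec_frames):
    """Stand-in for a trained acoustic model.

    A real model predicts spectral features and F0 per frame from the
    frame-level linguistic specification; here we just invent an F0 contour.
    """
    n = len(spec_frames)
    f0 = np.full(n, 200.0)   # pretend every frame is voiced at 200 Hz
    f0[: n // 10] = 0.0      # pretend the first 10% of frames are unvoiced
    return f0


def toy_vocoder(f0_per_frame):
    """Very crude vocoder: a sine at F0 for voiced frames, noise for unvoiced ones."""
    out, phase = [], 0.0
    for f0 in f0_per_frame:
        if f0 > 0:  # voiced frame
            t = np.arange(FRAME_SHIFT)
            out.append(np.sin(phase + 2 * np.pi * f0 * t / SAMPLE_RATE))
            phase += 2 * np.pi * f0 * FRAME_SHIFT / SAMPLE_RATE
        else:       # unvoiced frame
            out.append(0.1 * np.random.randn(FRAME_SHIFT))
    return np.concatenate(out)


waveform = toy_vocoder(acoustic_model(range(400)))  # about two seconds of audio
```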

Why doesn't your AI voice assistant speak like a human?

If we simply give an answer to this question, there are two main reasons:

First, your AI makes mistakes. To synthesize a voice, the AI has to make a series of decisions; if any of them go wrong, the synthesized voice will have problems, sound mechanical, and feel unnatural. Both the front-end and the back-end systems of TTS can make such mistakes.

Second, to make speech synthesis tractable, engineers simplify the problem, which means the description of how speech is produced is not entirely accurate. This simplification comes on one hand from our limited understanding of language and of how human speech is produced, and on the other hand from the cost constraints that commercial speech synthesis systems must operate under.

Next, let’s talk specifically about the front-end and back-end errors that cause AI voice assistants to speak unnaturally.

Front-end errors

The front-end system, as a linguist, is the most complex part of the entire TTS system. In order to generate the final "linguistic specification" from plain text, this linguist has to do much more than we imagine.

Figure 5: A typical front-end processing flow

A typical front-end processing flow is:

Text structure analysis

When we feed a text into the system, the system must first determine what language the text is in; only then does it know how to process it. It then splits the text into individual sentences, which are passed on to the subsequent modules.

Text regularization

In the Chinese scenario, the purpose of text regularization is to convert punctuation marks or numbers that are not Chinese characters into Chinese characters.

For example, in "this operation is 666", the system needs to convert "666" into "六六六" (six, six, six).
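A minimal sketch of the two competing readings of a digit string (limited to numbers below 1,000 and ignoring colloquial conventions; deciding which reading a given context calls for is the hard part, and is not handled here):

```python
DIGITS = "零一二三四五六七八九"


def read_digit_by_digit(s: str) -> str:
    """Read each digit separately, e.g. a room number: '666' -> '六六六'."""
    return "".join(DIGITS[int(ch)] for ch in s)


def read_as_number(s: str) -> str:
    """Read as a quantity, e.g. a price: '666' -> '六百六十六'.

    Simplified: handles 1-999 only and ignores conventions such as
    reading 16 as '十六' rather than '一十六' when it stands alone.
    """
    n = int(s)
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds:
        parts.append(DIGITS[hundreds] + "百")
        if not tens and ones:
            parts.append("零")          # e.g. 606 -> 六百零六
    if tens:
        parts.append(DIGITS[tens] + "十")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)


print(read_digit_by_digit("666"), read_as_number("666"))  # 六六六 六百六十六
```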

Text-to-phoneme conversion

That is, converting the text into pinyin. Because Chinese has polyphonic characters, we cannot simply look each character up in the Xinhua Dictionary to find its pronunciation; we must use auxiliary information and algorithms to decide the correct reading. This auxiliary information includes word segmentation and the part of speech of each word, as in the toy example below.
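A toy illustration of that idea: look up the reading of whole (already segmented) words first, and fall back to a per-character default only when the word is unknown. The tiny lexicon below is hand-made purely for this sketch.

```python
# Word-level lexicon: the word decides the reading of a polyphonic character.
WORD_PRONUNCIATIONS = {
    "长城": ["chang2", "cheng2"],   # Great Wall: 长 = chang2
    "长大": ["zhang3", "da4"],      # to grow up: 长 = zhang3
    "头发": ["tou2", "fa4"],        # hair: 发 = fa4, not the default fa1
}

# Per-character fallback (most common reading only).
CHAR_PRONUNCIATIONS = {"长": "chang2", "城": "cheng2", "大": "da4", "头": "tou2", "发": "fa1"}


def to_pinyin(words):
    """Convert a pre-segmented sentence (a list of words) to pinyin."""
    result = []
    for word in words:
        if word in WORD_PRONUNCIATIONS:
            result.extend(WORD_PRONUNCIATIONS[word])
        else:
            result.extend(CHAR_PRONUNCIATIONS.get(ch, "?") for ch in word)
    return result


print(to_pinyin(["长城", "长大"]))  # ['chang2', 'cheng2', 'zhang3', 'da4']
```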

Prosody prediction

This determines the rhythm of the sentence, that is, its intonation. A typical simplified system, however, only predicts pause information: whether a pause is needed after a word, and how long it should be.

From the above four steps, we can see that any step may go wrong. Once an error occurs, the generated linguistic specification will be wrong, resulting in errors in the synthesized sound at the back end. For a TTS system, typical front-end errors are of the following types:

1. Text regularization error

Since the written form of text differs from how it is read aloud, the front end must convert the written form into the spoken form at a very early stage. This process is called "text regularization" in the field, as with the "666" example above.

Text regularization errors are easy to notice in a TTS system. Take the following sentence:

"I spent 666 yuan and stayed in a room numbered 666." (Click to listen to the audio)

We know that the first "666" should be read as "六百六十六" (six hundred and sixty-six) and the second as "六六六" (six, six, six). Yet a TTS system can easily get this wrong.

Another example: "I think there is a 2-4 chance. The score is 2-4."

Should these two "2-4"s be read as "二到四" or "两到四" ("two to four", a range), or as "二比四" ("two to four", a score)? You can tell the correct readings at a glance, but for the front-end system this is another hard problem.

2. Phonetic errors

Chinese is a profound and subtle language, and reading it aloud correctly is not easy. One of the harder problems: when faced with a polyphonic character, which pronunciation should be chosen?

For example, in the sentences "My hair has grown again" and "My hair is long", should the character "长" be read as "cháng" (second tone, "long") or "zhǎng" (third tone, "to grow")?

Of course, people can easily pick the correct answer. What about the following sentence:

"If a person is capable (行 xíng), then whatever line of work (行 háng) he takes up will work out (行 xíng); if he is not capable, then whatever line of work he takes up will not work out." (In the Chinese original, nearly every one of these words is the same character, 行.)

You probably need to pause and think to get every "行" right. For an AI it is even harder.

From time to time you may hear an AI assistant mispronounce a polyphonic character. Such mistakes are easily caught by the ear and immediately give the impression: "this is definitely not a real person speaking."

Of course, polyphone errors are only one type of phonetic error. There are others, such as neutral tones, erhua (rhotacized endings), and tone sandhi. In short, getting your AI assistant to read everything accurately is not easy.

3. Rhythm errors

As mentioned above, to convey information accurately, people speak with a sense of rhythm. If someone never pauses within a sentence, it is hard for us to follow what he means, and we may even find him impolite. Scientists and engineers try every means to make TTS read with proper rhythm, but in many cases the result is still unsatisfactory.

This is because language varies so much: the rhythm of our reading changes with the context and even the occasion. The most important aspect of rhythm here is the pausing within a sentence, because pauses are the basis of reading a sentence correctly. If the pauses are wrong, the error is easily caught by the ear.

For example, take the sentence "Switching to single-track loop mode for you". If we use "|" to mark pauses, a person would normally read it with pauses in natural places, roughly: "Switching to | single-track loop mode | for you".

But if your AI assistant places the pauses in odd spots, say "Switching to sin | gle-track loop | mode for you", with a strange rhythm, you may well be left speechless.

Back-end errors

Having covered the "linguist who often makes mistakes", let's look at the back end: the "pronouncer" who reads aloud according to the "specification" handed over by the "linguist".

As mentioned above, there are two main back-end approaches: splicing and the parametric method. Apple's Siri and Amazon's Alexa currently use waveform splicing, while in China most companies use the parametric method, so let's focus on the errors a parametric back end can make.

After receiving the linguistic specification from the front end, the first thing the back-end system does is decide how long each Chinese character should be pronounced (or even each initial and final). The component that decides pronunciation length is called the "duration model".

With this timing information, the back-end system can convert the linguistic specification into audio features using the converter mentioned above (the acoustic model), and then another component, the "vocoder", turns those audio features back into sound. From the duration model to the acoustic model to the vocoder, every step may make mistakes or fail to produce exactly the result we want.
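A sketch of where the duration model sits in that chain: it assigns each phone a number of frames, and the phone-level specification is then stretched to frame level before being fed to the acoustic model. The duration rule below is a made-up placeholder rather than a trained model.

```python
import numpy as np

FRAME_MS = 5  # one acoustic frame every 5 ms (illustrative)


def duration_model(phones, pause_after):
    """Predict a duration in frames for each phone.

    Placeholder rule: a fixed base length, lengthened before a pause; a real
    duration model is trained on recorded data and uses far richer context.
    """
    durations = []
    for phone, pause in zip(phones, pause_after):
        frames = 20                  # ~100 ms base duration
        if pause:
            frames += 10 * pause     # pre-pausal lengthening
        durations.append(frames)
    return durations


def upsample_to_frames(phone_features, durations):
    """Repeat each phone's feature vector for its predicted number of frames."""
    return np.concatenate([
        np.tile(feat, (frames, 1))
        for feat, frames in zip(phone_features, durations)
    ])


phones = ["n", "i3", "h", "ao3"]
pause_after = [0, 0, 0, 2]
phone_features = [np.random.randn(8) for _ in phones]   # dummy linguistic features
frame_features = upsample_to_frames(phone_features, duration_model(phones, pause_after))
print(frame_features.shape)  # (number of frames, 8) -> fed to the acoustic model
```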

In a TTS system, typical backend errors are of the following types:

1. Duration model error

When a sentence is read aloud, the duration of each word depends on the context. The TTS system must decide, from the context, which words should be pronounced longer and which shorter. A typical example is the pronunciation of modal particles.

Usually these modal particles carry the speaker's tone and emotion, so they are pronounced longer than ordinary words, as in this sentence: "嗯 (hmm)... I think he is right."

The "hmm" here, in this scenario, obviously needs to be prolonged to express a "judgment after thinking."

But not every "嗯" needs to be dragged out this long, for example: "嗯 (huh)? What did you say just now?"

The "嗯" here represents a questioning tone, and its pronunciation is much shorter than the "嗯" in the above sentence. If the duration model cannot correctly determine the pronunciation duration, it will give people an unnatural feeling.

2. Acoustic model error

The main source of acoustic model errors is the "pronouncer" running into pronunciations it never saw during training. The acoustic model's job is to learn, from the training corpus, the acoustic features that correspond to the various "linguistic specifications". If, at synthesis time, the machine encounters a linguistic description it never saw during training, it is hard for it to output the correct acoustic features.

A common example is erhua. In principle every Chinese syllable has an erhua form, but some of them occur very rarely in real speech, so a recording corpus usually does not cover all of them and keeps only the most common ones. As a result, some erhua sounds cannot be produced at all, or cannot be produced well.

3. Vocoder errors

There are many kinds of vocoders, but the more traditional and common ones rely on fundamental frequency (F0) information. What is the fundamental frequency? It is the rate at which your vocal cords vibrate when you speak. A simple way to feel it: press the four fingers of one hand (all except the thumb) against your throat and start talking to yourself.

You will feel your throat vibrating; that vibration is the fundamental frequency. Voiced sounds are produced with the vocal cords vibrating; sounds produced without vocal-cord vibration are unvoiced. Consonants can be voiced or unvoiced, and vowels are generally voiced. Therefore, the positions of vowels and voiced consonants in the synthesized speech should have corresponding fundamental frequency values; if the F0 output by the acoustic model deviates, the sound synthesized by the vocoder will sound strange.
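To make "fundamental frequency" concrete, here is a minimal autocorrelation-based F0 estimator for a single frame of speech. It is a toy: the pitch trackers actually used when building voices add voicing decisions, smoothing, and protection against the octave (doubling/halving) errors mentioned below.

```python
import numpy as np


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the F0 of one windowed speech frame by autocorrelation."""
    frame = frame - np.mean(frame)
    # Autocorrelation at non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0  # silent frame -> treat as unvoiced
    # Only search lags that correspond to plausible pitch periods.
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    if lag_max <= lag_min:
        return 0.0
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    # Weak periodicity -> call the frame unvoiced (conventionally F0 = 0).
    if ac[best_lag] / ac[0] < 0.3:
        return 0.0
    return sample_rate / best_lag


# Example: a 40 ms frame of a 120 Hz periodic signal should come out near 120 Hz.
sr = 16000
t = np.arange(int(0.04 * sr)) / sr
frame = np.sign(np.sin(2 * np.pi * 120 * t))  # crude periodic test signal
print(estimate_f0(frame, sr))
```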

When training the back-end "pronouncer", we also need algorithms to extract the fundamental frequency from the recordings. A poor F0 extraction algorithm may produce missing, doubled, or halved F0 values, which directly degrades the F0 prediction model. If F0 is not predicted where it should be, the synthesized voice sounds hoarse, which noticeably hurts the listening experience.

A good vocoder must also handle the relationship between the fundamental frequency and its harmonics. If the high-frequency harmonics are too prominent, the result is a buzzing, clearly mechanical sound.

Summary

In this article, we introduced the basic principles of TTS and analyzed why voice assistants still cannot speak like real people: TTS makes mistakes in its many decisions, which leads to incorrect or unnatural reading, and in order to let computers synthesize speech at all, engineers simplify the text-to-speech problem, so the portrayal of how speech is produced is not entirely accurate. This simplification stems on one hand from the limits of our understanding of how speech and language are produced, and on the other hand from the limits of current computing tools.

Although many new methods have appeared in this field, especially deep learning methods that convert text to speech more directly and have demonstrated very natural voices, making your AI assistant speak exactly like a human remains a very challenging task.

About the author: Zheng Jiewen holds a master's degree in artificial intelligence from the University of Edinburgh, where he studied under Professor Simon King, an internationally renowned speech synthesis expert. He is currently a speech synthesis algorithm engineer at Rokid A-Lab, responsible for speech synthesis engine architecture design and back-end acoustic model development.

This article is reproduced from Leiphone.com. If you need to reprint it, please go to Leiphone.com official website to apply for authorization.
