Most people are familiar with voice assistants, and many have talked with Siri, the voice assistant in iOS, whether just playing with it or genuinely needing its help. Siri has little trouble understanding what you say, but if you try to hold a real conversation with it, something feels off. Regardless of whether it answers your questions correctly, the voice it replies with makes it clear you are not chatting with a person.

Indeed, in speech recognition, the leading companies in China and abroad have reached accuracy rates of around 95%. In speech generation, however, almost no company can make a machine sound just like a human. Even with simple phrases, you can tell whether a voice is machine-generated or a real person. Yet as more and more people use voice interaction, making computers sound more human has become a major challenge for many software companies and programmers.

According to The New York Times, IBM spent 18 months at the turn of the century teaching its Watson system to speak. Despite its intelligence, Watson's speaking ability was poor, because it did not sound like a human voice at all.
Michael Picheny, senior manager at IBM Research. Image from The New York Times.

Today, computer voices are synthesized by machines (except for some weather forecasts and navigation prompts, which are fully pre-recorded by humans). The database of real recordings used to synthesize the final voice is usually very large: it contains a word's actual pronunciation, its pronunciation in different tones, and even partial pronunciations of the word. A voice actor typically needs at least 10 hours to record such a database.

Even with a voice database this large, it is still hard to synthesize speech that sounds close to a real person. The biggest difficulty is giving the synthesized voice human emotion. Alan Black, a computer scientist at Carnegie Mellon University's Language Technologies Institute, told The New York Times that there is no way to tell a speech synthesizer that a given passage should be read with feeling.

Of course, designers often stress that they do not want synthetic voices to deceive people into thinking they are real. But they still hope that voice interaction between machines and people can become more natural, more like communication between people.

In fact, if a machine's pronunciation gets too close to a real person's, it makes people uncomfortable. In 1970, the Japanese roboticist Masahiro Mori published an article titled "The Uncanny Valley," whose core idea is that when a robot is too similar to a human, even the slightest flaw will make people uneasy. According to Mori's hypothesis, as an object's degree of human likeness increases, people's emotional response to it follows a rise-fall-rise curve. The uncanny valley is the point where a robot reaches "close to human" similarity and human affinity suddenly plunges into revulsion.
Moving humanlike figures produce larger swings in emotional response than still ones. Image from Wikipedia.

ToyTalk is a company that makes human voices for children's toys. Its CEO, Brian Langner, says that when a machine gets some things right, people assume it can get everything right. So in his products he deliberately lets the machine make some mistakes. After all, he makes toys, and there is nothing wrong with a few mistakes that make people laugh.

The reality is that, despite the efforts of so many scientists, synthesized speech is still far enough from human speech that we need not yet worry about falling into the uncanny valley. To make Watson "speak properly," IBM recruited 25 voice actors. After extensive experiments and adjustments, they finally synthesized a voice that was more comfortable to listen to, although listeners could still clearly tell it was not a real person speaking.

If voice interaction is to develop rapidly, synthetic speech must become more comfortable to hear. Otherwise, this kind of interaction can only be described as voice input and machine execution, with no real communication between humans and machines.