Is AI unbeatable at faking human voices?

One morning in 2014, Val Kilmer woke up in a pool of his own blood. The only thing that had seemed wrong with his body was a lump in his throat, which made it difficult for him to swallow.

He was soon diagnosed with laryngeal cancer, which explained why he had been vomiting blood. As part of his treatment he underwent a tracheotomy. The surgery left a hole in his throat, and he needed a tube to eat. From then on, "Breathe or eat?" became an either-or choice.

Val Kilmer's rivalry with Tom Cruise in Top Gun | Source: Looper

A more serious consequence for the Hollywood actor, who played Batman in the 1995 film, was that he lost his voice. Now, when he tries to speak, he can only make a sound somewhere between "a squeak and a growl".

Last year, Kilmer decided to work with the artificial intelligence company Sonantic to restore his "ability to speak." Working from a limited set of recordings, they cloned a voice very similar to how Kilmer sounded before he became ill, one that will be able to speak on his behalf in the future.

AI voice synthesis is already a mature technology. Some mainstream platforms that offer free trials, such as Resemble AI and Descript, need only 25 recorded sentences or 10 minutes of audio to clone your voice. Of course, the larger the training set, the more the model will sound like you. The minimum requirement? As little as 3.7 seconds.

Besides helping patients like Kilmer, voice cloning has another striking use: "resurrecting" the dead, whether they are late relatives or late celebrities. Not long ago, the speech synthesis company Play.ht released a podcast episode in which Steve Jobs chats with the well-known podcast host Joe Rogan. Both the script and the voices were synthesized entirely by AI.

The "fakes" chat and laugh throughout the podcast, and the real people never need to take part at any point. Does this amount to copyright infringement? For those who have passed away in particular, who owns their voices? Can anyone use them?

The harder question is how to tell the real from the fake.

A good technology, put to use for fraud?

Don’t be so sure you can tell a human voice from an AI one.

In March 2019, an employee of a British energy company received a call from his boss, asking him to transfer 220,000 euros to a supplier in Hungary within an hour. The "boss" on the other end of the line had a slight German accent and sounded exactly like his real boss. He complied at once, without any suspicion. The money was then quickly moved on to Mexico and proved difficult to recover. In 2020, a bank manager in Hong Kong was deceived by a cloned voice and approved a transfer of 35 million US dollars to scammers.

Cases like these are becoming more and more common. In a VMware survey this year, two-thirds of the companies surveyed said that the fraud attacks they had faced in the past year involved forged audio or video.

When they hear a familiar voice on the phone, most people "haven't built up the muscle memory to really deal with it," said Lisa O'Connor, managing director at Accenture Security.

Physiologically, too, the human brain is poorly equipped to catch fake voices.

A 2019 UC Riverside study found clear differences in people’s brain scans when they looked at an authentic Rembrandt painting and a fake one; no such differences appeared when they listened to the real Morgan Freeman, a synthesized Freeman, and an impersonator.

“The results suggest that humans may be inherently unable to distinguish between real and non-real sounds.”

There is no significant difference in human brain activity when listening to real and synthetic voices | Source: Paper illustration

Is AI unbeatable at faking human voices?

Scientists are working on countermeasures.

In a recent study, researchers at the University of Florida discovered a weakness in the machine: it has no vocal tract. In other words, human speech is constrained by the structure of each person's vocal tract, and AI has no such "limitations".

For decades, scientists have tried to recreate the sounds of prehistoric creatures. What would the roars and cries of mammoths, dinosaurs, and other extinct animals have sounded like? The shape of the bones provides many clues. Parasaurolophus, for example, had a long cavity in its skull, which scientists use to estimate its resonant frequency.
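
As a rough illustration of how the size of a cavity translates into a resonant frequency, the sketch below uses the textbook model of a uniform tube closed at one end, whose resonances are f_n = (2n - 1) * c / (4L). This is a simplification chosen for illustration, not the model used in the research mentioned here; the function name and the example lengths are assumptions, not values from the source.

```python
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air at roughly 20 degrees Celsius


def closed_tube_resonances(length_m, count=3, speed=SPEED_OF_SOUND_M_S):
    """Resonant frequencies (Hz) of an idealized uniform tube closed at one end.

    Uses the quarter-wavelength formula f_n = (2n - 1) * c / (4 * L).
    """
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Both lengths are hypothetical examples, not measurements from any study.
    for label, length in [("2.0 m cavity", 2.0), ("0.17 m tube", 0.17)]:
        frequencies = closed_tube_resonances(length)
        print(label, [round(f, 1) for f in frequencies])
```

Longer tubes resonate at lower frequencies, which is why the dimensions of a cavity carry information about the sounds it can produce.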

The same holds for the human voice, which is produced when the structures of the vocal tract, including the vocal cords, tongue, and lips, work together to squeeze air and so produce and shape sounds. Using acoustic and fluid dynamics models, researchers can work backward from a sound to infer the shape of the structure that produced it.

The result usually looks something like this: an irregular passage with bumps and valleys.

The degree of opening of the mouth determines the sound we make | Source: Screenshot of the paper

However, when they fed machine-generated sounds into the same model, something strange happened:

The red circle shows the machine’s “vocal tract structure” | Source: Screenshot of the paper

The "vocal tract" of the machine voice looks like a long, thin straw, completely unlike the normal structure of the human body. Just by looking at the side-view anatomical diagram, you can tell almost immediately whether a voice is human or machine. Using this method, the researchers tested 4,966 audio clips and reached an accuracy of 99.9%.

It is easy to imagine this becoming a standard feature before long. When you answer a call, a plug-in would start running at the same time, determine whether the voice on the other end belongs to a real person or a machine, and warn you if necessary.

Many groups are already working on this. In 2019, to help combat cloned voices and fake audio, Google released a synthetic speech database to support research on fake-audio detection. It contains thousands of phrases "spoken" by Google's deep learning models in 68 different voices covering a variety of accents, in the hope of encouraging outside researchers to develop more voice authentication solutions.

Without the scientists' tools, what can we do on our own?

There are a few tips, but they rely mostly on intuition.

Pindrop, a voice authentication company, has been building synthetic voices itself, and in the process has discovered several weaknesses in the machine:

It handles fricatives such as f, s, v, and z poorly, because the software has difficulty distinguishing them from noise.

It struggles with drawn-out sounds: when a speaker lets a word trail off, the algorithm has trouble separating the end of the word from the background noise in the recording, which can lead to oddly placed sentence breaks.

It sounds too "clean", as if recorded in a studio with professional equipment, with uniformly consistent quality.

Pindrop has also come across some exceptionally “smart” criminals who, to cover up these flaws, deliberately add noisy background sounds to cloud the other party’s judgment. One scammer they call “chicken man” always has roosters crowing in the background; another, a woman, plays the sound of a crying baby to try to win the other party’s sympathy.

So if you hear a persistent, odd noise on the other end of the line, be careful.

For conversations involving high-stakes transactions, Henry Ajder, director of the deepfake detection company Deeptrace, has a practical suggestion: consider using agreed code words during the conversation, or asking and answering a secret question at the start of the call.
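
One way to make the "secret question" idea harder to replay is a challenge-response check: the person receiving the call reads out a random number, and the caller answers with a short code derived from that number and a secret both sides agreed on in advance. The sketch below is only an illustration under that assumption; the function names and the six-digit format are hypothetical and are not described in the source.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> str:
    """Random six-digit number for the callee to read out to the caller."""
    return f"{secrets.randbelow(10**6):06d}"


def response_code(shared_secret: bytes, challenge: str) -> str:
    """Six-digit code derived from the pre-agreed secret and the challenge."""
    digest = hmac.new(shared_secret, challenge.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest[:4], "big") % 10**6
    return f"{number:06d}"


def verify_caller(shared_secret: bytes, challenge: str, spoken_code: str) -> bool:
    """Check the code the caller says against the expected one."""
    expected = response_code(shared_secret, challenge)
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, spoken_code)


if __name__ == "__main__":
    secret = b"agreed in person beforehand"  # hypothetical shared secret
    challenge = new_challenge()
    code = response_code(secret, challenge)  # computed on the caller's side
    print(challenge, code, verify_caller(secret, challenge, code))
```

Because the challenge changes on every call, a recording of an earlier conversation does not help an impostor, even one with a perfect copy of the voice.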

Given how fast AI is learning, I expect these clumsy flaws to be fixed one by one before long. A research paper once found that irregular blinking could be used to tell whether a video was a deepfake, but just a few months later developers had fixed that flaw.

For now, at least, people can still tell from small clues that the other party is not one of us. In the conversation between Rogan and Jobs, for example, the otherwise fluent dialogue is punctuated by an odd laugh, "Hehe, hehe", which comes out of nowhere and distorts the tone of voice.

This reminds me of Resemble, which offers some options after speech generation, such as adding pauses or emotions like "anger" and "joy" to a passage. Judging from user feedback, the model does not yet seem to handle emotion well.

But one day, we will doubt everything.

A few days ago, my colleague Xiao Yang received a sales call. He put it on speakerphone and eagerly debated with others in the office whether the caller was a robot.

Suddenly, a voice came from the other end of the line: "I'm sorry, the way I speak seems to have caused a misunderstanding."

"Do you believe this is a real person now?"

He replied: "Huh, I don't believe it. This must be one of the AI's tricks."

References

[1] https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019_08-3_Neupane_paper.pdf

[2] https://theconversation.com/deepfake-audio-has-a-tell-researchers-use-fluid-dynamics-to-spot-artificial-imposter-voices-189104

[3] https://www.nytimes.com/2020/05/06/magazine/val-kilmer.html

[4] https://www.yahoo.com/entertainment/val-kilmer-cancer-treatment-lost-voice-142401511.html

[5] https://www.hellomagazine.com/healthandbeauty/health-and-fitness/20210825120419/val-kilmer-heartbreaking-reveal-cancer-diagnosis/

[6] https://arstechnica.com/information-technology/2022/10/fake-joe-rogan-interviews-fake-steve-jobs-in-an-ai-powered-podcast/

[7] https://www.howtogeek.com/682865/audio-deepfakes-can-anyone-tell-if-they-are-fake/

[8] https://senseient.com/wp-content/uploads/Deepfakes-updated.pdf

[9] https://mitsloan.mit.edu/ideas-made-to-matter/deepfakes-explained

Author: Weng Yan

Editor: Lying insect

Guokr ( ID : Guokr42 )

Source : Guokr