One morning in 2014, Val Kilmer woke up in a pool of blood. The only other sign of trouble was a lump in his throat that made it hard to swallow. He was soon diagnosed with laryngeal cancer, which had caused him to vomit blood. His treatment included a tracheotomy. The surgery left a hole in his throat, and he needed a tube to eat. From then on, "Breathe or eat?" became a binary question.

[Image: Val Kilmer's rivalry with Tom Cruise in Top Gun | Source: Looper]

A more serious consequence for the Hollywood actor, who played Batman in the 1995 film, is that he lost his voice. Now, when he tries to speak, he can only make a sound between "a squeak and a growl".

Last year, Kilmer decided to work with the artificial intelligence company Sonantic to restore his "ability to speak." Working from limited recordings, they successfully cloned a voice very similar to the one Kilmer had before he became ill, and it will be able to speak on his behalf in the future.

AI synthesis of human voices is already a very mature technology. Some mainstream platforms that are open for free testing, such as Resemble AI and Descript, only require you to record 25 sentences or 10 minutes of audio to clone your voice. Of course, the longer the training set, the more the model will sound like you. The minimum requirement? 3.7 seconds is enough.

Besides serving patients like Kilmer, voice cloning has another major use: "resurrecting old friends", whether they are deceased relatives or deceased celebrities. Not long ago, the speech synthesis company Play.ht released a podcast episode in which Steve Jobs talked with the famous podcast host Joe Rogan. The text and the voices were all synthesized by AI.

The "fakes" chat and laugh throughout the podcast, and no real person needs to take part at any point. Does this involve copyright infringement? For those who have passed away in particular, who owns their voices? Can anyone use them?
The more difficult question is how to tell the real from the fake. What happens when a good technology is used for fraud?

Don't be so sure you can tell a human voice from an AI one.

In March 2019, an employee of a British energy company received a call from his boss, asking him to transfer 220,000 euros to a supplier in Hungary within an hour. The "boss" on the other end of the phone had a slight German accent and sounded exactly like his real boss. He made the transfer immediately, without any doubt. The money was then quickly moved on to Mexico and was difficult to recover.

In 2020, a bank manager in Hong Kong was deceived by a cloned voice and approved a transfer of 35 million US dollars to the scammers.

This is becoming more and more common. A VMware survey this year showed that two-thirds of the companies surveyed said that the fraud attacks they received in the past year had audio or video forgery elements.

When they hear a familiar voice on the phone, most people "haven't built up the muscle memory to really deal with it," said Lisa O'Connor, managing director at Accenture Security.

Physiologically, the human brain is easily fooled by fake voices. A 2019 UC Riverside study found clear differences in people's brain scans when they looked at an authentic and a fake Rembrandt painting; the same was not true when they listened to Morgan Freeman, a robot Freeman, and an impersonator speak. "The results suggest that humans may be inherently unable to distinguish between real and non-real sounds."

[Image: There is no significant difference in human brain activity when listening to real and synthetic voices | Source: Paper illustration]

Is AI unbeatable at faking human voices? Scientists are working on countermeasures. In a recent study, researchers at the University of Florida discovered a flaw in the machine: it has no vocal tract.
In other words, human speech is constrained by the structure of each person's vocal tract, and AI has no such "constraints".

For decades, scientists have tried to recreate the sounds of prehistoric creatures. What would the roars and cries of mammoths, dinosaurs, and other animals have sounded like? The shape of the bones provides many clues. Parasaurolophus, for example, had a long cavity in its skull, which scientists use to estimate its resonant frequency.

The same is true of human voice production: the vocal tract, vocal cords, tongue, and lips work together to squeeze air and to produce and shape sounds.

Using acoustic and fluid dynamics models, researchers can infer what vocal tract shape would produce a given sound. It usually looks something like this: an irregular pathway with bumps and valleys.

[Image: The degree of opening of the mouth determines the sound we make | Source: Screenshot of the paper]

However, when they fed machine-generated sounds into the same model, something strange happened:

[Image: The red circle shows the machine's "vocal tract structure" | Source: Screenshot of the paper]

The vocal tract inferred for the robot voice is like a long, thin straw, completely different from the normal structure of the human body. Just by looking at the side-view anatomical diagram, you can almost immediately tell whether it is a human voice or a machine voice. Using this method, they tested 4,966 audio segments with an accuracy rate of 99.9%.

It is easy to imagine this soon becoming a standard feature. When you answer a call, an additional plug-in will start running at the same time to determine whether the voice on the other end is a real person or machine-synthesized, and then warn you.

Many people are already working on this. In 2019, in order to combat cloned voices and fake audio, Google released a synthetic speech database to promote research on fake audio detection.
The database contains thousands of phrases "spoken" by Google's deep learning models, using 68 different voices to cover a variety of accents, in the hope of encouraging the outside world to develop more voice authentication solutions.

Without the scientists' tools, what can we do on our own? There are some tips, but they rely mostly on intuition.

Pindrop, a voice authentication service company, has been developing synthetic voices, and in the process it has also discovered some flaws in the machine:

- It is not good at handling fricative sounds, such as f, s, v, and z, because the software has difficulty distinguishing them from noise.
- If the speaker tends to drag out sounds, the algorithm has difficulty distinguishing the end of a word from the background noise in the recording, which may cause problems with sentence breaks.
- It is too "clean", as if recorded in a studio with professional equipment, and the quality is uniformly consistent.

Pindrop has also found some exceptionally "smart" criminals who, to cover up these flaws, deliberately add noisy ambient sounds to interfere with the other party's judgment. One scammer they call "chicken man" always plays the sound of roosters in the background; another, a woman, uses the sound of a crying baby as background noise to try to gain the other party's sympathy.

So if you hear a constant strange noise coming from the other end, be careful.

For conversations involving high-stakes transactions, Henry Ajder, director of deepfake detection company Deeptrace, has a practical suggestion: consider conducting the conversation using a code word, or asking and answering a secret question at the beginning of the call.

Given how fast AI is learning, I believe these clumsy flaws will soon be fixed one by one. A research paper once found that irregular blinking can be used to determine whether a video is a deepfake. Just a few months later, developers had fixed that flaw.
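The secret-question suggestion above can be made concrete with a small sketch. It is only an illustration of the pattern (agree on a phrase in advance, then check it before acting on a high-value request) and not a description of any tool from Deeptrace or Pindrop. The function names, the 10,000-euro threshold, and the sample phrase are all hypothetical.

```python
import hmac
import unicodedata


def normalize(answer: str) -> bytes:
    """Casefold and collapse whitespace so that trivial differences in how
    the phrase is written down do not cause a false rejection."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    return " ".join(text.split()).encode("utf-8")


def caller_is_verified(expected_phrase: str, given_phrase: str) -> bool:
    """Compare the pre-agreed phrase with what the caller actually said.

    hmac.compare_digest performs a constant-time comparison, which is the
    usual habit when checking secrets.
    """
    return hmac.compare_digest(normalize(expected_phrase), normalize(given_phrase))


def may_approve_transfer(amount_eur: float, expected_phrase: str,
                         given_phrase: str, limit_eur: float = 10_000) -> bool:
    """Hypothetical policy: small transfers pass, large ones need the phrase."""
    if amount_eur < limit_eur:
        return True
    return caller_is_verified(expected_phrase, given_phrase)


if __name__ == "__main__":
    secret = "blue heron at noon"  # hypothetical phrase agreed in person
    print(may_approve_transfer(220_000, secret, "Blue  Heron at Noon"))
    print(may_approve_transfer(220_000, secret, "just send it, I'm in a hurry"))
```

The point of the design is that a cloned voice alone is not enough: the check depends on something the caller knows, not on how the caller sounds.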
For now, at least, humans can still tell from small clues that the party on the other end is not one of us. In the conversation between Rogan and Jobs, for example, a weird laugh keeps cutting into the otherwise fluent dialogue, "Hehe, hehe", which is very abrupt and distorts the tone of voice.

This reminds me of Resemble, which gives you some options after speech generation, such as adding pauses or emotions like "anger" and "joy" to a paragraph. Judging from the feedback, the model does not yet seem able to handle emotions well.

But one day, we will doubt everything.

A few days ago, my colleague Xiao Yang received a sales call. He turned on the speakerphone and enthusiastically discussed with other people in the office whether the other party was a robot. Suddenly, a voice came from the other end of the phone:

"I'm sorry, you misunderstood the way I spoke."

"Do you believe this is a real person?"

He replied: "Huh, I don't believe it, this must be a trick of AI."

References

[1] https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019_08-3_Neupane_paper.pdf
[2] https://theconversation.com/deepfake-audio-has-a-tell-researchers-use-fluid-dynamics-to-spot-artificial-imposter-voices-189104
[3] https://www.nytimes.com/2020/05/06/magazine/val-kilmer.html
[4] https://www.yahoo.com/entertainment/val-kilmer-cancer-treatment-lost-voice-142401511.html
[5] https://www.hellomagazine.com/healthandbeauty/health-and-fitness/20210825120419/val-kilmer-heartbreaking-reveal-cancer-diagnosis/
[7] https://www.howtogeek.com/682865/audio-deepfakes-can-anyone-tell-if-they-are-fake/
[8] https://senseient.com/wp-content/uploads/Deepfakes-updated.pdf
[9] https://mitsloan.mit.edu/ideas-made-to-matter/deepfakes-explained

Author: Weng Yan
Editor: Lying insect
Source: Guokr