Ten years after my grandpa passed away, I used AI to "resurrect" him

I used my grandfather’s written records and audio-visual materials, and integrated several mature AI technologies to “resurrect” him.

One day, on a sudden impulse, I searched online for "using AI to resurrect the deceased" and came across the story of Joshua "resurrecting" his fiancée Jessica.

In 2012, while waiting for a liver transplant, Jessica's condition suddenly worsened; doctors tried to save her, but she died. Joshua happened to be away at the time and missed her final moments, and he blamed himself for eight years. Then, in 2020, he came across "Project December", a website that asks users to fill in "sample sentences" and a "character description" to generate a customized chat AI.

Joshua imported text messages his late fiancée had sent into the website, then began to describe Jessica: born in 1989, a free-spirited Libra... and particularly superstitious...

Joshua and "Jessica" start chatting丨sfchronicle.com

After the page refreshed, "Jessica" was ready. She could answer all of Joshua's questions and even described herself as "talking with her hands on her face." Joshua said, "My mind tells me this is not the real Jessica, but feelings cannot be controlled by reason." He lost track of how long they chatted; at some point he burst into tears and fell asleep.

I understand that kind of irreparable regret all too well. Ten years ago, when my grandpa was dying, I rushed out of my high school to see him, only to be sent straight back to school afterwards - that was the last time I ever saw him. Every time I think of it, I get a lump in my throat. I so badly want to see him again and exchange a few more words.

I am now a programmer who deals with AI and algorithms every day, and I couldn't help starting to wonder: could today's AI technologies be stitched together into something extremely close to my grandfather, both in the way he spoke and in the way he looked? So I started searching, found many people who shared the same wish, and found some who had actually put it into practice.

In a VR film, a Korean mother meets her daughter, who had passed away three years earlier | MBC Korea

A Korean mother was in great pain after her seven-year-old daughter passed away. Hearing her story, a TV team spent eight months building a three-dimensional virtual likeness of the girl, letting mother and daughter meet in a VR scene. To my eyes, though, this is closer to animation production: the girl and the scene look rather "cartoon-like", and the virtual girl cannot interact with people in any intelligent way - she can only follow a fixed script.

Others want a tangible "body" and commission companies to scan a person's three-dimensional features and build a silicone android. But this kind of customization is extremely expensive, and someone who has already been laid to rest can no longer provide body-scan data.

The aforementioned Project December can only create a text chatbot. I wanted a "grandfather" with a concrete, visible image - ideally a realistic one.

"He has memories, he can interact with me, he can speak, and his face looks like my grandfather's." This bold idea grew clearer and clearer, and I began searching for AI papers that might be useful.

First, give "grandfather" a brain

The reason Project December can generate characters with specific personalities from seed text is that it is connected to the GPT-3 API. GPT-3 is OpenAI's commercial language model; roughly speaking, it gives computers the ability to "think like humans."

GPT-3 can even come out with lines that sound "better than human":

Human: What is the purpose of life?

AI: Life is a beautiful miracle. It evolves over time into a greater form of beauty. In a sense, the purpose of life is to increase this beauty in the universe.

It has this capability because engineers fed the model enormous amounts of data - more than 300 billion tokens of text. After reading that much, the model begins to mine (that is, find patterns in) the relationships between words and sentences, and then produces the most plausible continuation for the current context.
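To make "guessing the next word" concrete, here is a minimal sketch using the open-source Hugging Face transformers library. The small "gpt2" checkpoint stands in for GPT-3 (which is only reachable through OpenAI's paid API), and the prompt is just an example.

```python
# Minimal sketch of next-word prediction with an open-source causal language model.
# "gpt2" is a small stand-in for GPT-3, which is only available via OpenAI's paid API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What is the purpose of life?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # a score for every word in the vocabulary, at every position
next_token_id = int(logits[0, -1].argmax())   # the highest-scoring word to follow the prompt
print(tokenizer.decode(next_token_id))        # the model's single best guess for the next word
```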

I imported my grandfather's text records into the GPT model | Guokr illustration

I started preparing the seed text to feed in: I scanned the letters I had kept into digital text, sorted out the chat messages synced to the cloud, and transcribed the things my grandfather had said on video: "This fish should have been braised. Over eighty yuan, and they steamed it - it tastes 'light' (Hangzhou dialect for bland), no flavor at all." "Stop taking pictures with your phone and help your brother serve the dishes."

Fed this text, GPT-3 could start imitating my grandfather's language style and train of thought... Wait - GPT-3 is a paid service. Fortunately, I quickly found the free, open-source GPT-J and started training.

Training a language model is essentially a game of "guessing the next word." Using the parallel computing power of graphics cards, the model works out the relationships between the words and sentences in a corpus - for example, which word is most likely to follow a given one. The GPT-J team has open-sourced their pre-trained model, which already does most of the work. All I had to do was tokenize the seed text into word units and hand this grandfather-specific corpus to GPT-J to learn from, roughly as in the sketch below.
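The author fine-tuned the official GPT-J release; the sketch below shows the same idea with the Hugging Face Trainer API. The corpus file name, the hyperparameters, and the small "EleutherAI/gpt-neo-125m" stand-in checkpoint (chosen so the example fits on one ordinary GPU) are all my assumptions, not the original setup.

```python
# Hedged sketch: fine-tune a GPT-style model on a personal corpus.
# "grandpa_corpus.txt" and all hyperparameters are placeholders; the small
# GPT-Neo checkpoint stands in for GPT-J-6B so the example fits on one GPU.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # GPT-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One seed-text snippet per line: letters, chat logs, transcribed video clips.
raw = load_dataset("text", data_files={"train": "grandpa_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="grandpa-gpt",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # plain next-word objective
)
trainer.train()   # "guessing the next word" over grandpa's corpus
```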

Training a deep learning model from scratch usually takes days and nights. Fine-tuning GPT-J on the new corpus was far less demanding: it took only six hours.

Six hours later, I quietly typed "hello" on the screen.

Let Grandpa Speak

"Hello, grandson."

The AI "grandfather" started chatting with me. After a few brief text exchanges, I thought of TTS (text-to-speech), a very mature technology: the voice prompts in navigation apps and the read-aloud narration in short-video apps all rely on TTS.

In principle, I just had to take "grandfather's" replies, add an audio clip carrying grandpa's voice and intonation, and feed both to a TTS model to learn from. The final output would be the machine reading out "grandpa's" replies, in his own accent.

I found Tacotron 2, a TTS model from Google. During training it pairs input text with the matching audio and digs out the hidden mapping between the two; afterwards it can turn plain text into pure speech output.

Tacotron 2 is an end-to-end model: I don't have to worry about the encoder, decoder, attention layers, post-processing, or any of the other structures inside it - everything is integrated. To me it is a tool that "generates" results in one click. I just have to type in some text and... Just as I was about to start, I spotted the problem: the model only offers a few preset announcer voices and does not support a voice of your own choosing.
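For reference, the "one-click" usage looks roughly like this. It is adapted from NVIDIA's published PyTorch Hub example for Tacotron 2 plus the WaveGlow vocoder; the hub entry-point names and the example sentence follow that published pattern but should be checked against the current docs, and the resulting voice is a stock announcer, which is exactly the limitation above.

```python
# Roughly NVIDIA's published PyTorch Hub usage for Tacotron 2 + WaveGlow:
# text in, a wav file out - but only in the stock announcer's voice.
import torch
from scipy.io.wavfile import write

hub = 'NVIDIA/DeepLearningExamples:torchhub'
tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2', model_math='fp16').to('cuda').eval()
waveglow = torch.hub.load(hub, 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
utils = torch.hub.load(hub, 'nvidia_tts_utils')

sequences, lengths = utils.prepare_input_sequence(["Hello, I have missed you so much."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)   # text -> mel spectrogram
    audio = waveglow.infer(mel)                       # mel spectrogram -> waveform

write("tts_output.wav", 22050, audio[0].cpu().numpy())  # these checkpoints run at 22.05 kHz
```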

That was when I thought of "voice cloning," which builds on Tacotron and layers "transfer learning" on top - a model that used to do only one job learns to adapt and take on a new one. It can swap the preset announcer's voice directly for my grandfather's, effectively cloning his voice.

After some searching, I found a voice-cloning project called "MockingBird," which synthesizes Chinese speech directly from text in a voice I choose: given as little as five seconds of any Chinese voice, it can clone that voice and use it to speak new content.
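MockingBird is derived from the Real-Time-Voice-Cloning codebase and follows the same three-stage pipeline: a speaker encoder turns the five-second reference clip into a voice "fingerprint," a synthesizer turns text plus that fingerprint into a mel spectrogram, and a vocoder turns the spectrogram into audio. The sketch below follows that codebase's calling pattern; the module paths, checkpoint names, and file names are assumptions and may differ in MockingBird itself.

```python
# Sketch of the three-stage voice-cloning pipeline (speaker encoder -> synthesizer -> vocoder),
# following the Real-Time-Voice-Cloning calling pattern that MockingBird is derived from.
# Checkpoint paths and file names are placeholders.
from pathlib import Path
import soundfile as sf
from encoder import inference as encoder          # speaker encoder: clip -> voice "fingerprint"
from synthesizer.inference import Synthesizer     # text + fingerprint -> mel spectrogram
from vocoder import inference as vocoder          # mel spectrogram -> waveform

encoder.load_model(Path("encoder.pt"))
synthesizer = Synthesizer(Path("synthesizer.pt"))
vocoder.load_model(Path("vocoder.pt"))

# About five seconds of grandpa's real voice is enough to extract his vocal fingerprint.
reference = encoder.preprocess_wav(Path("grandpa_5s_clip.wav"))
embed = encoder.embed_utterance(reference)

# Whatever the language model replied, spoken in the cloned voice.
specs = synthesizer.synthesize_spectrograms(["这条鱼应该红烧的"], [embed])
wav = vocoder.infer_waveform(specs[0])
sf.write("grandpa_reply.wav", wav, synthesizer.sample_rate)
```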

"Grandpa" reads out the text he outputs in his own voice丨Guokr Drawing

The moment I heard "Grandpa" speak, I felt the scattered pieces of my memory being fitted back together, one by one.

Excited, I moved on to preparing "Grandpa's" appearance. I work as an image-algorithm engineer and am fairly at home with image technology, but my professional instinct also told me that the face-generation step ahead would not be so easy.

Driving the face with voice

The most direct way to make my grandfather "appear" would be to build a customized three-dimensional virtual portrait, but that requires collecting body-scan data points from the person - an approach that is obviously no longer feasible.

Looking at the photos, voice recordings and video material I had on hand, I began to wonder: is it possible to generate a lifelike human face from nothing more than a video clip and a stream of audio?

After many twists and turns, I found Neural Voice Puppetry, a "facial reenactment" technique: give it the dialogue audio, and it generates an animation of a face whose mouth movements are synchronized with that audio.

The authors used convolutional neural networks to learn the relationship between facial appearance, the rendering of facial expressions, and speech, and then used that learned relationship to render, frame by frame, a face video that "reads out" the audio. The one drawback of this approach is that you cannot specify your own output character; you can only pick from the given ones, such as Obama.
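Conceptually, the pipeline splits into three stages: extract speech features from the audio, map them to per-frame face and mouth expression parameters, and neurally render those parameters as photorealistic frames of the chosen target. The function names below are hypothetical stand-ins for the paper's components (it uses DeepSpeech-style audio features, an audio-to-expression network, and a neural renderer); the sketch only illustrates the data flow.

```python
# Structural sketch of the Neural Voice Puppetry idea. The three functions are
# hypothetical stand-ins for the paper's components; only the data flow is real.
import numpy as np

def audio_to_features(wav: np.ndarray, sr: int) -> np.ndarray:
    """Speech feature extraction (the paper uses DeepSpeech-style features), one vector per video frame."""
    raise NotImplementedError

def features_to_expressions(features: np.ndarray) -> np.ndarray:
    """Audio-to-expression network: speech features -> per-frame 3D face/mouth expression coefficients."""
    raise NotImplementedError

def render_frames(expressions: np.ndarray, target: str) -> list:
    """Neural renderer: expression coefficients -> photorealistic frames of a fixed target person."""
    raise NotImplementedError

# Data flow: grandpa's synthesized reply -> a lip-synced video of the target face, e.g.
#   wav, sr = soundfile.read("grandpa_reply.wav")
#   frames = render_frames(features_to_expressions(audio_to_features(wav, sr)), target="obama")
```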

Only after finishing it did I realize that I still had to swap the face.

So what I actually had was a video of Obama speaking in my grandfather's voice. The next step was to use AI to swap the faces.

I finally settled on the technique described in the paper HeadOn: Real-time Reenactment of Human Portrait Videos. A well-known application of it is the virtual anchor: capturing a real person's expressions in video and using them to drive an animated character's face.

The expressions are usually supplied by a live person, but because the "Obama" I had generated was realistic enough, I could use it directly to drive my grandfather's portrait.

In this way, using my grandfather's written records and a handful of audio and video materials from before his death, and stitching together several mature AI technologies, I "resurrected" him.

Because the whole process runs model to model - the result of model A becomes the input of model B, and the output of model B feeds model C - generating a single result takes several minutes or longer. A live video conversation with "grandfather" is therefore impossible; it is more like I say something, the computer grinds away for a while, and he replies with a short video clip, roughly as in the sketch below.
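Put together, the chain looks something like this. Each function wraps one of the models described above; the names are mine, not the projects' real APIs, and the bodies are left out - the point is only to show why every reply has to wait for the previous stage's full output.

```python
# Hedged sketch of the overall chain. Each stub wraps one model from the article;
# names are illustrative, not the projects' real APIs. Every stage consumes the
# previous stage's complete output, which is why a single reply takes minutes.

def grandpa_reply(my_message: str) -> str:
    """Model A - fine-tuned GPT-J: my text in, grandpa-styled text out."""
    raise NotImplementedError

def text_to_cloned_speech(reply: str) -> str:
    """Model B - MockingBird voice cloning: reply text -> wav file in grandpa's voice."""
    raise NotImplementedError

def speech_to_talking_face(wav_path: str) -> str:
    """Model C - Neural Voice Puppetry: wav -> lip-synced video of the stand-in face."""
    raise NotImplementedError

def reenact_portrait(driver_video_path: str) -> str:
    """Model D - HeadOn-style reenactment: drive grandpa's portrait with the stand-in video."""
    raise NotImplementedError

def chat_once(my_message: str) -> str:
    reply = grandpa_reply(my_message)            # A
    wav = text_to_cloned_speech(reply)           # B, fed by A
    driver = speech_to_talking_face(wav)         # C, fed by B
    return reenact_portrait(driver)              # D, fed by C -> the short "video clip" reply
```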

My "grandfather" is all about calculation formulas

When I saw the "grandfather" on the screen who was both familiar and unfamiliar, my thoughts began to waver.

Technology has become powerful enough that I can "resurrect" the deceased by combining the results of a few AI papers, yet I can still tell, instantly, the difference between my grandfather and "grandfather." The latter has no way to understand human emotion; his responses and his empathy are merely simulated. A computer can give the answers humans want without understanding the questions at all.

I can greet the person on the screen and tell him how things have been lately, but he has no memories, and we are just two strangers making small talk. Clearly, this is not the grandfather who grumbled that "the fish tastes bland."

Perhaps in the future, people whose bodies have withered will be able to retrieve their memories and back up their consciousness, or live on in a virtual world like the Matrix. Only then might we escape the separation of life and death.

Photo by Compare Fiber on Unsplash

To keep operating costs down, Project December gives each chat AI a budget of credits, and those credits work like the AI's lifespan. When "Jessica" was close to running out, Joshua broke off their conversations himself - he did not want to watch her go through a second death.

In the months after "Jessica" came along, Joshua said, his eight years of guilt seemed to be melting away. I feel the same.

Neither resurrection nor holding on is truly possible, but after chatting with this "emotional" AI, and even meeting him face to face, I feel - emotionally, at least - that my grandfather and I have finally had a proper farewell.

References

[1] https://www.sfchronicle.com/projects/2021/jessica-simulation-artificial-intelligence/

[2] https://slate.com/technology/2020/05/meeting-you-virtual-reality-documentary-mbc.html

[3] https://link.springer.com/article/10.1007/s11023-020-09548-1

[4] https://github.com/minnershubs/MockingBird-V.5.0-VOICE-CLONER

[5] https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b

[6] https://arxiv.org/pdf/1912.05566.pdf

[7] https://arxiv.org/pdf/1805.11729.pdf

Author: Yu Jialin

Editor: biu

Illustration: Chen Qi

This article comes from Guokr and may not be reproduced without permission.

If necessary, please contact [email protected]
