Application of face stylization technology on mobile terminals

Preface

With the explosion of concepts such as the metaverse, digital humans, and virtual idols, all kinds of pan-entertainment applications built on digital collaboration and interaction are being put into practice. For example, in some games players become virtual artists and take part in the day-to-day work of real artists with a high degree of fidelity; in certain scenes their facial expressions are mapped closely onto the virtual artists to strengthen the sense of participation. "MO Magazine", launched jointly by AYAYI (the hyper-realistic digital human created by Alibaba Tmall) and Jing Boran, breaks the traditional flat reading experience and gives readers an immersive experience that blends the virtual and the real.

In these pan-entertainment scenarios, "people" must be the first consideration. However, manually designed digital and animated avatars are too "abstract", expensive to produce, and lack personalization. Therefore, for face digitization we developed a face stylization technology with good controllability, ID preservation, and stylization, enabling customized style switching of face images. This technology can not only serve as an effective way to create atmosphere and improve the viewing experience in entertainment scenarios such as live streaming and short videos, but also protect face privacy and add fun in image-and-text scenarios such as buyer shows. Going one step further, if different users gathered in a digital community and chatted and socialized through digital avatars rendered in that community's style (for example, fans of "Arcane" (Battle of the Two Cities) communicating in the metaverse with Arcane-style avatars), it would be a very immersive experience.

The animated series Arcane (Battle of the Two Cities)

The left picture shows the original AYAYI image, and the right picture shows the stylized image.

In order to apply face stylization technology to different pan-entertainment business scenarios such as live streaming, buyer shows, and seller shows, we achieved the following:

  1. Low-cost production of stylization editing models for different faces (all the effects shown in this article were achieved without any investment in design resources);
  2. Appropriate style editing, with style selection coordinated with design, product, and operations;
  3. The ability to balance the sense of face ID and the degree of stylization;
  4. Ensuring the generalization of the model so that it applies to different faces, angles, and scene environments;
  5. Reducing the model's demand for computing power while ensuring clarity and other aspects of quality.

Next, let's take a look at the demo, and then walk through our entire technical pipeline. Thanks to our product manager, Duofei~

Our overall algorithmic approach consists of three stages:

  1. Stage 1: Stylized data generation based on StyleGAN;
  2. Stage 2: Unsupervised image translation to generate paired images;
  3. Stage 3: Training a supervised mobile-side image translation model on the paired images.

Overall algorithm solution for face stylization editing

Of course, a two-stage solution is also possible: use StyleGAN to create image pairs and then directly train a supervised small model. However, adding an unsupervised image translation stage decouples stylized data production from paired-image data production. By optimizing the algorithms within each stage and the data passed between stages, combined with training a supervised small model for the mobile side, we can ultimately solve the problems of low-cost stylized model production, style editing and selection, balancing ID preservation against stylization, and lightweight model deployment.
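To make the flow of the three stages concrete, here is a minimal sketch of the pipeline in Python. All helper functions (finetune_stylegan2, train_unsupervised_translation, etc.) are hypothetical placeholders standing in for the components described in the following sections, not our actual production interfaces.

```python
# Minimal sketch of the three-stage pipeline; every function name below is a
# hypothetical placeholder, not a real production interface.

def build_face_stylization_model(real_faces, style_refs):
    # Stage 1: transfer-learn a StyleGAN2 generator on the style reference
    # images and sample a rich, high-resolution stylized dataset from it.
    style_gan = finetune_stylegan2(pretrained_ffhq_generator(), style_refs)
    style_dataset = sample_style_images(style_gan, num_samples=10_000)

    # Stage 2: train an unsupervised translator (U-GAT-IT-like) between real
    # faces and the stylized dataset, then run it over real faces to obtain
    # pixel-aligned (real, stylized) training pairs.
    translator = train_unsupervised_translation(real_faces, style_dataset)
    paired_data = [(x, translator(x)) for x in real_faces]

    # Stage 3: distill the effect into a lightweight supervised model that can
    # run in real time on mobile devices.
    mobile_model = train_supervised_mobile_model(paired_data)
    return mobile_model
```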

Data generation based on StyleGAN

The work of using the StyleGAN algorithm for data generation mainly focuses on solving three problems:

  1. Improving the richness and degree of stylization of the data generated by the model: for example, making generated CG faces more CG-like, and making the images richer in angles, expressions, hairstyles, etc.;
  2. Improving data generation efficiency: the generated data should have a high yield and a more controllable distribution;
  3. Style editing and selection: for example, modifying the eye size of a CG face.

Below we will expand on these three aspects.

▐ Richness and stylization

The first important problem in transfer learning based on StyleGAN2-ADA is the trade-off between the richness of the model and its degree of stylization. When transfer learning is performed on the style training set, the richness of the transferred model in facial expressions, face angles, and facial elements is limited by the richness of that training data; at the same time, as the number of transfer-training iterations grows and the degree of stylization improves (as measured by FID), the model's richness decreases. This makes the distribution of the stylized dataset later generated by the model too monotonous, which is not conducive to training U-GAT-IT.

To improve the richness of the model, we made the following improvements:

  1. Adjusting and optimizing the data distribution of the training dataset;
  2. Model fusion: because the source model is trained on a large amount of data, its generation space is very rich. By replacing the weights of the low-resolution layers of the transferred model with the weights of the corresponding layers of the source model to obtain a fused model, the distribution of the new model over large-scale elements/features can be kept consistent with the source model, so that it inherits the source model's richness on low-resolution features (see the sketch after this list);

Fusion method: directly swapping the parameters of different layers ("swap layer") easily causes inconsistencies in the generated image and bad cases in the details, whereas smooth model interpolation yields better results (the figures below were all generated by a model fused with the interpolation method).

  3. Constraining and optimizing the learning rates and features of different layers;
  4. Iterative optimization: manually screening newly produced data, adding it to the original stylized dataset to improve richness, and then iteratively training and optimizing until we obtain a model that generates data that is both rich and satisfactorily stylized.
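As a rough illustration of interpolation-based model fusion, the sketch below blends the low-resolution layers of a transferred StyleGAN2 generator back toward the source generator. The resolution_of helper and the layer-naming assumptions are hypothetical; blend values between 0 and 1 correspond to the smooth interpolation described above, while blend = 1.0 degenerates to a hard layer swap.

```python
import copy
import torch

def fuse_generators(src_G, styled_G, swap_below_res=32, blend=1.0):
    """Blend the low-resolution synthesis layers of a transferred StyleGAN2
    generator back toward the source (e.g. FFHQ) generator.

    blend = 1.0 reproduces a hard "swap layer"; values in (0, 1) give the
    smoother interpolation fusion described above."""
    fused = copy.deepcopy(styled_G)
    src_sd, fused_sd = src_G.state_dict(), fused.state_dict()

    for name, styled_param in fused_sd.items():
        res = resolution_of(name)  # hypothetical helper: parse the layer's resolution
        if res is not None and res <= swap_below_res:
            # Move this layer's weights toward the source model's weights.
            fused_sd[name] = torch.lerp(styled_param, src_sd[name], blend)

    fused.load_state_dict(fused_sd)
    return fused
```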

Original image, migration model, fusion model

▐ Data generation efficiency

If we have a rich StyleGAN2 model, how can we generate a style dataset with rich distribution? There are two ways:

  1. Randomly sample latent variables to generate random style datasets;
  2. Use StyleGAN inversion to input face data that conforms to a certain distribution and create a corresponding style dataset.

Method 1 can provide richer stylized data (especially the richness of the background), while method 2 can improve the effectiveness of generated data and provide a certain degree of distribution control, thereby improving the efficiency of stylized data production.
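A minimal sketch of the two generation methods, assuming a generator interface similar to NVIDIA's stylegan2-ada-pytorch (separate mapping/synthesis networks) and an off-the-shelf inversion encoder; names and parameters are illustrative.

```python
import torch

@torch.no_grad()
def generate_style_dataset(style_G, real_faces=None, inverter=None,
                           num_random=5000, truncation=0.7):
    """Sample a stylized dataset from a (fused) StyleGAN2 generator in two ways."""
    images = []

    # Method 1: randomly sample latent codes (richer backgrounds, less control).
    for _ in range(num_random):
        z = torch.randn(1, style_G.z_dim)
        w = style_G.mapping(z, None, truncation_psi=truncation)
        images.append(style_G.synthesis(w))

    # Method 2: invert real faces into latent space and re-synthesize with the
    # style generator (higher yield and a distribution controlled by the inputs).
    if real_faces is not None and inverter is not None:
        for face in real_faces:
            w_plus = inverter(face)  # e.g. an e4e/pSp-style encoder (assumption)
            images.append(style_G.synthesis(w_plus))

    return images
```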

Original image; the latent vector obtained by StyleGAN inversion is fed into the "advanced face style / animation style" StyleGAN2 generator to obtain the stylized image

▐ Style editing and selection

"The original style doesn't look good, so it can't be used."

"The style of a model after transfer training can't be changed."

No, no, no. Each model can not only be used to generate data, but can also be accumulated as a basic component and a basic capability. You can not only fine-tune and optimize the original style, but also create an entirely new style:

  1. Model fusion: by fusing multiple models, setting different fusion parameters/numbers of layers, and using different fusion methods, an unsatisfactory style model can be optimized and its style adjusted.
  2. Model nesting: models of different styles are connected in series, so that the final output style carries some style features of the intermediate models, such as facial proportions and color tones (see the sketch below).

Fine-tune the comic style during the fusion process (pupil color, lips, skin tone, etc.)

Through style creation and fine-tuning, models of different styles can be obtained, enabling the production of face data in different styles.
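One possible reading of model nesting is to chain inversion and synthesis through several style generators in turn. The sketch below is only an illustration of that idea and assumes an encoder-based inverter that behaves reasonably on already-stylized images.

```python
import torch

@torch.no_grad()
def nested_stylize(image, inverter, generators):
    """Chain several style generators in series ("model nesting"): each stage
    inverts the current image and re-synthesizes it with the next style model,
    so the final output inherits features (facial proportions, color tones)
    from the intermediate styles."""
    current = image
    for G in generators:
        w = inverter(current)       # hypothetical encoder-based inversion
        current = G.synthesis(w)    # assumes a StyleGAN2-style synthesis network
    return current
```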

Through transfer learning, style editing and optimization, and data generation based on StyleGAN, we obtain our first pot of gold: a stylized dataset with high richness, 1024×1024 resolution, and a selectable style.

Paired Data Creation Based on Unsupervised Image Translation

Unsupervised image translation technology can transform images from one domain to another domain by learning the mapping relationship between two domains, thus providing the possibility of making image pairs. For example, the well-known CycleGAN in this field has the following structure:

CycleGAN main framework
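For reference, here is a condensed sketch of the CycleGAN generator objective in its least-squares GAN form: adversarial terms for both translation directions plus the cycle-consistency loss that makes unpaired training possible. Identity loss and the discriminator updates are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G_AB, G_BA, D_A, D_B, real_A, real_B, lambda_cyc=10.0):
    """Core CycleGAN generator objective (LSGAN variant)."""
    fake_B = G_AB(real_A)          # real face -> style domain
    fake_A = G_BA(real_B)          # style image -> face domain

    # Least-squares adversarial losses for both mapping directions.
    adv = F.mse_loss(D_B(fake_B), torch.ones_like(D_B(fake_B))) + \
          F.mse_loss(D_A(fake_A), torch.ones_like(D_A(fake_A)))

    # Cycle consistency: A -> B -> A and B -> A -> B should reconstruct the input.
    cyc = F.l1_loss(G_BA(fake_B), real_A) + F.l1_loss(G_AB(fake_A), real_B)

    return adv + lambda_cyc * cyc
```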

When discussing "model richness" above, we mentioned:

This (low richness) will make the distribution of stylized datasets generated by subsequent application models too monotonous, which is not conducive to the training of U-GAT-IT.

Why is this? Because the CycleGAN framework requires that the data of the two domains basically conform to a bijective relationship; otherwise semantic information is easily lost when translating from one domain to the other. However, images generated by StyleGAN2 inversion have a problem: most of the background information is lost and replaced by a simple, blurred background (some recent papers greatly alleviate this, such as Tencent AI Lab's High-Fidelity GAN Inversion). If U-GAT-IT is trained directly on such a stylized dataset together with a real-face dataset, the backgrounds of the generated counterparts easily lose a lot of semantic information, making it hard to form effective image pairs.

Therefore, we proposed two ways of improving U-GAT-IT to keep the background fixed: a Region U-GAT-IT variant that adds background constraints, and a Mask U-GAT-IT variant that adds a mask branch. The two methods differ in how strongly they weigh ID preservation against stylization, and together with hyperparameter tuning they give us a control space over ID and stylization. We also further improved the generation quality through changes to the network structure, model EMA, edge enhancement, and other means.
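As an illustration of the background-constraint idea (not the exact Region/Mask U-GAT-IT formulation), one can add a loss that penalizes any change outside the face region, with the mask coming from a face-parsing model; the weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def background_preservation_loss(real_img, fake_img, face_mask, weight=10.0):
    """Illustrative background constraint: penalize changes outside the face
    region so the translated image keeps the original background.
    face_mask is 1 inside the face region and 0 elsewhere (e.g. produced by a
    face-parsing model)."""
    background = 1.0 - face_mask
    return weight * F.l1_loss(fake_img * background, real_img * background)
```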

The left image is the original; the middle and right images are results of unsupervised image translation, differing in how the algorithm balances the sense of ID against the degree of stylization.

Finally, the trained generative model is used to run inference over the real-face image dataset, translating it to obtain the corresponding paired stylized dataset.
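The pairing step itself is straightforward; a minimal sketch, assuming a PyTorch DataLoader with batch size 1 over aligned real-face crops:

```python
import torch
from torchvision.utils import save_image

@torch.no_grad()
def build_paired_dataset(translator, real_face_loader, out_dir="pairs"):
    """Run the trained unsupervised translator over a real-face dataset and
    save aligned (input, output) pairs for supervised training.
    Assumes real_face_loader yields one image tensor per batch."""
    translator.eval()
    for i, real in enumerate(real_face_loader):
        stylized = translator(real)
        save_image(real, f"{out_dir}/{i:06d}_real.png")
        save_image(stylized, f"{out_dir}/{i:06d}_style.png")
```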

Supervised Image Translation

Based on a study of the computational efficiency of different operators and modules under MNN on mobile devices, we designed the mobile-side model structure and graded the model's computational load. Combining improvements drawn from CartoonGAN, AnimeGAN, pix2pix, and related work, we finally obtained a lightweight, high-definition, and strongly stylized mobile-side model:

| Model | Clarity ↑ | FID ↓ |
| --- | --- | --- |
| Pixel-wise loss | 3.44 | 32.53 |
| + Perceptual loss + GAN loss | 6.03 | 8.36 |
| + Edge-promoting | 6.24 | 8.09 |
| + Data augmentation | 6.57 | 8.26 |

* Clarity is measured as the sum of Laplacian gradient values.
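A hedged sketch of how the loss terms in the ablation above could be combined for the supervised student model; the weights and the LSGAN-style adversarial term are assumptions, and edge-promoting and data augmentation act on the training data rather than on this loss.

```python
import torch
import torch.nn.functional as F

def supervised_translation_loss(G, D, vgg_features, real_face, target_style,
                                lambda_pix=10.0, lambda_perc=1.0):
    """Illustrative combined objective for the supervised mobile model:
    pixel-wise loss + perceptual (VGG feature) loss + adversarial loss."""
    fake = G(real_face)

    pixel_loss = F.l1_loss(fake, target_style)
    perceptual_loss = F.l1_loss(vgg_features(fake), vgg_features(target_style))
    gan_loss = F.mse_loss(D(fake), torch.ones_like(D(fake)))  # LSGAN-style term

    return lambda_pix * pixel_loss + lambda_perc * perceptual_loss + gan_loss
```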

Overall training framework of supervised image translation model

This achieves a real-time face stylization effect on mobile devices:
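For deployment, a common route is to export the lightweight generator to ONNX and then convert it with MNN's model converter tools; the sketch below covers only the ONNX export, and the input size and opset version are illustrative assumptions.

```python
import torch

def export_for_mobile(mobile_G, onnx_path="face_stylizer.onnx", size=256):
    """Export the trained lightweight generator to ONNX; the ONNX file can then
    be converted to an MNN model for on-device inference."""
    mobile_G.eval()
    dummy = torch.randn(1, 3, size, size)  # assumed input resolution
    torch.onnx.export(
        mobile_G, dummy, onnx_path,
        input_names=["face"], output_names=["stylized"],
        opset_version=11,
    )
```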

Outlook

  1. Optimize the dataset: image data from more angles, quality optimization;
  2. Optimization, improvement, and redesign of the overall pipeline;
  3. Better data generation: StyleGAN3, inversion algorithms, model fusion, style editing/creation, few-shot learning;
  4. Unsupervised two-domain translation: use highly matched generated data pairs for semi-supervision, and optimize the generator structure (for example, by introducing Fourier convolutions);
  5. Supervised two-domain translation: vid2vid, improved inter-frame stability, optimization for extreme scenes, and stability of details;
  6. Full-image stylization/digital creation: Disco Diffusion, DALL·E 2, style transfer.
