Preface

With the explosion of concepts such as the metaverse, digital humans, and virtual personas, all kinds of pan-entertainment applications built on digital collaboration and interaction are being put into practice. For example, in some games players become virtual idols and take part, with a high degree of fidelity, in the daily work of real performers; in certain scenes their facial expressions and other cues are mapped directly onto the virtual performer to heighten the sense of participation. "MO Magazine", jointly fronted by AYAYI, the hyper-realistic digital human launched by Alibaba Tmall, and Jing Boran, breaks with the traditional flat reading experience and gives readers an immersive blend of the virtual and the real.

In these pan-entertainment scenarios, "people" must be the first consideration. Hand-designed digital and animated avatars, however, are too "abstract", expensive, and lacking in personalization. For the digitization of faces, we therefore developed a face stylization technology with good controllability, identity preservation, and stylization, enabling customized style switching for face images. Beyond serving as an effective way to build atmosphere and improve the viewing experience in entertainment scenarios such as live streams and short videos, it can also protect facial privacy and add fun in image-and-text scenarios such as buyer shows. Imagine, further, different users gathering in a digital community and socializing with avatars rendered in that community's style (say, fans of "Battle of the Twin Cities" chatting in the metaverse as characters in the show's style): a very immersive experience.

"Battle of the Twin Cities" animation
Left: the original AYAYI image; right: the stylized image.

To bring face stylization technology to pan-entertainment business scenarios such as our live streaming, buyer shows, and seller shows, we did the work described below.
Next, let's look at the demo, and then walk through our entire technical pipeline. (Thanks to our product manager, Duofei~)

Our overall algorithmic approach has three stages:

1. Data generation based on StyleGAN;
2. Paired data creation based on unsupervised image translation;
3. Supervised image translation to train a lightweight on-device model.
Overall algorithm solution for face stylization editing

Of course, a two-stage solution is also possible: have StyleGAN create the image pairs, then directly train a supervised small model. Adding an unsupervised image translation stage, however, decouples the task of stylized data production from the task of paired-image data production. By optimizing the algorithms within each stage and the data handed between stages, combined with training a supervised small model for mobile, we can ultimately solve low-cost stylized model production, style editing and selection, the balance between identity preservation and stylization, and lightweight model deployment.

Data generation based on StyleGAN

Using the StyleGAN algorithm for data generation mainly involves solving three problems: the trade-off between model richness and stylization, data generation efficiency, and style editing and selection.
Below we expand on these three aspects.

▐ Richness and stylization

The first major problem in transfer learning based on StyleGAN2-ADA is the trade-off between the model's richness and its degree of stylization. When transferring on a style training set, the richness of the transferred model in facial expressions, face angles, and facial elements is limited by the richness of that training data; at the same time, as transfer training iterates and the model becomes more strongly stylized (e.g., as measured by FID against the style domain), the model's richness decreases. The stylized dataset later generated by such a model would then be too monotonous in distribution, which is not conducive to training U-GAT-IT downstream. To improve the richness of the model, we made the following improvements:
Fusion method: directly swapping layers exchanges the parameters of different layers between models, which easily causes inconsistencies and detail artifacts in the generated images; smooth model interpolation instead yields better generation quality (the figures below are all generated by a model fused via the interpolation method; a sketch of the idea follows).
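For concreteness, here is a minimal sketch of smooth weight interpolation between two generators, assuming both share the same architecture (e.g., an FFHQ source model and its transferred style model from stylegan2-ada-pytorch); the function name and `alpha` parameter are ours:

```python
import copy

import torch

@torch.no_grad()
def interpolate_generators(G_source, G_style, alpha=0.5):
    # Linearly interpolate every weight of two same-architecture
    # StyleGAN2 generators. alpha = 0 keeps the source (e.g. FFHQ)
    # model, alpha = 1 keeps the transferred style model; values in
    # between trade richness against stylization smoothly.
    G_fused = copy.deepcopy(G_source)
    style_params = dict(G_style.named_parameters())
    for name, param in G_fused.named_parameters():
        param.copy_((1.0 - alpha) * param + alpha * style_params[name])
    return G_fused
```

In practice, `alpha` (and the pair of checkpoints being blended) can be swept to pick the fusion point with the best richness/stylization balance.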
Original image, transferred model, fused model

▐ Data generation efficiency

Given a rich stylized StyleGAN2 model, how do we generate a style dataset with a rich distribution? There are two ways: sample random latent vectors and synthesize directly with the stylized generator, or invert real face images via StyleGAN inversion and decode the resulting latent codes with the stylized generator. Both paths are sketched below.
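A rough sketch of the two generation paths (assuming the generator interface of NVIDIA's stylegan2-ada-pytorch, i.e. `G.mapping` / `G.synthesis`, and a hypothetical pSp/e4e-style `inversion_encoder`):

```python
import torch

@torch.no_grad()
def sample_style_faces(G_style, n, truncation_psi=0.7, device="cuda"):
    # Method 1: draw random latents and synthesize with the stylized
    # generator. Diverse output (including backgrounds), but little
    # control over which faces appear.
    z = torch.randn(n, G_style.z_dim, device=device)
    w = G_style.mapping(z, None, truncation_psi=truncation_psi)
    return G_style.synthesis(w)

@torch.no_grad()
def stylize_real_face(G_style, inversion_encoder, image):
    # Method 2: invert a real photo into a latent code, then decode
    # that code with the stylized generator to get the corresponding
    # stylized face (the choice of inputs controls the distribution).
    w_plus = inversion_encoder(image)  # image -> W+ latent code
    return G_style.synthesis(w_plus)
```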
Method 1 provides richer stylized data (especially richer backgrounds), while Method 2 raises the proportion of usable generated data and offers a degree of control over the distribution, thereby improving the efficiency of stylized data production.

Original image; the latent vector obtained by StyleGAN inversion is fed into the "advanced face style / animation style" StyleGAN2 generator to obtain the stylized image.

▐ Style editing and selection
Is a trained style model good only for one-off data production? No, no, no. Each model can not only be used to generate data, but can also be accumulated as a basic component and a basic capability. You can not only make fine adjustments and optimizations to an existing style, but also create an entirely new style, as sketched below:
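One way to realize such fine adjustments is a per-layer variant of the interpolation shown earlier: in StyleGAN2 the coarse synthesis blocks mostly govern pose and geometry, while the fine blocks mostly govern color and texture, so blending only selected blocks retunes attributes such as pupil color or skin tone. A sketch (layer naming assumes stylegan2-ada-pytorch; the helper is ours):

```python
import copy

import torch

@torch.no_grad()
def blend_per_layer(G_source, G_style, alpha_by_resolution):
    # Blend two same-architecture generators with a separate weight per
    # synthesis-block resolution. alpha = 1.0 keeps the style model's
    # weights; lower values pull that block back toward the source model.
    G_fused = copy.deepcopy(G_style)
    source_params = dict(G_source.named_parameters())
    for name, param in G_fused.named_parameters():
        if name.startswith("synthesis.b"):  # e.g. "synthesis.b64.conv0..."
            res = int(name.split(".")[1][1:])
            a = alpha_by_resolution.get(res, 1.0)
            param.copy_(a * param + (1.0 - a) * source_params[name])
    return G_fused

# e.g. pull only the fine (color/texture) blocks halfway to the source:
# G_new = blend_per_layer(G_src, G_style,
#                         {64: 0.5, 128: 0.5, 256: 0.5, 512: 0.5, 1024: 0.5})
```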
Fine-tuning the comic style during fusion (pupil color, lips, skin tone, etc.)

Through style creation and fine-tuning we can obtain models of different styles, and thus produce facial data in different styles. Through transfer learning, style editing and optimization, and StyleGAN-based data generation, we have earned our first pot of gold: a stylized dataset with high richness, 1024×1024 resolution, and selectable styles.

Paired Data Creation Based on Unsupervised Image Translation

Unsupervised image translation can transform images from one domain to another by learning the mapping between the two domains, which makes producing image pairs possible. The well-known CycleGAN, for example, learns two generators G: X→Y and F: Y→X under a cycle-consistency constraint (F(G(x)) ≈ x and G(F(y)) ≈ y):

CycleGAN main framework

When discussing "model richness" above, I said that an overly monotonous stylized dataset is not conducive to training U-GAT-IT.
Why is this? Because a CycleGAN-style framework requires the data of the two domains to roughly form a bijection; otherwise semantic information is easily lost when translating from one domain to the other. Images generated via StyleGAN2 inversion, however, have a problem: most of the background information is lost, degenerating into simple, blurred backdrops (some of the latest papers greatly alleviate this, such as Tencent AI Lab's High-Fidelity GAN Inversion). If U-GAT-IT is trained directly on this stylized dataset together with a real-face dataset, the backgrounds of the generated counterparts easily lose a great deal of semantic information, making it hard to form valid image pairs. We therefore proposed two improvements to U-GAT-IT that keep the background fixed: a region-based variant that adds background constraints, and a Mask U-GAT-IT variant that adds a mask branch. The two differ in how strongly they favor identity versus stylization; combined with hyperparameter tuning, they give us a control space over identity and stylization. We also improved the network structure and applied model EMA, edge enhancement, and other techniques to further raise generation quality. A rough sketch of the background-constraint idea is given below.

The left image is the original; the middle and right images are unsupervised translation results, differing in how the algorithm balances identity preservation against degree of stylization.

Finally, the trained generative model runs inference on the real-person image dataset to obtain the corresponding paired stylized dataset.
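The exact losses of our region-based and Mask U-GAT-IT variants are not spelled out in this post, but as an illustration, a background constraint can be as simple as a masked reconstruction term added to the generator loss (names and weight are hypothetical; the face mask would come from a face-parsing model):

```python
import torch

def background_constraint_loss(real, generated, face_mask, weight=10.0):
    # Penalize changes outside the face region so the translated image
    # keeps the input's background. face_mask is 1 inside the face and
    # 0 elsewhere; the L1 penalty is averaged over background pixels.
    background = 1.0 - face_mask
    l1 = (background * (generated - real).abs()).sum() / (background.sum() + 1e-8)
    return weight * l1
```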
Supervised Image Translation

Based on a study of the computational efficiency of different MNN operators and modules on mobile devices, we designed the on-device model structure and tiered the model's computational budget. Combining improvements drawn from CartoonGAN, AnimeGAN, pix2pix, and related work, we finally obtained a lightweight, high-definition, highly stylized on-device model.

* Clarity is measured as the sum of absolute Laplacian gradient values (a sketch follows).

Overall training framework of the supervised image translation model
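The clarity statistic from the footnote can be computed, for instance, with OpenCV (a minimal sketch; the function name is ours):

```python
import cv2
import numpy as np

def laplacian_sharpness(image_path):
    # Sharpness statistic: sum of absolute Laplacian responses.
    # More high-frequency detail -> larger value -> crisper image.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    laplacian = cv2.Laplacian(gray, cv2.CV_64F)
    return float(np.abs(laplacian).sum())
```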
This achieves real-time face stylization effects on mobile devices.

Outlook