Volcano Engine Live4D volumetric video solution selected for SIGGRAPH Asia 2023, supports interactive experiences

Recently, SIGGRAPH Asia 2023 (The 16th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia) was held in Sydney, Australia. A paper from the 3D Video Team of the Volcano Engine Multimedia Lab was accepted and presented at the conference:

Live4D: A Real-time Capture System for Streamable Volumetric Video (https://dl.acm.org/doi/10.1145/3610543.3626178)

This paper introduces Live4D, a real-time, streamable volumetric video solution. The technology uses deep learning and computer vision to process synchronized images from multiple cameras, reconstructing a textured mesh model of the captured subject, which is then compressed, encoded, and transmitted to each client for rendering and display.

Live4D can be configured with different numbers and placements of binocular RGB cameras depending on the application scenario and accuracy requirements, making it possible to build a volumetric capture system at relatively low cost and to stream the reconstructed data to users in real time. It also supports interactive, immersive experiences: viewers can interact with the video for a stronger sense of presence. The technology has broad application prospects in holographic communication, virtual reality, augmented reality, and distance education.

Live4D Showcase

Technical Challenges

Volumetric video can be seen as an upgrade of traditional video: where traditional video plays 30 image frames per second, volumetric video plays 30 3D models per second. Viewers can therefore watch the content from any viewpoint and at any distance (six degrees of freedom, 6DoF), whether on a phone or computer screen or through VR/AR glasses.

Some existing volumetric video solutions require hundreds of cameras capturing simultaneously, which is costly, and most real-time reconstruction solutions still have significant flaws.

Live4D Solutions

Acquisition of 3D data

In the experimental configuration, the technical team used 10 pairs of binocular RGB cameras to capture full-body data synchronously. To obtain depth from each camera's viewpoint, the team adopted deep-learning-based binocular stereo matching, one of the most commonly used approaches. Because existing methods could not fully meet the team's requirements for both runtime and quality, the team performed distillation training based on RAFT-Stereo [1] to obtain more accurate depth during real-time inference, and used TensorRT together with custom CUDA operators to accelerate the whole framework to the required speed and accuracy.
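
As a concrete illustration of the depth-acquisition step, the sketch below converts a disparity map from one rectified binocular pair into metric depth using the standard pinhole relation Z = f·B/d. This is a minimal numpy example; the focal length, baseline, and disparity values are placeholders for illustration and are not taken from the paper.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=0.1):
    """Convert a disparity map from a rectified stereo pair into metric depth.

    disparity : (H, W) array of disparities in pixels (e.g. output of a
                RAFT-Stereo-style network).
    focal_px  : focal length of the rectified cameras, in pixels.
    baseline_m: distance between the two camera centers, in meters.
    """
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > min_disp                      # reject tiny / negative disparities
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Example: a 720p disparity map from one binocular pair (values are illustrative).
disp = np.full((720, 1280), 40.0, dtype=np.float32)
depth = disparity_to_depth(disp, focal_px=1100.0, baseline_m=0.12)
print(depth[0, 0])   # ~3.3 m for a 40 px disparity
```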

Live4D Pipeline

To further improve depth accuracy in the face region, the technical team designed a region-of-interest (ROI) enhancement scheme for binocular stereo matching: the ROI is matched at finer granularity, and the refined depth is merged back into the full-frame result to obtain a higher-quality depth map. The team also designed background matting and depth-confidence checks to filter out the background and unreliable estimates; the resulting depth map is sent, together with the corresponding RGB image, to the subsequent reconstruction stage.
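
A minimal sketch of this merge-and-filter idea is shown below, assuming the refined ROI depth, the foreground (matting) mask, and a per-pixel confidence map are already available. The exact merging and confidence criteria used by the team are not described in detail, so this is only an illustrative simplification.

```python
import numpy as np

def merge_roi_and_filter(depth_full, depth_roi, roi_box, fg_mask, confidence, conf_thresh=0.5):
    """Merge a refined ROI depth crop into the full-frame depth map, then
    discard background pixels and low-confidence estimates.

    depth_full : (H, W) full-frame depth from stereo matching.
    depth_roi  : (h, w) refined depth for the region of interest (e.g. the face).
    roi_box    : (top, left, h, w) location of the ROI in the full frame.
    fg_mask    : (H, W) boolean foreground (matting) mask.
    confidence : (H, W) per-pixel confidence in [0, 1].
    """
    top, left, h, w = roi_box
    merged = depth_full.copy()
    merged[top:top + h, left:left + w] = depth_roi   # overwrite ROI with refined depth

    keep = fg_mask & (confidence >= conf_thresh)     # background / unreliable pixels -> 0
    merged[~keep] = 0.0
    return merged
```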

Left: without ROI enhancement; Right: with ROI enhancement

TSDF reconstruction and completion

After obtaining RGB and depth maps from multiple viewpoints, the technical team constructed a container volume in space and discretized it into small voxels. Finer results require a large number of small voxels, but uniformly fine voxels waste memory and compute in non-ROI regions such as the body. The team therefore adopted a hierarchical data structure in which the ROI gets a finer voxel resolution while voxels elsewhere have a larger physical size, increasing detail in the ROI while reducing resource consumption.
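
The snippet below sketches a toy two-level version of such a hierarchy: coarse voxels cover the whole capture volume, and coarse cells inside an ROI bounding box are refined to half the voxel size. The resolutions, the 2x split factor, and the dictionary-based storage are assumptions for illustration; the actual system uses its own GPU data structure.

```python
import numpy as np

class TwoLevelVoxelGrid:
    """Minimal two-level voxel hierarchy: coarse voxels everywhere, with coarse
    voxels inside the ROI carrying finer children (here a 2x2x2 split)."""

    def __init__(self, volume_min, volume_size_m, coarse_res_m=0.008, split=2):
        self.origin = np.asarray(volume_min, dtype=np.float32)
        self.coarse = coarse_res_m                 # e.g. 8 mm coarse voxels
        self.fine = coarse_res_m / split           # e.g. 4 mm inside the ROI
        self.dims = np.ceil(np.asarray(volume_size_m) / coarse_res_m).astype(int)
        self.refined = set()                       # coarse indices that carry fine children

    def refine_roi(self, roi_min, roi_max):
        """Mark every coarse voxel overlapping the ROI box as refined."""
        lo = np.floor((np.asarray(roi_min) - self.origin) / self.coarse).astype(int)
        hi = np.ceil((np.asarray(roi_max) - self.origin) / self.coarse).astype(int)
        lo = np.clip(lo, 0, self.dims - 1)
        hi = np.clip(hi, 0, self.dims)
        for x in range(lo[0], hi[0]):
            for y in range(lo[1], hi[1]):
                for z in range(lo[2], hi[2]):
                    self.refined.add((x, y, z))

    def voxel_size_at(self, point):
        """Effective voxel size at a world-space point."""
        idx = tuple(np.floor((np.asarray(point) - self.origin) / self.coarse).astype(int))
        return self.fine if idx in self.refined else self.coarse

# Example: a 2 m cube with 8 mm voxels, refined to 4 mm around the head region.
grid = TwoLevelVoxelGrid(volume_min=(-1, 0, -1), volume_size_m=(2, 2, 2))
grid.refine_roi(roi_min=(-0.2, 1.4, -0.2), roi_max=(0.2, 1.8, 0.2))
print(grid.voxel_size_at((0.0, 1.6, 0.0)), grid.voxel_size_at((0.5, 0.5, 0.5)))
```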

Left: uniform voxels @ 8 mm; Right: hierarchical voxels @ 4 mm

The team back-projected the depth maps from multiple viewpoints into the container and, through multi-view fusion and geometric consistency checks, converted them into a truncated signed distance field (TSDF) representation. Because the viewpoints are sparse and the subject self-occludes, the reconstructed TSDF has many missing regions. The team therefore proposed a completion method operating on the TSDF volume: a 3D deep neural network pre-trained on a purpose-built dataset quickly fills in the missing regions. Compared with occupancy-based methods, the team's approach produces better completions, and compared with solutions such as Function4D [2] that extract image features, it is also faster.
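
For readers unfamiliar with TSDF integration, the sketch below shows the classic per-view fusion step: project each voxel center into the depth camera, compute the signed distance between the measured depth and the voxel's depth along the ray, truncate it, and blend it into the volume with a running weighted average. This is textbook TSDF fusion, not the paper's exact implementation; camera parameters and the truncation distance are placeholders.

```python
import numpy as np

def integrate_depth_into_tsdf(tsdf, weights, voxel_centers, depth, K, cam_T_world, trunc=0.02):
    """Fuse one depth map into a TSDF volume (classic TSDF integration).

    tsdf, weights : flat (N,) arrays holding the current field and fusion weights.
    voxel_centers : (N, 3) world-space voxel centers.
    depth         : (H, W) depth map in meters (0 = invalid).
    K             : (3, 3) camera intrinsics.
    cam_T_world   : (4, 4) world-to-camera extrinsics.
    trunc         : truncation distance in meters.
    """
    # Transform voxel centers into the camera frame and project with the pinhole model.
    pts_h = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    cam = (cam_T_world @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)          # avoid division by zero behind the camera
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / z_safe).astype(int)
    v = np.round(uv[:, 1] / z_safe).astype(int)

    H, W = depth.shape
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0

    # Signed distance along the viewing ray, truncated and normalized to [-1, 1].
    sdf = np.clip(d - z, -trunc, trunc) / trunc
    update = valid & ((d - z) > -trunc)          # skip voxels far behind the observed surface

    # Weighted running average of the TSDF values.
    w_new = weights[update] + 1.0
    tsdf[update] = (tsdf[update] * weights[update] + sdf[update]) / w_new
    weights[update] = w_new
    return tsdf, weights
```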

Comparison of completion methods

Non-rigid tracking

Like ordinary video, volumetric video needs temporal stability. Since each frame of a volumetric video is in fact a textured 3D mesh, the technical team uses non-rigid tracking to keep the triangle mesh temporally consistent.

The team represents the deformation field over the entire surface of the reconstructed object using embedded deformation nodes (EDNodes) [3]. The resulting local ICP problem is solved efficiently on the GPU with the Levenberg-Marquardt (LM) algorithm. Using the deformation field, the team splits each voxel into a tetrahedral grid and blends the TSDF field over time, effectively applying temporal mean filtering so that the implicit surface reconstruction stays stable across frames.
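
The sketch below shows how an embedded-deformation graph warps mesh vertices: each vertex is deformed by a weighted blend of the rigid transforms attached to its nearest deformation nodes, following the formulation of Sumner et al. [3]. The k-nearest-node selection and the Gaussian falloff weights are common choices and only stand in for whatever weighting the Live4D solver actually uses.

```python
import numpy as np

def warp_vertices(vertices, node_pos, node_R, node_t, k=4, sigma=0.05):
    """Warp mesh vertices with an embedded-deformation graph (EDNodes [3]).

    Each vertex is deformed by its k nearest deformation nodes:
        v' = sum_j w_j * ( R_j @ (v - g_j) + g_j + t_j )
    vertices : (V, 3) rest-pose vertex positions.
    node_pos : (N, 3) node positions g_j.
    node_R   : (N, 3, 3) per-node rotations.
    node_t   : (N, 3) per-node translations.
    """
    warped = np.zeros_like(vertices)
    for i, v in enumerate(vertices):
        d2 = np.sum((node_pos - v) ** 2, axis=1)
        nn = np.argsort(d2)[:k]                       # k nearest nodes
        w = np.exp(-d2[nn] / (2.0 * sigma ** 2))      # Gaussian falloff weights
        w /= w.sum() + 1e-8
        for j, wj in zip(nn, w):
            warped[i] += wj * (node_R[j] @ (v - node_pos[j]) + node_pos[j] + node_t[j])
    return warped
```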

Left to right: the mesh becomes stable over the time sequence

Another problem is that tracking methods such as Fusion4D [4] can fail when the motion between consecutive frames is too large, causing serious errors in the final reconstructed mesh. To handle this, the team evaluates, for each voxel, the alignment error between the deformed (tracked) TSDF field and the completed TSDF field. Misaligned voxels are treated as regions where tracking failed, and for those voxels the completion result is trusted over the tracking result. This fusion strategy also performs well in the fast-motion scenes described above where tracking breaks down.
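
A minimal version of this per-voxel selection is sketched below: voxels where the tracked and completed TSDFs disagree beyond a threshold are flagged as tracking failures and take the completed value. The simple absolute-difference criterion and the threshold value are illustrative assumptions, not the paper's exact error metric.

```python
import numpy as np

def fuse_tracked_and_completed(tsdf_tracked, tsdf_completed, error_thresh=0.3):
    """Per-voxel choice between the tracked (non-rigidly warped) TSDF and the
    completed TSDF, using their disagreement as a tracking-failure indicator."""
    error = np.abs(tsdf_tracked - tsdf_completed)     # alignment error per voxel
    failed = error > error_thresh                     # likely tracking failure
    fused = np.where(failed, tsdf_completed, tsdf_tracked)
    return fused, failed
```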

Comparison with the Fusion4D method in a tracking-failure case

Texture Generation

Texture generation needs to solve two problems: first, computing the color of any point on the surface of the mesh model; second, computing a mapping from the 3D mesh to a 2D image and storing the computed colors in that image so it can be easily transmitted and rendered by the graphics pipeline.

Multi-view texture blending

The first step is to compute the color of the mesh surface. The team uses a multi-view blending algorithm to compute the texture, designing blending weights that jointly account for occlusion boundaries and normal directions, which eliminates color differences, seams, and other quality issues common in multi-view blending.
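
The sketch below illustrates one common form of view-dependent blending weight, where views that see a surface point more head-on (larger cosine between the surface normal and the view direction) receive higher weight and occluded views get zero. The paper's weight additionally accounts for boundaries; the cosine-power formulation here is a standard stand-in, not the exact Live4D weight.

```python
import numpy as np

def blend_vertex_colors(point, normal, cam_centers, view_colors, visible, power=2.0):
    """Blend per-view colors for one surface point using view-dependent weights.

    point       : (3,) surface point.
    normal      : (3,) unit surface normal at the point.
    cam_centers : (C, 3) camera centers.
    view_colors : (C, 3) color sampled for this point from each view.
    visible     : (C,) boolean visibility of the point in each view.
    """
    to_cam = cam_centers - point
    to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
    cosine = np.clip(to_cam @ normal, 0.0, None)       # favor head-on views
    w = (cosine ** power) * visible                     # occluded views get zero weight
    if w.sum() < 1e-8:
        return view_colors.mean(axis=0)                 # fallback: plain average
    w /= w.sum()
    return (w[:, None] * view_colors).sum(axis=0)
```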

Surface reparameterization

At the same time, the team designed a parallel, efficient reparameterization algorithm. Orthographic projection directions are preset by sampling a sphere, and a depth-peeling pass over the reconstructed model partitions the surface into visibility layers. Each patch is assigned candidate labels defined by projection direction and visibility layer, and a graph-cut over all candidate labels of all patches partitions the mesh surface into connected regions. The team builds a half-edge structure for the mesh and implements a parallel loopy belief propagation algorithm to optimize toward an approximately optimal labeling. Finally, each connected region is flattened with a planar reparameterization and the pieces are packed to produce the final texture atlas.
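
To make the labeling idea concrete, the toy sketch below assigns each face the projection direction it faces most directly, using six axis-aligned directions as a minimal stand-in for directions sampled on a sphere. This greedy assignment ignores the smoothness term; the actual pipeline refines such labels jointly with the graph-cut / loopy belief propagation step described above.

```python
import numpy as np

def sample_projection_directions():
    """Six axis-aligned orthographic projection directions (a minimal stand-in
    for sampling directions on a sphere)."""
    return np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=np.float32)

def greedy_face_labels(face_normals, directions):
    """Assign each face the projection direction it faces most directly."""
    scores = face_normals @ directions.T          # (F, D) alignment of faces vs. directions
    return np.argmax(scores, axis=1)              # per-face label = best direction

# Example: one face pointing +x, one pointing roughly +z.
normals = np.array([[1.0, 0.0, 0.0], [0.1, 0.0, 0.99]])
print(greedy_face_labels(normals, sample_projection_directions()))   # -> [0 4]
```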

Compression and transmission

The directly reconstructed mesh is huge, containing up to millions of faces. To ease network transmission and reduce bandwidth, the 3D reconstruction results must be simplified and compressed. Besides preserving the original geometric features as much as possible, the current scenario also imposes requirements on the real-time performance of the simplification algorithm and on preserving the geometry of ROI regions such as the face.

  • The team developed a GPU simplification algorithm that incorporates ROI information: it evaluates the quadric error metrics of multiple groups of edges in parallel and collapses the lowest-error edge in each group. The error of edges inside the ROI is scaled up so that the ROI loses less detail during simplification (a sketch of the edge cost follows this list).
  • Draco and H.265 are used for compression and transmission, with mesh information kept in sync via SEI messages in the video stream.
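
The sketch below shows the quadric-error-metric (QEM) cost behind such edge collapses, with a simple multiplicative penalty for edges inside the ROI. The per-face quadric and the vᵀQv cost follow the standard Garland-Heckbert formulation; the ROI penalty factor and the GPU grouping strategy are simplifications of what the team actually implemented.

```python
import numpy as np

def face_quadric(p0, p1, p2):
    """Fundamental quadric Q = n_h n_h^T of the plane through a triangle,
    where n_h = (a, b, c, d) and ax + by + cz + d = 0 (Garland-Heckbert QEM)."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / (np.linalg.norm(n) + 1e-12)
    d = -np.dot(n, p0)
    n_h = np.append(n, d)
    return np.outer(n_h, n_h)

def edge_collapse_cost(Q1, Q2, target, in_roi, roi_penalty=10.0):
    """Quadric error of collapsing an edge to `target`, scaled up inside the ROI
    so ROI edges are collapsed later and keep more detail."""
    Q = Q1 + Q2
    v = np.append(target, 1.0)
    cost = float(v @ Q @ v)
    return cost * roi_penalty if in_roi else cost

# Example: cost of collapsing the shared edge (p1, p2) onto p0. In a full
# simplifier, Q1 and Q2 would be the summed quadrics of all faces around
# each endpoint; here each is a single face quadric for brevity.
p0, p1, p2, p3 = map(np.array, ([0, 0, 0.], [1, 0, 0.], [0, 1, 0.], [1, 1, 0.5]))
Qa = face_quadric(p0, p1, p2)
Qb = face_quadric(p1, p3, p2)
print(edge_collapse_cost(Qa, Qb, p0, in_roi=False))   # ~0.17: squared distance to the second plane
```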

Draco [5] is used to further compress the simplified mesh's connectivity and vertex data into a binary stream via connectivity coding, quantization, and entropy coding, while texture maps are encoded with H.265. The team stores the synchronized mesh information in the SEI of the video stream and reuses the existing RTC pipeline to transmit the 3D data.

Overall system performance:

  • Frame rate: ~30 fps
  • Overall bandwidth: 15~20 Mbps
  • Mesh bandwidth: ~12 Mbps
  • Texture bandwidth: ~8 Mbps
  • Algorithm delay: ~100 ms

Application Implementation and Outlook

This technical solution supports real-time viewing on a variety of terminal devices and has broad prospects for deployment. For example, it can enable holographic communication on 3D displays, improving the efficiency and immersion of remote work and remote communication; it can power free-viewpoint live streaming, strengthening the connection between streamers and their audience and enabling new interactive formats; and it offers new media forms for entertainment, education, and other scenarios.

About Volcano Engine Multimedia Lab

The Volcano Engine Multimedia Lab is a research team under ByteDance dedicated to exploring cutting-edge multimedia technologies and participating in international standardization. Many of its innovative algorithms and software/hardware solutions are widely used in the multimedia services of products such as Douyin and Xigua Video, and the lab also provides technical services to Volcano Engine's enterprise customers. Since its establishment, the lab has had many papers accepted at top international conferences and flagship journals, and has won several international technical competition championships, industry innovation awards, and best paper awards.

References

[1] RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching, 3DV 2021

[2] Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors, CVPR 2021

[3] Embedded Deformation for Shape Manipulation, SIGGRAPH 2007

[4] Fusion4D: Real-time Performance Capture of Challenging Scenes, ToG 2016

[5] Draco, https://github.com/google/draco
