Recently, SIGGRAPH Asia 2023 (the 16th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia) was held in Sydney, Australia. A paper from the 3D Video Team of the Volcano Engine Multimedia Lab was accepted and presented at the conference: Live4D: A Real-time Capture System for Streamable Volumetric Video (https://dl.acm.org/doi/10.1145/3610543.3626178).

The paper introduces Live4D, a real-time, streamable volumetric video solution. The system uses deep learning and computer vision to process synchronized images from multiple cameras and reconstruct a textured mesh model of the captured subject, which is then compressed, encoded, and transmitted to each client for rendering and display. Live4D can be configured with different numbers and placements of binocular RGB cameras according to the application scenario and accuracy requirements, enabling a volumetric capture system at relatively low cost while streaming the reconstructed data to users in real time. It also supports interactive, immersive viewing: users can interact with the video for a stronger sense of presence. The technology has broad application prospects in holographic communication, virtual reality, augmented reality, and distance education.

Live4D Showcase

Technical Challenges

Volumetric video can be seen as an upgrade of traditional video. Traditional video plays 30 frames per second, while volumetric video plays 30 3D models per second. Viewers can therefore watch volumetric video content from any viewpoint and at any distance (six degrees of freedom, i.e. 6DoF), whether on a phone or computer screen or through VR/AR glasses. Some existing volumetric video solutions require hundreds of cameras capturing simultaneously, which is costly, and most real-time reconstruction solutions still have significant flaws.

Live4D Solutions

Acquisition of 3D data

In the experimental configuration, the technical team used 10 binocular RGB camera rigs to synchronously capture full-body data. To obtain depth information from each camera's viewpoint, the team adopted deep-learning-based binocular stereo matching, one of the commonly used approaches. Since existing methods could not fully meet the team's requirements for runtime and quality, the team performed distillation training based on RAFT-Stereo [1] to obtain more accurate depth during real-time inference, and used TensorRT together with custom CUDA operators to accelerate the whole framework to the required speed and accuracy.

Live4D Pipeline

To further improve depth accuracy in the face region, the team designed a region-of-interest (ROI) enhancement scheme for binocular stereo matching: the ROI is matched at finer resolution and merged back with the full image to obtain a higher-quality depth map. The team also designed background matting and depth-confidence detection to filter out the background and unreliable depth; the resulting depth map, together with the corresponding RGB image, is passed to the subsequent reconstruction stage.
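The paper does not spell out the exact confidence test, so the following is only a minimal sketch of how depth-confidence filtering combined with a foreground mask might look, assuming a standard left-right consistency check on the disparity output of a stereo network; the function name, thresholds, and calibration parameters are illustrative, not the team's implementation.

```python
import numpy as np

def filter_depth(disp_left, disp_right, fg_mask, fx, baseline, max_diff=1.0):
    """Keep disparities that pass a left-right consistency check and lie inside
    the foreground (matting) mask, then convert them to metric depth.

    disp_left, disp_right : (H, W) float disparity maps from the stereo network
    fg_mask               : (H, W) bool foreground mask from background matting
    fx, baseline          : focal length in pixels and stereo baseline in meters
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)

    # Warp each left pixel into the right view and read back the right disparity.
    xr = np.clip((xs - disp_left).round().astype(int), 0, w - 1)
    disp_reproj = np.take_along_axis(disp_right, xr, axis=1)

    # A pixel is confident if the two disparities agree within max_diff pixels.
    confident = np.abs(disp_left - disp_reproj) < max_diff
    valid = confident & fg_mask & (disp_left > 0)

    depth = np.zeros_like(disp_left)
    depth[valid] = fx * baseline / disp_left[valid]   # depth = f * B / d
    return depth, valid
```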
TSDF reconstruction and completion

After obtaining RGB and depth maps from multiple viewpoints, the technical team constructed a container in space and discretized it into small voxels. Finer results require a large number of small voxels, but non-ROI regions such as the body would then waste space and compute. The team therefore adopted a hierarchical data structure that gives the ROI a finer-grained voxel distribution while allowing voxels elsewhere to have a larger physical size, increasing detail in the ROI while reducing resource consumption.

Left: Uniform voxels @ 8 mm; Right: Hierarchical voxels @ 4 mm

The team back-projected the depth maps from multiple viewpoints into the container and, through multi-view fusion and geometric consistency checks, converted them into a representation based on truncated signed distances: a truncated signed distance field (TSDF). Because of sparse viewpoints, self-occlusion, and other issues in the observed model, the reconstructed TSDF field has many missing regions. The team therefore proposed a completion method that operates directly on the TSDF volume, using a 3D deep neural network pre-trained on a purpose-built dataset to quickly fill in the missing regions. Compared with occupancy-based methods, the team's method completes the surface better, and compared with solutions such as Function4D [2] that extract image features, it is also faster.

Comparison of completion methods

Non-rigid tracking

Like ordinary video, volumetric video must also pursue temporal stability. Since each frame of volumetric video is a textured 3D mesh, the team uses non-rigid tracking to maintain temporal consistency of the triangle mesh. The deformation field over the entire reconstructed surface is expressed with embedded deformation nodes (EDNodes) [3], and the resulting local ICP problem is solved efficiently on the GPU with the Levenberg-Marquardt (LM) algorithm. Using the deformation field, the team splits each voxel into a tetrahedral grid and blends the TSDF field over time, effectively applying temporal mean filtering so that the implicit surface reconstruction remains stable over time.

Left -> Right: Mesh becomes stable over the time sequence

Another problem is that tracking methods such as Fusion4D [4] can fail when the motion between consecutive frames is too large, causing serious errors in the final reconstructed mesh. Specifically, the team evaluates the alignment error of each voxel between the deformed TSDF field and the completed TSDF field; misaligned voxels are treated as tracking failures, and for these voxels the completion result is trusted over the tracking result. This fusion approach also produces better results in the scenes described above, where large motion causes tracking to fail.

Comparison with the Fusion4D method on tracking failures
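As a concrete illustration of the multi-view fusion step described above (without the hierarchical voxel layout, geometric consistency checks, learned completion, or GPU acceleration that Live4D uses), here is a minimal dense TSDF fusion sketch in NumPy; all names and parameters are illustrative.

```python
import numpy as np

def fuse_tsdf(depth_maps, intrinsics, extrinsics, origin, voxel_size, dims, trunc=0.02):
    """Fuse several metric depth maps into a truncated signed distance field.

    depth_maps : list of (H, W) depth images (0 = invalid)
    intrinsics : list of 3x3 camera matrices K
    extrinsics : list of 4x4 world-to-camera transforms
    origin     : (3,) world position of voxel (0, 0, 0)
    voxel_size : edge length of one voxel in meters
    dims       : (X, Y, Z) number of voxels along each axis
    """
    tsdf = np.ones(dims, dtype=np.float32)      # 1 = far from any observed surface
    weight = np.zeros(dims, dtype=np.float32)

    # World coordinates of every voxel center, flattened to (N, 3).
    grid = np.stack(np.meshgrid(*[np.arange(d) for d in dims], indexing="ij"), -1)
    points = origin + (grid.reshape(-1, 3) + 0.5) * voxel_size

    for depth, K, T in zip(depth_maps, intrinsics, extrinsics):
        cam = (T[:3, :3] @ points.T + T[:3, 3:4]).T          # world -> camera
        z = cam[:, 2]
        z_safe = np.where(z > 1e-6, z, 1e-6)
        uv = (K @ cam.T).T
        u = np.round(uv[:, 0] / z_safe).astype(int)
        v = np.round(uv[:, 1] / z_safe).astype(int)

        h, w = depth.shape
        ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        d = np.zeros_like(z)
        d[ok] = depth[v[ok], u[ok]]
        ok &= d > 0

        # Signed distance along the viewing ray, truncated to [-1, 1] in trunc units.
        sdf = np.clip((d - z) / trunc, -1.0, 1.0)
        upd = ok & (sdf > -1.0)                 # ignore voxels far behind the surface

        flat_t = tsdf.reshape(-1)
        flat_w = weight.reshape(-1)
        flat_t[upd] = (flat_t[upd] * flat_w[upd] + sdf[upd]) / (flat_w[upd] + 1.0)
        flat_w[upd] += 1.0

    return tsdf, weight
```

The zero crossing of the fused field can then be extracted with marching cubes; in Live4D the missing regions of this field are additionally filled in by the pre-trained 3D completion network before surface extraction.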
Texture generation

Texture generation needs to solve two problems. One is computing the color of any point on the mesh surface; the other is computing a mapping from the 3D mesh to a 2D image and storing the computed colors in that image so it can be transmitted and rendered by the graphics pipeline.

Multi-view texture blending

The first step is computing the color of the mesh surface. The team uses a multi-view blending algorithm to compute the texture, designing a blending weight that jointly considers shading boundaries and normal directions, which eliminates color differences, seams, and other quality issues that arise in multi-view blending.

Surface reparameterization

The team also designed a parallel, efficient reparameterization algorithm. It presets orthogonal projection directions by sampling a sphere and runs a depth-peeling algorithm on the reconstructed model to divide it into visibility layers. Each face is assigned candidate labels according to projection direction and visibility layer, and a graph-cut formulation over all candidate labels of all faces partitions the mesh surface into connected regions. The team builds a half-edge structure for the mesh and implements a parallel loopy belief propagation algorithm to optimize the labeling and obtain an approximately optimal solution. All connected regions are then flattened with a planar reparameterization and packed into the final texture map.
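The exact blending weight used by the team is not given in detail, so the sketch below only illustrates the general idea of normal-direction-weighted multi-view blending for per-vertex colors; the visibility masks and the weight exponent are assumptions, and the shading-boundary term from the paper is omitted.

```python
import numpy as np

def blend_vertex_colors(verts, normals, images, intrinsics, extrinsics,
                        vis_masks, power=2.0):
    """Blend per-vertex colors from several calibrated RGB views.

    Each view contributes with a weight proportional to how frontally it sees
    the surface (dot of normal and view direction raised to `power`); vertices
    marked occluded in vis_masks contribute nothing for that view.
    """
    colors = np.zeros((len(verts), 3), dtype=np.float64)
    weights = np.zeros(len(verts), dtype=np.float64)

    for img, K, T, visible in zip(images, intrinsics, extrinsics, vis_masks):
        cam = (T[:3, :3] @ verts.T + T[:3, 3:4]).T           # world -> camera
        z = np.maximum(cam[:, 2], 1e-6)
        uv = (K @ cam.T).T
        u = np.clip(np.round(uv[:, 0] / z).astype(int), 0, img.shape[1] - 1)
        v = np.clip(np.round(uv[:, 1] / z).astype(int), 0, img.shape[0] - 1)

        # View direction from the vertex toward this camera, in world space.
        cam_center = -T[:3, :3].T @ T[:3, 3]
        view_dir = cam_center - verts
        view_dir /= np.linalg.norm(view_dir, axis=1, keepdims=True) + 1e-9

        w = np.clip(np.sum(normals * view_dir, axis=1), 0.0, 1.0) ** power
        w = np.where(visible, w, 0.0)

        colors += w[:, None] * img[v, u]
        weights += w

    return colors / np.maximum(weights, 1e-9)[:, None]
```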
Compression and transmission

The directly reconstructed mesh contains a huge number of vertices and faces, up to millions of faces. To facilitate network transmission and reduce bandwidth usage, the reconstruction results must be simplified and compressed. Besides preserving the original geometric features as much as possible, the simplification algorithm in this scenario must also run in real time and protect the geometric features of ROI regions such as the face.

Draco [5] is then used to further compress the simplified mesh's connectivity and vertex information into a binary stream using connectivity coding, quantization, and entropy coding. Texture maps are encoded with H.265. The team stores synchronized mesh information in the SEI of the video stream and reuses an existing RTC pipeline to transmit the 3D data.
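The SEI packaging itself depends on the team's RTC pipeline and is not described in detail, so as a stand-in, the sketch below only shows one possible way to prefix a Draco-compressed mesh payload with a frame index and timestamp so a receiver could match it against the corresponding H.265 texture frame; the header layout, magic tag, and function names are hypothetical.

```python
import struct

MAGIC = b"L4DM"  # hypothetical 4-byte tag marking a per-frame mesh payload

def pack_mesh_payload(frame_index, pts_90khz, draco_bytes):
    """Prefix a compressed mesh with its frame index and presentation timestamp
    (90 kHz units, matching video PTS) so geometry and texture stay in sync."""
    header = struct.pack("<4sIQI", MAGIC, frame_index, pts_90khz, len(draco_bytes))
    return header + draco_bytes

def unpack_mesh_payload(buf):
    """Split a payload back into (frame_index, pts_90khz, draco_bytes)."""
    magic, frame_index, pts_90khz, size = struct.unpack_from("<4sIQI", buf, 0)
    assert magic == MAGIC, "not a mesh payload"
    offset = struct.calcsize("<4sIQI")
    return frame_index, pts_90khz, buf[offset:offset + size]
```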
Application prospects and outlook

This technical solution supports real-time viewing on a variety of terminal devices and has broad prospects for deployment. For example, it can enable holographic communication on 3D TVs, improving the efficiency and immersion of remote work and remote communication; it can power free-viewpoint live streaming, strengthening the connection between streamers and their audience and enabling various interactive formats; and it provides new media forms for entertainment, education, and other scenarios.
About the Volcano Engine Multimedia Lab

The Volcano Engine Multimedia Lab is a research team under ByteDance dedicated to exploring cutting-edge technologies in the multimedia field and participating in international standardization. Many of its innovative algorithms and software/hardware solutions are widely used in the multimedia services of products such as Douyin and Xigua Video, and it provides technical services to Volcano Engine's enterprise customers. Since its establishment, the lab has had many papers accepted at top international conferences and flagship journals, and has won several international technical competition championships, industry innovation awards, and best paper awards.

References
[1] RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching, 3DV 2021
[2] Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors, CVPR 2021
[3] Embedded Deformation for Shape Manipulation, SIGGRAPH 2007
[4] Fusion4D: Real-time Performance Capture of Challenging Scenes, ToG 2016
[5] Draco, https://github.com/google/draco