SLAM (Simultaneous Localization and Mapping) is widely regarded as a cutting-edge direction in visual spatial-positioning technology. It addresses the problem of a robot localizing itself and building a map of its surroundings while moving through an unknown environment. In this piece, Zhao Ji, a senior researcher at Yuemian Technology, starts from SLAM and walks through the technology in depth.

Zhao Ji is a senior researcher at Yuemian Technology. He received his Ph.D. from Huazhong University of Science and Technology in 2012 and was a postdoctoral fellow at the Robotics Institute of CMU from 2012 to 2014. At Samsung Research he worked on depth cameras, SLAM, and human-computer interaction; he currently focuses on research and development of spatial perception technology.

Technology is advancing quickly, and improving the user experience in AR/VR, robotics, drones, and driverless cars still requires more cutting-edge technologies; SLAM is one of them. Someone once drew an analogy: a driverless car or a robot without SLAM is like a mobile phone cut off from Wi-Fi and mobile data. SLAM solves two problems at once, locating the camera in space and building a map of the environment, and it can already be seen in several of today's popular startup directions.
SLAM technology has been developing for more than 30 years and touches many technical fields. Because it comprises many steps, each of which can be implemented with different algorithms, SLAM remains a hot research direction in both robotics and computer vision.

A brief analysis of SLAM technology

SLAM stands for Simultaneous Localization and Mapping. It tries to answer the following question: as a robot moves through an unknown environment, how can it determine its own trajectory from observations of the environment while simultaneously building a map of that environment? SLAM technology is the sum of the techniques involved in achieving this goal.

SLAM covers a wide range of approaches, and there are many ways to classify it: by sensor, by application scenario, or by core algorithm. By sensor, it can be divided into 2D/3D SLAM based on lidar, RGBD SLAM based on depth cameras, visual SLAM based on visual sensors (hereafter vSLAM), and visual-inertial odometry based on a visual sensor plus an inertial measurement unit (hereafter VIO).

2D SLAM based on lidar is relatively mature. As early as 2005, Sebastian Thrun et al.'s classic book "Probabilistic Robotics" studied and summarized 2D SLAM thoroughly and essentially fixed the framework of lidar SLAM. The commonly used grid-mapping method has a history of more than 10 years. In 2016, Google open-sourced the lidar SLAM system Cartographer, which can fuse IMU information and handles 2D and 3D SLAM in a unified way. 2D SLAM has already been applied successfully in robot vacuums.
RGBD SLAM based on depth cameras has also developed rapidly in recent years. Since the launch of Microsoft's Kinect, a wave of RGBD SLAM research has been set off, and in just a few years several important algorithms have emerged, such as KinectFusion, Kintinuous, Voxel Hashing, and DynamicFusion. Microsoft's HoloLens very likely integrates RGBD SLAM, which achieves very good results wherever the depth sensor can work.

Visual sensors include monocular, binocular (stereo), and fisheye cameras. Because they are cheap and work both indoors and outdoors, vSLAM is a research hotspot. Early vSLAM systems such as MonoSLAM were largely a continuation of the filtering methods used in robotics; today, optimization methods from computer vision are used more often, specifically bundle adjustment from structure-from-motion. Depending on how visual features are used, vSLAM can be divided into feature-based methods and direct methods. Representative vSLAM algorithms include ORB-SLAM, SVO, and DSO.

Visual sensors cannot work in textureless areas. An inertial measurement unit (IMU) measures angular velocity and acceleration with its built-in gyroscope and accelerometer, from which the camera's pose can be inferred, but the inferred pose accumulates error over time. Visual sensors and IMUs are highly complementary, so VIO, which fuses the measurements of the two, is also a research hotspot. By fusion method, VIO can be divided into filtering-based and optimization-based approaches; representative work includes EKF-based filters, MSCKF, IMU preintegration, and OKVIS. Google's Tango tablet achieved good VIO. In general, compared with SLAM based on lidar or depth cameras, vSLAM and VIO are not yet mature and are harder to make robust in practice; they usually need to be fused with other sensors or restricted to controlled environments.
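The cumulative error of IMU dead reckoning mentioned above is easy to see numerically: integrating a small constant accelerometer bias once gives a velocity error that grows linearly in time, and integrating twice gives a position error that grows quadratically. A minimal 1-D sketch (all numbers are made up for illustration):

```python
import numpy as np

dt = 0.01                     # assume a 100 Hz IMU
bias = 0.05                   # assumed constant accelerometer bias, m/s^2
true_acc = np.zeros(1000)     # the sensor is actually at rest for 10 s
measured = true_acc + bias    # biased measurements

vel = np.cumsum(measured) * dt   # first integration: velocity error grows linearly
pos = np.cumsum(vel) * dt        # second integration: position error grows quadratically

# After t = 10 s the drift is roughly 0.5 * bias * t^2 = 2.5 m,
# even though the sensor never moved.
print(pos[-1])
```

This quadratic drift is exactly why pure IMU integration cannot stand alone and why fusing it with drift-free (but slower, texture-dependent) visual measurements is so attractive.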
Why is visual SLAM difficult?

We can make a qualitative analysis from the sensors' measurements. A lidar or RGBD camera directly obtains a point cloud of the environment: each point tells us that there is an obstacle point in a certain direction at a certain distance. A visual sensor obtains a grayscale or color image: a pixel tells us only that there is an obstacle point in a certain direction, plus the local appearance around it, but not its distance. To compute the distance, the camera must be moved to another position, the point observed again, and the distance inferred by triangulation.

The principle is clear, but practice is harder. First, you need to find corresponding points between the two images, which involves feature extraction and matching, or matching between quasi-dense points. Even today, no feature extraction and matching algorithm fully satisfies vSLAM in both quality and speed. For common feature extractors, quality can roughly be ranked SIFT > SURF > ORB > FAST, and efficiency FAST > ORB > SURF > SIFT (the left side of each ">" is better; quality covers matching accuracy, the number of feature points, their spatial distribution, and so on). To strike a compromise between quality and efficiency, FAST or ORB is usually chosen, and the better-performing SIFT and SURF have to be abandoned. Second, the relationship between the image coordinates and the spatial coordinates of matched points is nonlinear: 2D-2D correspondences satisfy the epipolar constraint, and 2D-3D correspondences satisfy the PnP constraint.
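The triangulation step described above can be sketched with the standard linear (DLT) method: given two projection matrices and one matched pixel in each view, stack the projection constraints and take the SVD null vector. This is a minimal illustration, not a production routine; the identity intrinsics and the 1 m baseline are assumptions for the example.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: pixel coordinates (u, v)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Two views of the same point: an identity camera and a camera shifted
# 1 m along x (the baseline). Intrinsics are identity for simplicity.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])   # a point 4 m in front of the cameras
x1 = P1 @ np.append(X_true, 1.0)
x1 = x1[:2] / x1[2]                  # projection into view 1
x2 = P2 @ np.append(X_true, 1.0)
x2 = x2[:2] / x2[2]                  # projection into view 2

print(triangulate(P1, P2, x1, x2))   # recovers approximately [0.5, 0.2, 4.0]
```

Note that with a single view (one camera) the depth is unobservable; only the second, displaced observation makes the linear system well-posed, which is exactly the point made in the text.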
These matches are numerous: two frames typically yield dozens to hundreds of them. They introduce many constraints that entangle the variables to be estimated. To obtain a good estimate, one usually sets up an optimization problem and optimizes several variables jointly. In theory this is just a nonlinear least-squares problem, but it is not easy to solve well: the constraints are nonlinear and numerous, the measurements contain noise and outliers, and computation time must stay within budget. Keyframe techniques are now widely used, along with many methods for controlling the problem size and preserving its sparsity. In the factor-graph view of this problem, nodes represent the variables to be optimized (camera poses, spatial coordinates of feature points) and edges represent the constraints between them (epipolar geometry, PnP, etc.).

The two difficulties of vSLAM analyzed above make front-end feature tracking hard and back-end optimization hard, respectively. Building an efficient and robust vSLAM system remains very challenging. On efficiency: SLAM must run in real time; if it cannot, it is not SLAM. Without the real-time requirement, structure-from-motion recovers structure with better quality. On robustness: a fragile system leads to a poor user experience and limited functionality.

vSLAM core algorithms

The preparatory stage includes sensor selection and various calibrations. Since PTAM, the framework of visual SLAM has been essentially fixed: it usually comprises three threads, namely front-end tracking, back-end mapping and optimization, and loop closure.
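The nonlinear least-squares optimization mentioned above can be illustrated on a toy problem: estimating the unknown depth Z of a point with known lateral position X from pinhole observations u_i = (X - t_i) / Z taken at known camera positions t_i. This is a deliberately tiny, hypothetical stand-in for bundle adjustment, solved with a few Gauss-Newton steps:

```python
import numpy as np

# Known quantities (synthetic, noiseless for clarity).
X = 2.0                              # lateral position of the point
Z_true = 5.0                         # depth we want to recover
t = np.array([0.0, 0.5, 1.0, 1.5])   # known camera positions along x
u = (X - t) / Z_true                 # observed projections

Z = 1.0                              # deliberately poor initial guess
for _ in range(10):
    r = (X - t) / Z - u              # residuals: predicted minus observed
    J = -(X - t) / Z**2              # Jacobian d(residual)/dZ
    # Gauss-Newton step from the normal equations (J^T J) dZ = -J^T r.
    delta = -np.sum(J * r) / np.sum(J * J)
    Z += delta

print(Z)                             # converges to 5.0
```

A real back end differs in scale rather than in kind: thousands of pose and landmark variables, robust loss functions to suppress outliers, and sparse matrix factorizations to keep each iteration tractable.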
The front-end tracking thread mainly involves feature extraction and matching (or direct image alignment), camera pose estimation, and triangulation of new map points.
The back-end optimization thread involves nonlinear least-squares optimization, which belongs to numerical optimization. The loop-closure thread involves place recognition, which is essentially an image-retrieval problem. For VIO, filtering algorithms and state estimation are also involved.

Breaking SLAM down this way shows that the techniques involved are relatively traditional. Unlike the currently popular "black box" models of deep learning, each stage of SLAM is essentially a white box that can be explained very clearly. However, a SLAM system is not a simple stack of these algorithms but a piece of systems engineering full of trade-offs. If you only run open-source programs, you will have no core competitiveness; whether you are building products or doing academic research, you need to be familiar with the underlying techniques to be creative.

Future development trends of SLAM

The development of vSLAM looks fairly orderly: each stage is improved bit by bit on top of prior work while continually absorbing the latest results from neighboring fields. In the short term it will surely keep improving within the existing framework. For the long-term trend, IEEE TRO 2016 carries the survey "Past, present, and future of SLAM: towards the robust-perception age", in which several prestigious scholars summarize where SLAM is heading. Here I only offer some personal thoughts on the points that interest me.

The emergence of new sensors will keep injecting vitality into SLAM. If high-quality raw information can be obtained directly, the computational pressure on SLAM drops greatly. For example, low-power, high-frame-rate event cameras (also known as dynamic vision sensors, DVS) have in recent years begun to be used in SLAM. If the cost of such sensors comes down, they could change SLAM's technical landscape considerably.
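Returning to the loop-closure thread mentioned above: treating place recognition as image retrieval is commonly done with a bag of visual words, where each image is summarized as a histogram over a visual vocabulary and past frames are ranked by histogram similarity. A minimal sketch with a hypothetical 5-word vocabulary and made-up histograms:

```python
import numpy as np

def normalize(h):
    """L2-normalize a word histogram so the dot product is cosine similarity."""
    h = np.asarray(h, dtype=float)
    return h / np.linalg.norm(h)

# Word histograms of past keyframes (counts are invented for illustration).
database = {
    "frame_010": [9, 1, 0, 4, 2],
    "frame_050": [0, 7, 6, 1, 0],
    "frame_090": [2, 0, 8, 1, 6],
}

# The current frame sees nearly the same distribution of visual words
# as frame_010, so it should be flagged as a loop-closure candidate.
query = normalize([9, 2, 0, 4, 1])

scores = {name: float(normalize(hist) @ query) for name, hist in database.items()}
best = max(scores, key=scores.get)
print(best)          # frame_010
```

Real systems (e.g. the DBoW family used by ORB-SLAM) add a learned vocabulary tree, TF-IDF weighting, and geometric verification of the candidate before accepting the loop, but the retrieval core is this histogram comparison.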
Since deep learning has swept so many fields, many researchers have tried to rebuild the SLAM pipeline with end-to-end deep learning, and some work has replaced individual SLAM components with learned ones. So far these methods have not shown an overwhelming advantage, and traditional geometric methods remain mainstream. Under the deep-learning wave, the components of SLAM should gradually absorb its results, improving accuracy and robustness; perhaps in the future some components will be replaced by deep learning wholesale, forming a new framework.

SLAM originally cared only about the geometric information of the environment; in the future it should integrate more semantic information. With the help of deep learning, object detection and semantic segmentation are advancing rapidly, and rich semantic information can be extracted from images. Such semantics can help infer geometry: the known size of a recognized object, for example, is an important geometric clue.