SLAM (Simultaneous Localization and Mapping) is widely regarded as a cutting-edge direction in visual spatial-positioning technology. It addresses the problem of a robot localizing itself and building a map of its surroundings while moving through an unknown environment. In this piece, Zhao Ji, a senior researcher at Yuemian Technology, starts from SLAM and walks through the technology in depth.

Zhao Ji is a senior researcher at Yuemian Technology. He received his Ph.D. from Huazhong University of Science and Technology in 2012 and was a postdoctoral fellow at the Robotics Institute of CMU from 2012 to 2014. At Samsung Research he worked on depth cameras, SLAM, and human-computer interaction; he currently focuses on research and development of spatial perception technology.

Technology is advancing quickly, and improving the user experience in AR/VR, robotics, drones, and driverless cars still requires more cutting-edge technologies; SLAM is one of them. Someone once drew an analogy: a driverless car or a robot without SLAM is like a mobile phone cut off from Wi-Fi and mobile data. SLAM solves two problems at once, locating the camera in space and building a map of the environment, and it can already be seen in several of today's popular startup directions.
SLAM technology has been developing for more than 30 years and touches many technical fields. Because it comprises many steps, each of which can be implemented with different algorithms, SLAM remains a hot research direction in both robotics and computer vision.

A brief analysis of SLAM technology

SLAM stands for Simultaneous Localization and Mapping. It tries to answer the following question: as a robot moves through an unknown environment, how can it determine its own trajectory from observations of the environment while simultaneously building a map of that environment? SLAM technology is the sum of the techniques involved in achieving this goal.

SLAM covers a wide range of approaches, and there are many ways to classify it: by sensor, by application scenario, or by core algorithm. By sensor, it can be divided into 2D/3D SLAM based on lidar, RGBD SLAM based on depth cameras, visual SLAM based on visual sensors (hereafter vSLAM), and visual-inertial odometry based on a visual sensor plus an inertial measurement unit (hereafter VIO).

2D SLAM based on lidar is relatively mature. As early as 2005, Sebastian Thrun et al.'s classic book "Probabilistic Robotics" studied and summarized 2D SLAM thoroughly and essentially fixed the framework of lidar SLAM. The commonly used grid-mapping method has a history of more than 10 years. In 2016, Google open-sourced the lidar SLAM system Cartographer, which can fuse IMU information and handles 2D and 3D SLAM in a unified way. 2D SLAM has already been applied successfully in robot vacuums.
RGBD SLAM based on depth cameras has also developed rapidly in recent years. Since the launch of Microsoft's Kinect, a wave of RGBD SLAM research has been set off, and in just a few years several important algorithms have emerged, such as KinectFusion, Kintinuous, Voxel Hashing, and DynamicFusion. Microsoft's HoloLens very likely integrates RGBD SLAM, which achieves very good results wherever the depth sensor can work.

Visual sensors include monocular, binocular (stereo), and fisheye cameras. Because they are cheap and work both indoors and outdoors, vSLAM is a research hotspot. Early vSLAM systems such as MonoSLAM were largely a continuation of the filtering methods used in robotics; today, optimization methods from computer vision are used more often, specifically bundle adjustment from structure-from-motion. Depending on how visual features are used, vSLAM can be divided into feature-based methods and direct methods. Representative vSLAM algorithms include ORB-SLAM, SVO, and DSO.

Visual sensors cannot work in textureless areas. An inertial measurement unit (IMU) measures angular velocity and acceleration with its built-in gyroscope and accelerometer, from which the camera's pose can be inferred, but the inferred pose accumulates error over time. Visual sensors and IMUs are highly complementary, so VIO, which fuses the measurements of the two, is also a research hotspot. By fusion method, VIO can be divided into filtering-based and optimization-based approaches; representative work includes EKF-based filters, MSCKF, IMU preintegration, and OKVIS. Google's Tango tablet achieved good VIO. In general, compared with SLAM based on lidar or depth cameras, vSLAM and VIO are not yet mature and are harder to make robust in practice; they usually need to be fused with other sensors or restricted to controlled environments.
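The cumulative error of IMU dead reckoning mentioned above is easy to see numerically: integrating a small constant accelerometer bias once gives a velocity error that grows linearly in time, and integrating twice gives a position error that grows quadratically. A minimal 1-D sketch (all numbers are made up for illustration):

```python
import numpy as np

dt = 0.01                     # assume a 100 Hz IMU
bias = 0.05                   # assumed constant accelerometer bias, m/s^2
true_acc = np.zeros(1000)     # the sensor is actually at rest for 10 s
measured = true_acc + bias    # biased measurements

vel = np.cumsum(measured) * dt   # first integration: velocity error grows linearly
pos = np.cumsum(vel) * dt        # second integration: position error grows quadratically

# After t = 10 s the drift is roughly 0.5 * bias * t^2 = 2.5 m,
# even though the sensor never moved.
print(pos[-1])
```

This quadratic drift is exactly why pure IMU integration cannot stand alone and why fusing it with drift-free (but slower, texture-dependent) visual measurements is so attractive.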
Why is visual SLAM difficult?

We can make a qualitative analysis from the sensors' measurements. A lidar or RGBD camera directly obtains a point cloud of the environment: each point tells us that there is an obstacle point in a certain direction at a certain distance. A visual sensor obtains a grayscale or color image: a pixel tells us only that there is an obstacle point in a certain direction, plus the local appearance around it, but not its distance. To compute the distance, the camera must be moved to another position, the point observed again, and the distance inferred by triangulation.

The principle is clear, but practice is harder. First, you need to find corresponding points between the two images, which involves feature extraction and matching, or matching between quasi-dense points. Even today, no feature extraction and matching algorithm fully satisfies vSLAM in both quality and speed. For common feature extractors, quality can roughly be ranked SIFT > SURF > ORB > FAST, and efficiency FAST > ORB > SURF > SIFT (the left side of each ">" is better; quality covers matching accuracy, the number of feature points, their spatial distribution, and so on). To strike a compromise between quality and efficiency, FAST or ORB is usually chosen, and the better-performing SIFT and SURF have to be abandoned. Second, the relationship between the image coordinates and the spatial coordinates of matched points is nonlinear: 2D-2D correspondences satisfy the epipolar constraint, and 2D-3D correspondences satisfy the PnP constraint.
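The triangulation step described above can be sketched with the standard linear (DLT) method: given two projection matrices and one matched pixel in each view, stack the projection constraints and take the SVD null vector. This is a minimal illustration, not a production routine; the identity intrinsics and the 1 m baseline are assumptions for the example.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: pixel coordinates (u, v)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Two views of the same point: an identity camera and a camera shifted
# 1 m along x (the baseline). Intrinsics are identity for simplicity.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])   # a point 4 m in front of the cameras
x1 = P1 @ np.append(X_true, 1.0)
x1 = x1[:2] / x1[2]                  # projection into view 1
x2 = P2 @ np.append(X_true, 1.0)
x2 = x2[:2] / x2[2]                  # projection into view 2

print(triangulate(P1, P2, x1, x2))   # recovers approximately [0.5, 0.2, 4.0]
```

Note that with a single view (one camera) the depth is unobservable; only the second, displaced observation makes the linear system well-posed, which is exactly the point made in the text.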
These matches are numerous: two frames typically yield dozens to hundreds of them. They introduce many constraints that entangle the variables to be estimated. To obtain a good estimate, one usually sets up an optimization problem and optimizes several variables jointly. In theory this is just a nonlinear least-squares problem, but it is not easy to solve well: the constraints are nonlinear and numerous, the measurements contain noise and outliers, and computation time must stay within budget. Keyframe techniques are now widely used, along with many methods for controlling the problem size and preserving its sparsity. In the factor-graph view of this problem, nodes represent the variables to be optimized (camera poses, spatial coordinates of feature points) and edges represent the constraints between them (epipolar geometry, PnP, etc.).

The two difficulties of vSLAM analyzed above make front-end feature tracking hard and back-end optimization hard, respectively. Building an efficient and robust vSLAM system remains very challenging. On efficiency: SLAM must run in real time; if it cannot, it is not SLAM. Without the real-time requirement, structure-from-motion recovers structure with better quality. On robustness: a fragile system leads to a poor user experience and limited functionality.

vSLAM core algorithms

The preparatory stage includes sensor selection and various calibrations. Since PTAM, the framework of visual SLAM has been essentially fixed: it usually comprises three threads, namely front-end tracking, back-end mapping and optimization, and loop closure.
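The nonlinear least-squares optimization mentioned above can be illustrated on a toy problem: estimating the unknown depth Z of a point with known lateral position X from pinhole observations u_i = (X - t_i) / Z taken at known camera positions t_i. This is a deliberately tiny, hypothetical stand-in for bundle adjustment, solved with a few Gauss-Newton steps:

```python
import numpy as np

# Known quantities (synthetic, noiseless for clarity).
X = 2.0                              # lateral position of the point
Z_true = 5.0                         # depth we want to recover
t = np.array([0.0, 0.5, 1.0, 1.5])   # known camera positions along x
u = (X - t) / Z_true                 # observed projections

Z = 1.0                              # deliberately poor initial guess
for _ in range(10):
    r = (X - t) / Z - u              # residuals: predicted minus observed
    J = -(X - t) / Z**2              # Jacobian d(residual)/dZ
    # Gauss-Newton step from the normal equations (J^T J) dZ = -J^T r.
    delta = -np.sum(J * r) / np.sum(J * J)
    Z += delta

print(Z)                             # converges to 5.0
```

A real back end differs in scale rather than in kind: thousands of pose and landmark variables, robust loss functions to suppress outliers, and sparse matrix factorizations to keep each iteration tractable.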
The front-end tracking thread mainly involves feature extraction and matching (or direct image alignment), camera pose estimation, and triangulation of new map points.
The back-end optimization thread involves nonlinear least-squares optimization, which belongs to numerical optimization. The loop-closure thread involves place recognition, which is essentially an image-retrieval problem. For VIO, filtering algorithms and state estimation are also involved.

Breaking SLAM down this way shows that the techniques involved are relatively traditional. Unlike the currently popular "black box" models of deep learning, each stage of SLAM is essentially a white box that can be explained very clearly. However, a SLAM system is not a simple stack of these algorithms but a piece of systems engineering full of trade-offs. If you only run open-source programs, you will have no core competitiveness; whether you are building products or doing academic research, you need to be familiar with the underlying techniques to be creative.

Future development trends of SLAM

The development of vSLAM looks fairly orderly: each stage is improved bit by bit on top of prior work while continually absorbing the latest results from neighboring fields. In the short term it will surely keep improving within the existing framework. For the long-term trend, IEEE TRO 2016 carries the survey "Past, present, and future of SLAM: towards the robust-perception age", in which several prestigious scholars summarize where SLAM is heading. Here I only offer some personal thoughts on the points that interest me.

The emergence of new sensors will keep injecting vitality into SLAM. If high-quality raw information can be obtained directly, the computational pressure on SLAM drops greatly. For example, low-power, high-frame-rate event cameras (also known as dynamic vision sensors, DVS) have in recent years begun to be used in SLAM. If the cost of such sensors comes down, they could change SLAM's technical landscape considerably.
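Returning to the loop-closure thread mentioned above: treating place recognition as image retrieval is commonly done with a bag of visual words, where each image is summarized as a histogram over a visual vocabulary and past frames are ranked by histogram similarity. A minimal sketch with a hypothetical 5-word vocabulary and made-up histograms:

```python
import numpy as np

def normalize(h):
    """L2-normalize a word histogram so the dot product is cosine similarity."""
    h = np.asarray(h, dtype=float)
    return h / np.linalg.norm(h)

# Word histograms of past keyframes (counts are invented for illustration).
database = {
    "frame_010": [9, 1, 0, 4, 2],
    "frame_050": [0, 7, 6, 1, 0],
    "frame_090": [2, 0, 8, 1, 6],
}

# The current frame sees nearly the same distribution of visual words
# as frame_010, so it should be flagged as a loop-closure candidate.
query = normalize([9, 2, 0, 4, 1])

scores = {name: float(normalize(hist) @ query) for name, hist in database.items()}
best = max(scores, key=scores.get)
print(best)          # frame_010
```

Real systems (e.g. the DBoW family used by ORB-SLAM) add a learned vocabulary tree, TF-IDF weighting, and geometric verification of the candidate before accepting the loop, but the retrieval core is this histogram comparison.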
Since deep learning has swept so many fields, many researchers have tried to rebuild the SLAM pipeline with end-to-end deep learning, and some work has replaced individual SLAM components with learned ones. So far these methods have not shown an overwhelming advantage, and traditional geometric methods remain mainstream. Under the deep-learning wave, the components of SLAM should gradually absorb its results, improving accuracy and robustness; perhaps in the future some components will be replaced by deep learning wholesale, forming a new framework.

SLAM originally cared only about the geometric information of the environment; in the future it should integrate more semantic information. With the help of deep learning, object detection and semantic segmentation are advancing rapidly, and rich semantic information can be extracted from images. Such semantics can help infer geometry: the known size of a recognized object, for example, is an important geometric clue.