Andreas Geiger, CVPR PAMI Young Researcher Award Winner, on Computer Vision in Autonomous Driving
Professor Andreas Geiger, head of the Autonomous Vision Group at Eberhard Karls Universität Tübingen and the Max Planck Institute (MPI) in Germany, won the PAMI Young Researcher Award at CVPR 2018 in June. The award recognizes researchers who are within seven years of receiving their doctorate and show great potential in their early research.
Geiger's research focuses mainly on 3D visual understanding, segmentation, reconstruction, and material and motion estimation for autonomous driving systems. He led the construction of KITTI, a well-known dataset in the autonomous driving field and a benchmark suite for many of its computer vision tasks. KITTI is currently the largest public computer vision dataset for autonomous driving.
In early 2018, Geiger became chief scientist of Surfing Tech, a company that provides multi-sensor data solutions to autonomous driving companies worldwide. In July, Syneced interviewed Geiger at Surfing Tech about the characteristics of computer vision tasks in autonomous driving, the research frontiers, and the latest progress on the KITTI dataset.
Syneced: Which modules make up an autonomous driving system? How do they depend on one another?
An autonomous driving system usually has a classic, modular pipeline.
First comes the perception stack, which feeds information from maps, three-dimensional sensors and two-dimensional sensors into the "world model". The world model aggregates this information into a map, tracks where each object is at every moment relative to the road surface, lane markings and so on, and predicts which paths will be available at the next moment. Next, a planning module makes decisions. Decision-making is hierarchical. Coarse-grained planning decides how to get from point A to point B, much like a GPS navigation system. Fine-grained planning covers many further decisions, such as which lane to take, whether to briefly use the oncoming lane to overtake, and what speed to set. Finally, the control module drives all the controllers, from high-level ones such as ESP (electronic stability program) down to the most basic ones that accelerate and brake each wheel.
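As a rough editorial illustration of the perception, world model, planning and control chain described above, the sketch below wires toy versions of the stages together. Every class, function and threshold here is hypothetical and chosen only for illustration; it is not taken from Geiger's work or from any real driving stack.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# All names and numbers are hypothetical; they only mirror the stages
# named in the interview (perception -> world model -> planning -> control).

@dataclass
class WorldModel:
    ego_position: Tuple[float, float]  # vehicle position in map coordinates (m)
    obstacles: List[Tuple[float, float]] = field(default_factory=list)

def perceive(map_hint, lidar_points, ego_position) -> WorldModel:
    """Perception: fuse inputs into a world model.
    In this toy version the map is ignored and LiDAR points become obstacles."""
    return WorldModel(ego_position=ego_position, obstacles=list(lidar_points))

def plan(world: WorldModel, goal: Tuple[float, float], safe_distance: float = 10.0) -> dict:
    """Planning: keep the coarse goal and make one fine-grained choice (target speed)."""
    ex, ey = world.ego_position
    nearest = min(
        (((ox - ex) ** 2 + (oy - ey) ** 2) ** 0.5 for ox, oy in world.obstacles),
        default=float("inf"),
    )
    # Stop if anything is closer than the assumed safety distance.
    target_speed = 0.0 if nearest < safe_distance else 13.9  # m/s, about 50 km/h
    return {"goal": goal, "target_speed": target_speed}

def control(decision: dict, current_speed: float, gain: float = 0.5) -> float:
    """Control: proportional controller returning an acceleration command (m/s^2)."""
    return gain * (decision["target_speed"] - current_speed)

world = perceive(map_hint=None, lidar_points=[(30.0, 0.0)], ego_position=(0.0, 0.0))
decision = plan(world, goal=(100.0, 0.0))
accel = control(decision, current_speed=10.0)
print(decision["target_speed"], accel)
```

The point of the sketch is the dependency order: planning only sees the world model, and control only sees the planner's decision.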
Syneced: Which computer vision tasks must be solved first for an autonomous driving system to make the right decisions?
First, vehicle localization: measuring the vehicle's motion and locating it in the map. This is done by a visual odometry system and a localization system. The difference is that visual odometry estimates the vehicle's motion relative to the previous time step, while localization estimates the vehicle's global position in the map. Localization can be accurate to the centimeter level. In some maps, the positions of fixed objects (e.g. poles) are already known, so the vehicle's distance to them can be determined. With this information, vehicles can already do fairly good path planning.
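To make the distinction between relative and global estimates concrete, here is a minimal 2D sketch that chains relative motions, as a visual odometry system would report them, into a global pose. The motion values are made up for illustration. Because each step builds on the previous one, small errors accumulate over time, which is why a separate localization step against the map is needed.

```python
import math

def compose(pose, delta):
    """Apply a relative motion (dx, dy, dtheta), expressed in the vehicle frame,
    to a global pose (x, y, theta) and return the new global pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + dx * math.cos(theta) - dy * math.sin(theta),
        y + dx * math.sin(theta) + dy * math.cos(theta),
        theta + dtheta,
    )

# Hypothetical odometry output: 1 m forward per step,
# with a 90 degree left turn at the end of the second step.
relative_motions = [(1.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]

pose = (0.0, 0.0, 0.0)  # start at the map origin, facing along +x
for delta in relative_motions:
    pose = compose(pose, delta)
print(pose)
```

After the three steps the vehicle sits at roughly (2, 1) and faces along +y.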
Then comes 3D reconstruction, usually over a range of 50-80 meters; the exact requirement depends on driving speed. Most state-of-the-art autonomous driving systems use LiDAR for 3D reconstruction. However, a small number of teams try to recover three-dimensional information directly from images. Because image data is noisier, image-based reconstruction is the more challenging task.
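One way to see why the required range depends on speed is a simple stopping-distance estimate. The formula below (reaction distance plus braking distance) is standard kinematics, but the reaction time and deceleration are assumed values chosen for illustration; they do not come from the interview.

```python
def stopping_distance(speed_kmh, reaction_time_s=0.5, decel_ms2=6.0):
    """Distance (m) travelled before standstill: v * t_react + v^2 / (2 * a).
    reaction_time_s and decel_ms2 are assumed illustrative values."""
    v = speed_kmh / 3.6  # convert km/h to m/s
    return v * reaction_time_s + v ** 2 / (2.0 * decel_ms2)

for speed in (50, 80, 100):
    print(speed, "km/h ->", round(stopping_distance(speed), 1), "m")
```

Under these assumptions the distance grows faster than linearly with speed, so a faster vehicle needs reliable reconstruction at a correspondingly longer range.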
Beyond reconstruction, you also need a good understanding of what is happening in front of the vehicle. You therefore need to detect objects and classify them, that is, understand what they are. Detection and classification help predict the objects' future trajectories. There are many ways to detect and classify objects. The most common is to draw a bounding box around each object. But autonomous driving requires motion planning in a three-dimensional physical world, so you need at least a three-dimensional bounding box.
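A three-dimensional bounding box is commonly parameterized by a center, a size and a heading angle. The sketch below uses that parameterization (an assumption for illustration, not a format prescribed in the interview) and computes the eight corner points.

```python
import math
from itertools import product

def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box as (x, y, z) tuples.
    center: (x, y, z) of the box center in meters
    size:   (length, width, height) in meters
    yaw:    rotation about the vertical z axis in radians"""
    cx, cy, cz = center
    length, width, height = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        lx, ly, lz = sx * length, sy * width, sz * height  # corner in the box frame
        # Rotate about z, then translate to the box center.
        corners.append((cx + c * lx - s * ly, cy + s * lx + c * ly, cz + lz))
    return corners

# Hypothetical car: 4 m long, 2 m wide, 1.5 m high, 10 m ahead, aligned with the x axis.
car = box3d_corners(center=(10.0, 2.0, 0.75), size=(4.0, 2.0, 1.5), yaw=0.0)
print(len(car), min(p[0] for p in car), max(p[0] for p in car))
```

A planner can test candidate paths against these corners, which is not possible with a box drawn only in the image plane.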
Instance segmentation and semantic segmentation are more precise. When an object is concave, or is something to drive through such as a tunnel, a bounding box is clearly not enough. Instance segmentation assigns all pixels belonging to each instance of certain target categories to that instance. It is usually performed on two-dimensional images, but three-dimensional versions also exist; three-dimensional instance segmentation is essentially equivalent to object reconstruction. Semantic segmentation assigns a semantic label to every pixel in the image and does not distinguish between different instances of the same category. Panoptic segmentation can be regarded as a combination of the two: in addition to instances, it also labels categories that have no instances but form a whole, such as sky and vegetation. The sky cannot be enclosed in a bounding box. Vegetation should normally be avoided, but the system also needs to know that driving onto a lawn in an emergency is not a big problem compared with hitting a tree or a pedestrian. Semantic information is therefore necessary.
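The relationship between the three kinds of segmentation can be shown on a toy 3x4 image. The label values and the merge rule below are illustrative assumptions: each pixel receives a (class, instance id) pair, and pixels of classes without instances, such as sky and vegetation, get instance id 0.

```python
# Hypothetical toy labels.
SKY, VEGETATION, CAR = 0, 1, 2
THING_CLASSES = {CAR}  # countable classes; sky and vegetation have no instances

# Semantic map: one class label per pixel.
semantic = [
    [SKY,        SKY, SKY, SKY],
    [VEGETATION, CAR, CAR, CAR],
    [VEGETATION, CAR, CAR, CAR],
]
# Instance map: 0 means "no instance", otherwise an instance id.
instance = [
    [0, 0, 0, 0],
    [0, 1, 2, 2],
    [0, 1, 2, 2],
]

def to_panoptic(semantic, instance):
    """Give every pixel a (class, instance_id) pair; classes without instances get id 0."""
    return [
        [(cls, inst if cls in THING_CLASSES else 0) for cls, inst in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]

panoptic = to_panoptic(semantic, instance)
segments = {pixel for row in panoptic for pixel in row}
print(sorted(segments))  # [(0, 0), (1, 0), (2, 1), (2, 2)]
```

The semantic map alone would merge the two cars into one region; the panoptic result keeps them apart while still labeling sky and vegetation.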
Next is motion estimation: given the previous frame or several frames, estimate where each point in the field of view, or each object, will be in the next frame. The motion of some objects, such as vehicles, is relatively easy to predict, so a motion model can predict it with high accuracy. Other objects, such as pedestrians, can change their trajectories very suddenly, which makes a motion model harder to build. Even so, short-horizon motion prediction (2-3 seconds) plays an important role in decision-making in crowded scenes with many dynamic objects.
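The simplest motion model of this kind is constant velocity: take the velocity implied by the last two observations and extrapolate it. The sketch below is a minimal illustration with made-up observations, not a method attributed to Geiger. It works reasonably for vehicles over short horizons and fails exactly where the text says prediction is hard, for example when a pedestrian suddenly changes direction.

```python
def predict_constant_velocity(track, horizon_s, step_s=0.5):
    """Extrapolate the last observed velocity.
    track: list of (t, x, y) observations ordered by time, at least two entries.
    Returns predicted (t, x, y) tuples up to horizon_s seconds ahead."""
    (t0, x0, y0), (t1, x1, y1) = track[-2], track[-1]
    dt = t1 - t0
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    steps = int(round(horizon_s / step_s))
    return [
        (t1 + k * step_s, x1 + vx * k * step_s, y1 + vy * k * step_s)
        for k in range(1, steps + 1)
    ]

# Hypothetical track: a vehicle moving 1 m along x in 0.1 s (10 m/s).
track = [(0.0, 0.0, 0.0), (0.1, 1.0, 0.0)]
for t, x, y in predict_constant_velocity(track, horizon_s=2.0):
    print(round(t, 2), round(x, 2), round(y, 2))
```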
The tasks above are all described separately, but in practice the systems that gather this information do not operate independently. Contextual reasoning can therefore help make predictions more accurate. For example, a group of pedestrians usually waits at a red light and crosses the road together. When one car tries to merge, the other car brakes and gives way. With such external information and prior knowledge as constraints, complex scenes become easier to understand.
Finally, one area that I think is very important but has not attracted much attention is reasoning under uncertainty. Data obtained by human senses or vehicle sensors inevitably contains uncertainty. How to assess that uncertainty accurately while balancing "minimizing risk" against "completing the task" is therefore an important topic. Ideally, all of the tasks above (detection, segmentation, reconstruction and localization) should be carried out under uncertainty constraints, and the system should know what mistakes it might make before it acts.
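One textbook way to trade off "minimizing risk" against "completing the task" is to pick the action with the lowest expected cost given the detector's confidence. The sketch below illustrates that idea only; the cost values and action names are invented for the example and are not from the interview.

```python
def choose_action(p_obstacle, costs):
    """Pick the action with the lowest expected cost.
    p_obstacle: estimated probability that a detected obstacle is real.
    costs[action] = (cost if the obstacle is real, cost if it is a false alarm)."""
    expected = {
        action: p_obstacle * c_real + (1.0 - p_obstacle) * c_false
        for action, (c_real, c_false) in costs.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical costs: a collision is far worse than an unnecessary brake.
COSTS = {"brake": (1.0, 1.0), "continue": (100.0, 0.0)}

for p in (0.005, 0.05, 0.5):
    action, expected = choose_action(p, COSTS)
    print(p, action, expected)
```

With these numbers the vehicle keeps driving only when the obstacle probability is very low. The decision is only as good as the probability fed into it, which is why accurate uncertainty estimates matter.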
Syneced: How can the computer vision tasks related to autonomous driving be classified? What are the criteria?
Classifying by input is common practice. By input source, data can be divided into data from LiDAR, cameras, radar and other instruments in the car. Tasks can also be classified by input representation: the sparse point cloud from LiDAR and the dense two-dimensional image from a camera are two different representations, and the algorithms used for them differ as well. Dimensionality is another criterion. Algorithms for three-dimensional input are usually more complex, because without special techniques three-dimensional input can quickly exhaust memory.
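A quick back-of-the-envelope calculation shows why three-dimensional input is memory-hungry. The sketch compares a dense voxel grid with a sparse point list; the scene size, resolutions and point count are assumed values for illustration.

```python
def dense_grid_bytes(extent_m, resolution_m, bytes_per_cell=4):
    """Memory for a dense cubic voxel grid with one value per cell."""
    cells_per_axis = round(extent_m / resolution_m)
    return cells_per_axis ** 3 * bytes_per_cell

def point_cloud_bytes(num_points, bytes_per_coord=4):
    """Memory for a sparse point list storing x, y, z per point."""
    return num_points * 3 * bytes_per_coord

# Hypothetical scene: a 100 m cube, and a LiDAR sweep of 100,000 points.
print(dense_grid_bytes(100, 0.10) / 1e9, "GB at 10 cm")
print(dense_grid_bytes(100, 0.05) / 1e9, "GB at 5 cm")
print(point_cloud_bytes(100_000) / 1e6, "MB as a point list")
```

Halving the cell size multiplies the dense grid's memory by eight, while the point list stays in the megabyte range. This is the cubic growth that sparse representations are meant to avoid.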
Another classification is based on cues, which can be divided into semantic cues and geometric cues. Geometric cues use multiple images to obtain depth through feature matching and triangulation. But because the error of this estimate grows quadratically with distance, it has severe limitations. In this sense, the human visual system is not well suited to driving, because it is designed to work within arm's reach. When driving, humans compensate with semantic cues: even from a single image, which in theory contains no distance information, humans can still estimate the relative distances of objects using a large amount of prior knowledge. In short, an autonomous driving system can obtain three-dimensional information either by installing multiple cameras, or by installing a single camera and relying on strong priors about what it will see. Ideally, we want to combine the two.
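The quadratic growth of the error follows from the standard stereo relation Z = f * b / d, where f is the focal length in pixels, b the baseline between the cameras and d the disparity. A small disparity error dd then produces a depth error of about Z^2 / (f * b) * dd. The sketch below evaluates this first-order approximation; the focal length and baseline are assumed values, not parameters of any particular sensor rig.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Stereo triangulation: Z = f * b / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * b) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Hypothetical rig: 700 px focal length, 0.5 m baseline, 1 px matching error.
FOCAL_PX, BASELINE_M = 700.0, 0.5
for depth in (5, 10, 20, 40, 80):
    print(depth, "m ->", round(depth_error(FOCAL_PX, BASELINE_M, depth), 2), "m error")
```

Doubling the distance quadruples the error, so under these assumptions an error of a few centimeters at 5 m becomes many meters at 80 m. A shorter baseline, such as the distance between human eyes, makes this worse.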
Yet another way is to classify by whether and how objects move. The first division is between recognizing the static parts of a scene and recognizing moving objects. For static scenes, there are standard reconstruction algorithms built on the assumption that everything is static. In practice, however, we need to reconstruct the scene from multiple images taken at different times, which requires special algorithms to handle the moving objects in the scene. Moving objects can be further divided into rigid and non-rigid objects.