A Practitioner’s Tour of CVPR 2019: Workshops on Autonomous Driving

Updated: Jul 28, 2019

Editor's Note: This article was originally posted on Medium by Patrick Liu, a senior staff software engineer and manager at a fast-growing autonomous driving startup. Robin.ly is authorized by the author to repost it.



This year is the second year I have attended CVPR. It is always exciting to be exposed to such a diverse collection of recent advances in AI in such a short period of time. Coming from a physics background, where it is hard to follow what others are doing even in a very similar subfield, I find it amazing that you can communicate intelligently with almost anyone at the conference, despite attendees coming from very different backgrounds. Since I now work at a fast-growing autonomous driving startup, this post will inevitably focus on topics relevant to that field.


CVPR 2019 at Long Beach, California, photo courtesy of Patrick Liu

Workshops and Tutorials

There are always multiple workshops and tutorials at CVPR, so it would be impossible to attend them all; likewise, I could only sample so many of the 5,000+ papers presented at this year's conference. I will only review the talks I found most interesting. If you want the more detailed notes I took during the meeting, please refer to my github repo.

Vision for All Seasons: Bad Weather and Nighttime

This workshop covered the challenges of, and possible solutions for, perception under adverse weather and illumination conditions, especially for autonomous driving tasks.


As one of the speakers put it, vision for four seasons is essentially a domain adaptation problem. Here are the three main ways to solve it:

  • Better hardware

  • Better data

  • Better algorithms

Better hardware focuses on improving the current sensing modalities, such as camera and lidar, for perception under adverse conditions. Normal cameras and lidars are optimized for normal driving conditions (e.g., sunny and clear weather), and their performance drops significantly under adverse weather and illumination (e.g., fog, rain, snow, ice, low light, nighttime, glare and shadows). Beyond incremental improvements to sensors, such as resolution or noise performance in low light, there have been efforts to develop new types of sensors. Among them are gated imaging from Daimler and the smart headlight project from CMU, both aiming to improve imaging through controlled illumination.

Gated Imaging from http://www.brightwayvision.com/technology/#how_it_works

Smart headlights, from Prof. Srinivasa Narasimhan: https://www.cs.cmu.edu/smartheadlight/

Better data means we need a more diverse and balanced dataset that better represents real driving scenarios. The core issue in autonomous driving is the long tail of corner cases, and corner cases imply two things:

  • First, for known corner cases, we have extreme data imbalance, which poses severe challenges to machine learning and deep learning algorithms. We have to properly massage our training data to train DL models efficiently, and curate a properly proportioned test set to accurately and realistically measure the performance of the trained model (see the sampling sketch after this list).

  • Second, for unknown corner cases, we should be able to identify them and handle the unknown cases gracefully.
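To make the first point concrete, here is a minimal sketch of one standard way to rebalance a long-tailed training set: inverse-frequency weighted sampling. It is a generic illustration (the helper name and the PyTorch-based setup are my own choices), not any particular company's pipeline.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    """Hypothetical helper: oversample rare classes with inverse-frequency
    weights so each minibatch sees the long tail more often."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)            # examples per class
    weights = 1.0 / class_counts[labels]          # rare classes -> larger weights
    return WeightedRandomSampler(weights.tolist(),
                                 num_samples=len(labels),
                                 replacement=True)

# usage: DataLoader(dataset, batch_size=32, sampler=make_balanced_sampler(dataset_labels))
```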


Large-Scale Long-Tailed Recognition in an Open World (https://arxiv.org/abs/1904.05160)

Different companies take similar approaches to curating an “effective” training dataset. Most data pipelines involve mining rare cases (in particular, ones on which the algorithms perform worst).


One company’s data pipeline

Another company’s data pipeline

Better algorithms are the focus of most papers at CVPR. Most tackle the four-season vision problem through domain adaptation, either with GAN-powered style transfer or with domain-agnostic features.


Towards a canonical representation for robot vision under difficult conditions (by Horia Porav from U of Oxford)

by Alex Kendall from U of Cambridge and Wayve

My feeling is that academia focuses on better algorithms, while the grunt work of dataset collection and curation is underrated. Nowadays at CVPR and other top AI conferences, even if your paper is about a new dataset, you have to propose a novel method that beats all the SOTA benchmark methods on your newly proposed tasks. In my opinion, a well-written dataset paper should focus on describing the characteristics of the dataset, providing a user-friendly dev-kit, and proposing realistic, application-relevant metrics. If a dataset paper achieves these three things, it should be well received and accepted, although possibly on a separate track. (On that matter, I am really glad to see that the ImageNet paper received this year's PAMI Longuet-Higgins Prize, a retrospective award that recognizes a CVPR paper for enduring relevance and tremendous contributions over a 10-year period.)


The industry, however, favors the brute-force method of collecting more corner-case data, money and time permitting. Therefore, the engineering effort of collecting and curating data is at least as important as algorithm development.


Workshop on Autonomous Driving (Monday session)

The opening talk of the session, on performing 3D object detection without lidar sensors using pseudo-lidar and pseudo-lidar++, came from the group led by Prof. Kilian Weinberger at Cornell University (also the creators of DenseNet, Stochastic Depth, etc.).


3D object detection is a critical task for many areas of autonomous driving. Sensor fusion, prediction and behavior planning are all done in 3D space or bird's-eye-view (BEV) space. In many autonomous driving applications, the elevation information in 3D is not as important as the planar coordinates, and thus in this post we loosely equate BEV space, which is strictly speaking a 2D space, with 3D space. In a sense, 2D object detection in perspective space is not the ultimate goal for perception in autonomous driving. The ultimate goal is to obtain an oriented 3D bounding box around each object and predict its associated properties, i.e., 3D object detection.


The current state-of-the-art methods for 3D object detection are based either on lidar data (e.g., Point-RCNN from CVPR 2019, PIXOR from CVPR 2018) or on early fusion of lidar and camera data (e.g., AVOD from IROS 2018, MMF from CVPR 2019). The main problem with approaches utilizing lidar data is the high cost of lidar sensors. Although numerous startups claim to have low-cost options for lidar, the go-to options from Velodyne are still quite expensive, in the range of tens of thousands of dollars per device. Personally, I believe there are many years to go before lidar becomes as ubiquitous and cheap as RGB cameras are. Until then, monocular 3D object detection using only camera data is a very important and practical task.


Why can't we just ditch lidar data? There has been a huge gap between the performance of object detection algorithms based solely on camera images and those utilizing lidar data. This has mainly been attributed to the inaccuracy of image-based depth estimation (using monocular or stereo vision, or the latest trend of monocular depth estimation from video). This is particularly the case for faraway objects, as the depth estimation errors of vision-based methods scale quadratically with distance, while those of time-of-flight methods such as lidar scale only linearly with distance.
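As a quick sanity check of the quadratic scaling (standard stereo geometry, not a derivation from the talk): for a stereo rig with focal length f (in pixels), baseline b and disparity d, depth is recovered as Z = fb/d, so a fixed disparity error δd produces a depth error that grows with the square of the distance:

$$ Z = \frac{f\,b}{d} \quad\Rightarrow\quad \delta Z = \left|\frac{\partial Z}{\partial d}\right|\,\delta d = \frac{f\,b}{d^{2}}\,\delta d = \frac{Z^{2}}{f\,b}\,\delta d $$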


However, the authors of pseudo-lidar argue that the main reason for the huge gap lies in the data representation rather than in inaccurate depth estimation. Instead of doing 2D or 3D object detection directly in the perspective image space, they use an estimated dense depth map to project each pixel of the 2D perspective image into 3D space, forming a dense point cloud dubbed “pseudo-lidar” data.
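The back-projection itself is just pinhole-camera geometry. Below is a minimal sketch (my own, not the authors' code) that turns a dense depth map plus known camera intrinsics into a pseudo-lidar point cloud:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H x W, in meters) into a 3D point cloud
    in the camera frame -- the "pseudo-lidar" representation. fx, fy, cx, cy
    are the camera intrinsics, assumed known from calibration."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # right
    y = (v - cy) * z / fy                           # down
    # (H*W, 3) array of 3D points; any lidar-based detection pipeline
    # can consume this directly.
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```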



2D Perspective space and BEV space (from DOI: 10.1109/TITS.2015.2479925)

Once the perspective image has been converted to a pseudo-lidar point cloud, then all existing SOTA frameworks on lidar data can be leveraged for 3D object detection. This change of data representation alone almost doubles or in some categories triples the metrics for 3D object detection for camera image-based methods!


Admittedly, the idea of projecting 2D data to 3D based on depth for object detection is not new and has been explored in the literature over the last few years. To name a few, Frustum PointNet (CVPR 2018) first reprojects RGB-D data, collected both indoors and outdoors, into 3D space, then selects the points lying in the frustum subtended by the 2D bounding box from the perspective view for 3D object detection. Concurrently, MMF: multi-task multi-sensor fusion (CVPR 2019) proposes to jointly learn a depth completion task and project the image points into 3D space to become pseudo-lidar points. A more direct influence on pseudo-lidar is MLF: multi-level fusion based 3D object detection from monocular images (CVPR 2018), which likely inspired the pseudo-lidar paper, as the latter cites MLF heavily. It is a pity that MLF only uses the features obtained from the point cloud to enhance an image-centric object detection pipeline.


F-PointNet (CVPR 2018) projects RGBD data to 3D space

MMF (CVPR 2019) uses depth completion task to help convert image points to Pseudo-lidar

MLF (CVPR 2018) projects image points to 3D to form a point cloud, but only to enhance a 2D-centric pipeline

That said, pseudo-lidar is the first paper to clearly state the importance of data representation in monocular 3D object detection. The paper has an excellent section, “data representation matters”, on the reasoning behind this.


The central assumption of convolution is two-fold: (a) local neighborhoods in the image have meaning, and (b) all neighborhoods should be operated upon in a similar manner. Yet this assumption is imperfect for perspective images. Concretely, if two pixels close to each other in the perspective image straddle an object boundary, they can be very far apart in 3D space. If we perform a 2D convolution on the depth map, we see severe edge bleeding in 3D space, as shown below. In contrast, convolution in BEV space operates on neighborhoods whose points are physically close to each other. (An added bonus of doing object detection in BEV space is that distant and nearby objects have the same scale, largely eliminating the painstaking anchor crafting found in most 2D object detection pipelines.)


Convolution cannot be applied to depth map! (Fig. 3 from Pseudo-lidar)
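For illustration, here is a toy rasterization of a (pseudo-)lidar point cloud into a BEV occupancy grid, where convolution neighborhoods correspond to physically nearby points. The ranges and resolution are made-up values, and real pipelines typically add height and intensity channels:

```python
import numpy as np

def points_to_bev(points, fwd_range=(0.0, 70.0), lat_range=(-40.0, 40.0), res=0.1):
    """Rasterize an (N, 3) point cloud in the camera frame (x right, y down,
    z forward) into a bird's-eye-view occupancy grid."""
    fwd, lat = points[:, 2], points[:, 0]           # forward = z, lateral = x
    keep = (fwd >= fwd_range[0]) & (fwd < fwd_range[1]) & \
           (lat >= lat_range[0]) & (lat < lat_range[1])
    rows = ((fwd[keep] - fwd_range[0]) / res).astype(int)
    cols = ((lat[keep] - lat_range[0]) / res).astype(int)
    h = int((fwd_range[1] - fwd_range[0]) / res)
    w = int((lat_range[1] - lat_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)
    bev[rows, cols] = 1.0                           # simple occupancy channel
    return bev
```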

In the paper, the authors also note that their main contribution is the change of data representation for camera-based 3D object detection: switching to pseudo-lidar-based methods doubles or triples the SOTA for monocular 3D object detection, closing the gap with the gold standard of 3D object detection from real lidar measurements.


When I read this paper back in March 2019, I noted to myself that we could use a few-line lidar to improve the depth estimation, and lo and behold, three months later I found that the authors of pseudo-lidar had had the same idea and already implemented it!


In pseudo-lidar++, the authors improved the way depth estimation is done and, more importantly, simulated sparse 4-line and 2-line lidar measurements to further close the gap between monocular 3D object detection and lidar-based methods.


Using few depth measurements to correct the systematic bias of estimated depth from RGB image

The basic idea is that although depth estimation from monocular or even stereo images is inaccurate, the error is largely a systematic bias rather than random noise. With very few accurate measurements (one or a few per object), we can correct the depth of the entire object, thus improving the 3D object detection results.
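As a toy illustration of that bias-correction idea (a deliberate simplification of my own, not the graph-based correction actually used in pseudo-lidar++): shift all estimated depths on an object by the median offset between the estimate and the few accurate lidar hits that land on it.

```python
import numpy as np

def correct_depth_bias(est_depth, sparse_depth, obj_mask):
    """est_depth:    (H, W) depth estimated from images
    sparse_depth: (H, W) accurate depth (e.g. from a 4-line lidar), NaN where absent
    obj_mask:     (H, W) boolean mask of one object's pixels"""
    hits = obj_mask & ~np.isnan(sparse_depth)
    if not hits.any():
        return est_depth                            # no measurement on this object
    bias = np.median(est_depth[hits] - sparse_depth[hits])
    corrected = est_depth.copy()
    corrected[obj_mask] -= bias                     # remove the systematic offset
    return corrected
```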


Performance of Pseudo-lidar and Pseudo-lidar++ in 3D object detection

Now let's look at some quantitative results. Practically, even without any hardware assistance from few-line lidars, the performance (see the brown curves for pseudo-lidar) is already largely comparable to SOTA methods that rely on 64-beam lidar data. I really believe the ideas of pseudo-lidar and its ++ version will revolutionize the way people perform 3D object detection. Perhaps people really don't need lidars after all, once cameras can see well.


Alex Kendall from Cambridge and Wayve also gave an excellent talk on the interpretability and uncertainty of DL models. He and Yarin Gal are pioneers in applying Bayesian DL to autonomous driving. If you want to learn more about this topic, I highly recommend Alex's dissertation, as well as Yarin's dissertation and this presentation.


As mentioned above, driving data is exceptionally biased. How to turn the long-tail distribution into something closer to a normal distribution is a key task for autonomous driving. The long tail is not only a perception problem but also a prediction and control problem. For perception, we can use extensive engineering effort to alleviate the effect of corner cases. For control, there are existing methods such as DAgger (Dataset Aggregation) that deliberately collect more data by deploying the policy in off-course scenarios and asking the expert to provide feedback, as sketched below.
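For readers unfamiliar with DAgger, here is a minimal sketch of the loop (Ross et al., 2011); `env`, `expert_policy`, `rollout` and `train` are placeholders for your simulator, expert labeler, rollout routine and supervised trainer:

```python
def dagger(env, expert_policy, rollout, train, n_iters=10):
    """Minimal DAgger sketch: roll out the current policy so it visits its own
    mistakes, have the expert label those states, aggregate, and retrain."""
    dataset = []
    policy = expert_policy                          # iteration 0 amounts to behavior cloning
    for _ in range(n_iters):
        states = rollout(env, policy)               # states visited by the current policy
        labels = [expert_policy(s) for s in states] # expert corrections for those states
        dataset.extend(zip(states, labels))         # aggregate, never discard old data
        policy = train(dataset)                     # supervised retraining on the aggregate
    return policy
```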


How do we get more human guidance in an exploration setting? The answer (or at least one plausible one) is simulation. But how do we solve the infamous sim2real problem? Alex Kendall argued that we need to learn a proper representation: “Learning to Drive from Simulation without Real World Labels” proposes to jointly learn a latent space for domain adaptation and control.


On a side note, he also talked about the differences between autonomous driving and games such as Go and DOTA.


  • Games: states are easy (discrete, fully observable or noise-free), but action space is huge

  • Autonomous driving: state space is huge (long tail, noisy), but action space is simple

As one of the pioneers of applying Bayesian DL, Alex also touched on the interpretation and verification of DL representations, pointing to three papers on the topic.

Most famous is the distinction between aleatoric uncertainty and epistemic uncertainty. The latter is model uncertainty, which can be explained away with more data and thus should be the focus of the industry. Aleatoric uncertainty is a sensing issue and stays largely the same for out-of-sample and in-sample data alike. As shown in the image below, aleatoric uncertainty highlights object outlines; it cannot be explained away with more data and is more related to the limitations of using an RGB camera for semantic segmentation.


The famous illustration of Aleatoric and Epistemic Uncertainty from Alex Kendall’s NIPS 2017 paper.
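Here is a minimal sketch of how the two uncertainties are usually estimated in practice, following the spirit of the Kendall & Gal NIPS 2017 paper (the function names are my own):

```python
import torch

def heteroscedastic_loss(pred, log_var, target):
    """Aleatoric (data) uncertainty: the network predicts a log-variance
    alongside each output, so noisy targets are automatically down-weighted."""
    return 0.5 * (torch.exp(-log_var) * (pred - target) ** 2 + log_var).mean()

def mc_dropout_predict(model, x, n_samples=20):
    """Epistemic (model) uncertainty via MC dropout: keep dropout active at
    test time and read off the spread of repeated stochastic predictions."""
    model.train()                                   # keep dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)      # prediction, epistemic variance
```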

Raquel Urtasun from Uber ATG and U Toronto talked about applying DL in the full technical stack of autonomous driving.


The traditional engineering stack in autonomous driving involves perception, tracking (of frame-based perception), prediction (of other traffic agents), planning (of the ego vehicle) and control. Deep learning has been applied to each of these modules before, but simply chaining them together has drawbacks compared to an end-to-end method. (These points were echoed in one of their orals at CVPR 2018.)


The traditional engineering stack for autonomous driving

Simply putting together a chain of DL models has the following drawbacks:

  • It is hard to propagate uncertainty through a chain of deep learning models. You may argue that we could use an Extended Kalman Filter (EKF) to propagate the uncertainties from each module, but even for that we need calibrated DL models (see the calibration sketch after this list).

  • Computation not shared between modules.

  • Each module is trained separately to optimize different objectives.
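On the calibration point: one common, simple recipe is temperature scaling (Guo et al., 2017), which was not part of the talk but illustrates what a "calibrated DL model" means in practice. A single scalar T is fit on a held-out set so that softmax(logits / T) better matches empirical accuracy. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def calibrate_temperature(val_logits, val_labels):
    """Fit a single temperature T on held-out logits/labels; at inference time,
    divide the model's logits by T before the softmax."""
    log_t = torch.zeros(1, requires_grad=True)      # optimize log T to keep T positive
    opt = torch.optim.LBFGS([log_t], lr=0.01, max_iter=200)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```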

The route Uber ATG took is to extend the DL-based perception block to gradually incorporate the downstream blocks.


Detection and Tracking:

Detection, Tracking and Prediction:

Detection, tracking, prediction and planning:


End-to-end perception, prediction and planning

Out of all the papers, the Fast and Furious and neural motion planner talks were highly impressive, in particular the demos.


Another track of research I've observed from Uber ATG in recent years is multi-sensor fusion. In particular, MMF is the current SOTA for 3D object detection. My only pet peeve about their work is that they seldom open-source their code or datasets, making it hard to understand the technical details when a paper is not well written (some papers' technical sections really need a rewrite).



About me and the company

Trained as a physicist, I have worked in the fields of X-ray detector design (my PhD thesis), anomaly detection in semiconductor devices, medical AI and autonomous driving. I am currently working on perception (more specifically, the burgeoning field of DL-based radar perception) at a fast-growing startup aiming to deliver autonomous driving to consumer cars. We have offices in both San Diego and Silicon Valley. If you are passionate about deep learning, or you are a hard-core DSP algorithm engineer, or you simply want to say hi, please contact me!



Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.

