AI in Autonomous Driving: Challenges & Opportunities - Tao Wang, Former Co-founder of Drive.ai

Updated: Jul 24, 2019

Robin.ly held its semi-annual conference on AI commercialization, Trends and Challenges, on June 1, 2019. The event took place at the Computer History Museum in Mountain View, California, from 1 pm to 6 pm. The conference took a deep dive into the world of AI, questioning and examining the latest trends through the lens of commercialization. Tech leaders shared their experience and insights across different verticals during our featured talks and panel discussion. [View recap of event highlights.]


Tao Wang, the former Co-Founder of Drive.ai, discussed the challenges and opportunities in autonomous driving.


Tao Wang is a tech leader and entrepreneur with over 10 years of experience in AI and robotics. He is among the first to apply deep learning to self-driving cars. He was the co-founder of Drive.ai, a Silicon Valley startup specializing in level 4 self-driving cars, and led its research and engineering team. He graduated from Stanford University with a master's degree in computer engineering.



Slides for this talk are available to download from our resource center. (Log-in required.)


Listen to podcast on: Apple Podcast | Spotify | Google Play Music


Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era. Subscribe to Robin.ly Newsletter to receive content updates and exclusive event notifications.


Full Transcript


Tao Wang:

OK, so AI in self-driving. I think a lot of you have this question in your mind: it's already 2019, so where is my self-driving car? There has been a lot of hype in the industry, and companies have made huge promises. They have very high valuations, and they were saying: okay, in 2018 we'll have hundreds of driverless vehicles on the road, roaming and picking up passengers, and that's going to be a commercial deployment; by 2019, we'll probably have thousands. But we're not seeing them right now.


I think a couple of companies have small-scale pilot programs that run these vehicles around, but most of them still have a safety driver in the driver's seat. Even if they remove the safety driver, they have something called a chaperone, a technical person in the back seat or the passenger seat who can press the emergency stop button. So I think people have realized that self-driving cars are taking longer than expected, and there is maybe some disappointment. As you can see in recent media articles: self-driving cars are hitting a roadblock, they are further away than we think, they are elusive, all sorts of things.


I think definitely, AI today has hit some bottlenecks in terms of self-driving car deployments, especially when it comes to removing the driver altogether. And these kinds of challenges are always coupled with opportunities, which I think is great for new companies and newcomers to the industry, because as the big players hit bottlenecks, it gives new companies a chance to try new things and maybe one day overcome and overtake the big incumbents.

Before I go into self-driving cars, let me give you a little background about my own research back at Stanford; maybe it will help you better understand where I come from. I started my research at the Stanford AI Lab. The first project I worked on was not self-driving but drones, which is related. We retrofitted these radio-controlled fixed-wing aircraft, which are about this size, flew them, and tried to get them to fly in circles in a coordinated way. On the left, you see this black plane leading, which we call the lead aircraft. We programmed it to fly in the blue circle on the right. The picture on the left is taken by the wingman, meaning the following aircraft. The following aircraft doesn't actually know where the lead aircraft intends to go; it just follows it. But in order to do that, it needs to estimate its own state, like pose, speed, etc., as well as the pose and speed of the lead aircraft. So, we mounted very bright LEDs on the lead aircraft and used the camera on the following aircraft to track the pose of the lead aircraft using computer vision techniques. This was back around 2009 and 2010, and the computing at that time wasn't strong enough to support deep learning, so we used more traditional computer vision techniques. We were able to get the following airplane to fly in the red circle on the right, and it follows the lead aircraft pretty well, despite the turbulence from wind and other kinds of noise in the system.
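To give a feel for the kind of classical, pre-deep-learning pipeline such LED-based pose tracking might use, here is a minimal sketch: threshold the bright LED blobs in the camera image, then solve a perspective-n-point problem against the known LED layout on the lead aircraft. This is an illustrative reconstruction, not the actual flight code; the LED layout, camera matrix, and thresholds are made-up placeholders.

```python
# Illustrative sketch only: classical pose tracking of a lead aircraft carrying
# bright LEDs, as seen from the follower's camera. All numbers are placeholders.
import cv2
import numpy as np

# Known 3D positions of the LEDs on the lead aircraft, in its body frame (meters).
LED_LAYOUT_3D = np.array([[0.0, 0.0, 0.0],
                          [0.5, 0.0, 0.0],
                          [0.0, 0.4, 0.0],
                          [0.5, 0.4, 0.0]], dtype=np.float32)

CAMERA_MATRIX = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
DIST_COEFFS = np.zeros(5)  # assume an undistorted camera for simplicity

def detect_led_centers(gray_frame, min_brightness=240):
    """Find bright blobs (the LEDs) and return their pixel centers."""
    _, mask = cv2.threshold(gray_frame, min_brightness, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append([m["m10"] / m["m00"], m["m01"] / m["m00"]])
    return np.array(centers, dtype=np.float32)

def estimate_lead_pose(led_pixels):
    """Recover the lead aircraft's pose from matched LED detections via PnP."""
    # In a real system the detected blobs must first be matched to LED_LAYOUT_3D;
    # here we assume they are already in corresponding order.
    ok, rvec, tvec = cv2.solvePnP(LED_LAYOUT_3D, led_pixels, CAMERA_MATRIX, DIST_COEFFS)
    return (rvec, tvec) if ok else None
```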


It was around 2011 when our lab realized that these neural nets have great potential and the computing hardware was coming to a stage where we could actually make use of them; they used to be too slow for a lot of applications. The project I specifically worked on was character and digit recognition using deep learning. Given an image patch, we built a classifier on top of a neural net and tried to learn which of 62 characters it is: the uppercase letters, the lowercase letters, and the 10 digits. We also worked with Google to build one of the largest house-number datasets in the world. The images we got (see on the left) are from Google Street View; they sent us a lot of images, and we used a service called Amazon Mechanical Turk to extract the ground truth of the digits in a crowdsourced way. These neural nets were not called deep learning back then. Deep learning is really the buzzword that makes everyone remember, but as researchers back then, we called them multilayer convolutional neural nets. We used them to classify and read words in natural pictures, so that was fun.

GPUs and hardware were developing very quickly, following Moore's law, and by 2012 the hardware reached a state where we could train neural nets so big that they had never been trained before. In 2012, Google Brain published a paper on building a neural net with 1 billion neurons, and they trained it using only YouTube videos and images, which are not labeled. We don't have any annotations for the images, and the network doesn't know what's in them. Just by watching them over and over again, it eventually learned on its own that there is a concept of a cat face. It also learned a human face on its own. We looked at that result and said: this is cool, but only Google can afford it, right? Google trained it using over 1,000 machines and 16,000 CPU cores. At Stanford, we obviously didn't have that kind of resource, so we accelerated the neural nets using GPUs. Our lab was actually one of the first in the world to use GPU acceleration for large-scale deep nets, and we replicated the Google Brain results. We were also able to detect faces, human bodies, and cat faces just by watching YouTube videos, because, as you know, there are a lot of cats on YouTube. So, the natural question we asked ourselves back then was: okay, we have these huge neural nets that can detect human faces, but what is the high-impact application of that?
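For reference, a character classifier of the kind described, with 62 output classes over small image patches, can be sketched in a few lines of modern PyTorch. The layer sizes and patch size here are placeholders for illustration, not the architecture used in the original work.

```python
# Minimal sketch of a 62-way character/digit classifier over small image patches.
# Layer sizes and patch size are illustrative, not the original architecture.
import torch
import torch.nn as nn

NUM_CLASSES = 62  # 26 uppercase + 26 lowercase + 10 digits

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),
)

patches = torch.randn(16, 3, 32, 32)    # a batch of 32x32 character crops
logits = model(patches)                  # shape: (16, 62)
predicted_class = logits.argmax(dim=1)   # index of the most likely character
```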


So back then, we tried a lot of things, and eventually we decided: okay, self-driving cars look like the cool technology that can have a lot of impact on people's everyday life, and we started to apply deep learning to perception systems for self-driving cars. The way we detected vehicles is actually quite similar to the YOLO nets, which are the state-of-the-art detectors today, but back then we didn't know that. The way we did it is we partitioned the image into a grid of small patches, and for each patch we tried to estimate whether it came from a car or not. Of course, to decide the identity of the patch, we have to look at the bigger context. So, in the leftmost picture, you see this green box, which is the context the neural net is looking at, and it only makes a prediction about whether the middle part, the very little red patch, is from a car or not. By running this window across the image, we get confidence over the image about whether certain parts belong to a car. Apart from that, for each patch we can also jointly train a regression unit on top, actually several regression units, so that it predicts the coordinates of the bounding box covering the car, and we can throw in some other values for it to regress. If we have labels for how far the car is from you, which we call the depth data, we can also use the neural net to regress the depth and eventually predict it by looking at a single image.
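The grid/patch scheme described here can be sketched as a shared backbone with two heads: one that scores whether the central patch belongs to a car, and one that regresses bounding-box coordinates plus, optionally, depth. The following is a hypothetical illustration of the idea, not the original implementation; all sizes and strides are placeholders.

```python
# Sketch of patch-wise detection: a shared conv backbone with a classification
# head ("is the center patch part of a car?") and a regression head (bounding
# box plus depth). Sizes and strides are illustrative placeholders.
import torch
import torch.nn as nn

class PatchDetector(nn.Module):
    def __init__(self, context=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * (context // 4) ** 2, 256), nn.ReLU(),
        )
        self.classify = nn.Linear(256, 1)   # car / not-car for the center patch
        self.regress = nn.Linear(256, 5)    # (x1, y1, x2, y2, depth)

    def forward(self, context_patch):
        features = self.backbone(context_patch)
        return torch.sigmoid(self.classify(features)), self.regress(features)

detector = PatchDetector()
image = torch.randn(3, 480, 640)
stride, context = 32, 64
detections = []
# Slide a context window over the image; each window votes for its center patch.
for y in range(0, image.shape[1] - context, stride):
    for x in range(0, image.shape[2] - context, stride):
        window = image[:, y:y + context, x:x + context].unsqueeze(0)
        score, box_and_depth = detector(window)
        if score.item() > 0.5:
            detections.append((x, y, box_and_depth.squeeze(0)))
```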


And for lanes, as you know, we need to train these neural nets, and we need a lot of labeled data that tells us where the cars and the lanes actually are, which is what we call ground truth. For cars, it's easy: we just throw the images into Amazon Mechanical Turk and get people to label them for us. For lanes, it turns out to be even easier.


So, we mounted LiDAR on top of our vehicles back then and drove on highways. By doing that, we built a map of the highway, and because we know where our car is at any point, we can project the lane labels into our camera images and get free data that way. We only have to drive the highway once, and the next time we travel the same highway, we have these free lane labels.
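The "free label" trick comes down to projecting lane points from the 3D map into each camera frame using the vehicle's known pose and the camera calibration. A minimal sketch of that projection follows; the intrinsics and poses are made-up placeholders, not real calibration values.

```python
# Sketch of auto-labeling lanes: project 3D lane points from a prebuilt map
# into the camera image, given the vehicle pose at capture time.
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],      # camera intrinsics (focal lengths, principal point)
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

def project_lane_points(lane_points_world, R_world_to_cam, t_world_to_cam):
    """Return pixel coordinates of 3D lane points that fall in front of the camera."""
    points_cam = (R_world_to_cam @ lane_points_world.T).T + t_world_to_cam
    in_front = points_cam[:, 2] > 0.1          # keep only points ahead of the camera
    points_cam = points_cam[in_front]
    pixels_h = (K @ points_cam.T).T            # homogeneous pixel coordinates
    return pixels_h[:, :2] / pixels_h[:, 2:3]  # divide by depth to get (u, v)

# Example: a straight lane marking 10-60 m ahead, 1.8 m to the right of the camera.
lane_world = np.stack([np.full(50, 1.8), np.zeros(50), np.linspace(10, 60, 50)], axis=1)
pixels = project_lane_points(lane_world, np.eye(3), np.zeros(3))
```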


So, when we built the neural nets for lane detection, we could eventually train them to predict multiple lanes with the depth information that we extracted from the 3D maps. On the left, you see the camera detection, which we can back-project into 3D space. On the right, you see the top-down view of the predicted lanes.


Same thing for cars. We predict cars in all sorts of situations. On the left, you can see that the middle part is where the detector is actually firing, or being activated, and the bounding box is the result of the regression. On the right, you see we are traveling on this winding highway with very adverse lighting conditions, with the sun shining right into the camera. Even though there are a lot of shadows, we can still predict the cars. And if we have the cars and the lanes labeled, we can also extract where the drivable surfaces are on the highway. Here the red carpet is the output of the neural net that tells us which part of the highway is drivable. Obviously the cars and trucks are not drivable, the walls are not drivable, the divider is not drivable, but everything else is; you can detect the asphalt pretty reliably.


And we can also put everything together. Because we have depth predictions on vehicles as well as lanes, we can place everything into one frame. On the right, you can see we predict the positions of these vehicles just by looking at single-frame images. And these are not stereo images; they are monocular images, from which we can predict the depth information as well. This technology has been around for about seven years now, and it's quite encouraging to see a lot of industry players picking it up. You can see similar components like lane detection, vehicle detection, and object detection using neural nets and deep learning. This is great work done by the Nvidia research lab.
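To make the monocular placement step described above concrete: converting a detected bounding box plus its regressed depth into a 3D position is essentially back-projection through the camera intrinsics. A hedged sketch, with the same placeholder calibration values as in the projection sketch earlier:

```python
# Sketch of placing a monocular detection into 3D: back-project the bounding
# box center through the camera intrinsics using the regressed depth.
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def detection_to_3d(box, depth_m):
    """Convert a 2D box (x1, y1, x2, y2) plus predicted depth into a 3D point (camera frame)."""
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    ray = K_inv @ np.array([u, v, 1.0])   # direction of the pixel's viewing ray
    return ray * depth_m                   # scale the ray by the predicted depth

car_position = detection_to_3d(box=(600, 300, 700, 380), depth_m=22.5)
```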


I believe this is from Tesla. You can see it predicts not only the 2D bounding box for each vehicle but also the 3D bounding box, again done with deep learning, and it predicts the green carpet, which is the drivable surface, as well as the lane positions.


I would say a lot of the technology we developed back at Stanford has been picked up by industry players and has landed in ADAS, L2-type driving, or L3-type self-driving applications. That's great, but I think the bigger question is: where is level 4 self-driving? Level 2 and level 3 are cool, but the reason they can commercialize first is that if anything happens, you can blame the driver. So, my labmates and I started this company called Drive.ai.


Over four years, we went from a little garage and a handful of people to a team of 180. We have driverless video demos, and we have self-driving cars running around in Texas. But in a real deployment, we still have a safety driver in there.


So, I think it's a problem for the whole industry, right? For a lot of self-driving applications, if you don't remove the driver, there's no business case. It seems like all the players, including Waymo, Cruise, and the other big players, are having some hesitation about actually removing the driver. I want to talk about what's holding everyone back, and in my opinion it is mostly technical. People used to think deep learning is the silver bullet for self-driving: as long as you collect enough data, as long as you throw money and data at the system, it will eventually get there; it will eventually be better than a human. But I think that's not true, because there are some fundamental limitations to deep learning. As a deep learning researcher myself, I have realized over the years that deep learning alone is probably not going to enable driverless cars. Maybe it's enough for some niche markets, like places where there is nobody around, or very structured environments where everyone is predictable. And there are legal and economic challenges derived from the technical challenges.


So, when you look at a self-driving car system today, especially a level 4 self-driving car, this is a typical architecture; it is not specific to any company. On the left, you have the sensor suite: LiDARs, cameras, GPS/IMU, and maybe radars.


In the center, you have the core algorithm part. That's where all the AI happens; that's where all the magic happens. It can be broken down into localization, which is knowing where you are; perception and prediction, which is knowing what the world looks like and how it is going to evolve over time; and motion planning, where you need to know where to go and how to get there locally, planning your path and trajectory so that the motion of your car in the immediate future is not going to cause any collision or danger. On the bottom, you have the onboard infrastructure, where you store the HD maps, monitor the health of the different systems, and communicate between processes. And on the right, you have the vehicle and user interface. If the center part is solved, then everything else is a solved problem. What's really hard here is the center part: perception and motion planning.
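As a rough sketch of how those core blocks fit together in code, here is an illustrative skeleton. The module names, interfaces, and stub behaviors are invented for illustration; no real stack is this simple, and this does not describe any particular company's architecture.

```python
# Illustrative skeleton of the core loop described above: localization,
# perception, prediction, and motion planning. All interfaces are made up.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pose:
    x: float = 0.0
    y: float = 0.0
    heading: float = 0.0

@dataclass
class TrackedObject:
    pose: Pose
    speed: float
    predicted_path: List[Pose] = field(default_factory=list)

def localize(sensor_frame, hd_map) -> Pose:          # where am I?
    return Pose()

def perceive(sensor_frame) -> List[TrackedObject]:   # what is around me?
    return [TrackedObject(Pose(10.0, 0.0), speed=5.0)]

def predict(obj: TrackedObject) -> TrackedObject:    # how will it move?
    obj.predicted_path = [Pose(obj.pose.x + obj.speed * t, obj.pose.y) for t in (1, 2, 3)]
    return obj

def plan(ego: Pose, objects: List[TrackedObject], hd_map) -> List[Pose]:
    # Pick a short collision-free path; real planners optimize over many candidates.
    return [Pose(ego.x + 1.0 * t, ego.y) for t in (1, 2, 3)]

def run_once(sensor_frame, hd_map):
    ego = localize(sensor_frame, hd_map)
    objects = [predict(o) for o in perceive(sensor_frame)]
    return plan(ego, objects, hd_map)   # trajectory handed to vehicle control
```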


I think almost everyone is using deep learning for perception nowadays. But deep learning has this intrinsic limitation, the long tail problem, where rare events require an exponential amount of data to train and validate. If you look at the picture on the right, the blue curve is the event probability. As the blue curve goes to the right, it goes into those rare events that are hard to see in real life: one car hitting another car and sending it flying, or people in dinosaur costumes crossing the road. You don't see that every day. But deep learning needs a lot of data to be trained, and it has to see similar data in order to handle a new case. These events are so rare that you probably need to drive 10,000 miles to see a person in a dinosaur costume, but you probably need a million miles to see a plane crashing onto the road. So, if you want to handle all these cases, the data you need to collect also increases exponentially. And a lot of these rare events are not just about braking a little harder; they are about life or death. So, we cannot just say these things are rare, so we don't care about them. Sacha Arnoud, who is a Director of Engineering at Waymo, once famously said that "when you are 90% done, you still have 90% to go." It's like you never get to the end, and you never know when you will be done.

Another limitation of deep learning, related to the long tail problem, is the black box problem. I think deep learning today is like alchemy in ancient times. The researchers are alchemists, not real chemists, because probably the only person in the world who has a reasonable understanding of what is inside deep learning is Geoff Hinton.


Most of the researchers in this space are just trying new things and seeing whether they work. If something works, they write a paper to explain why it works; really, they try to guess why it works. As for the theory, I don't think there is much mathematical provability in it. I would say if deep learning is chemistry, then the periodic table hasn't been discovered yet. As a result, when deep learning fails, nobody knows why. And as you can see in some of the recent research, deep learning is actually quite susceptible to noise that is not discernible by human eyes. On the top, you have a school bus, you have a lion, you have a pyramid, but just by applying a tiny amount of noise, carefully chosen of course, you can get the neural net, which was previously able to predict all these classes, to predict every one of them as an ostrich. And to me, each pair of pictures looks exactly the same. The other scary thing about deep learning, or the failure mode of deep learning, is that there is no self-diagnosis: it doesn't know when it is not sure. In this case, on the bottom, the original neural net predicts that this is a panda with about 57% confidence. But by adding some carefully chosen noise, it becomes almost a hundred percent sure that this is a gibbon. And it is so darn sure it is a gibbon that the confidence is even higher than the confidence on the original picture.
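The kind of carefully chosen noise described here (the panda-to-gibbon example) is typically produced with a gradient-based attack such as the fast gradient sign method. Here is a minimal, generic PyTorch sketch; any pretrained, differentiable image classifier could stand in for the model, and the epsilon value is an illustrative choice.

```python
# Minimal sketch of a fast-gradient-sign (FGSM) adversarial perturbation:
# nudge every pixel a tiny amount in the direction that increases the loss.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, epsilon=0.007):
    """image: 1xCxHxW tensor in [0, 1]; true_label: tensor of shape (1,)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Perturb each pixel by +/- epsilon along the gradient sign, then clamp to valid range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage (illustrative): with a pretrained classifier, the perturbed image often
# flips the prediction with near-imperceptible changes, as in the panda example.
```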


So as this polar bear walks, the neural net predicts different classes over time. Here it thinks it is most likely a baboon. And as the bear stands up, the neural net thinks it is a meerkat, because a meerkat stands up and looks around, right? So, this is quite interesting.


And this is even more interesting. Some researchers in Belgium took a state-of-the-art pedestrian detector and tried to spoof it using a piece of printed paper. The gentleman on the right is carrying this piece of paper, and he becomes invisible to the neural net. And this is one of the best neural nets in research today for detecting people.


Okay, I think this video is fun, but it's also scary in some sense. If I'm going to build a neural net and put it into a self-driving car system, what if someone just wears a shirt with that picture printed on it? How am I going to explain to a judge that the neural net saw this picture and decided it was not a human? This is computer vision based on camera images, and some people may argue that you can use LiDAR or radar instead. But as long as you use deep learning, even if you run it on LiDAR data, for example, to detect cars, the intrinsic limitations still exist.


So what's the solution? I think one of them is to understand the inner mechanics of deep learning better, specifically how adversarial examples affect these neural nets and why neural nets fail on this seemingly innocuous noise. Some MIT researchers recently discovered that there are two kinds of features that neural nets can learn from standard training: robust features and non-robust features. Robust features are ones that are robust to noise. Non-robust features are also useful in determining the actual class of the picture, but they are very susceptible to noise.


What the researchers did is construct a data set using only the non-robust features. To human eyes, it doesn't make any sense, but a system trained on it performs reasonably well on the actual data set. It's like teaching the neural net with all these fake-looking features, such that it is still able to predict correctly on normal images. As illustrated on the right: you take a picture of a dog and modify it so that it contains the non-robust features of a cat, and then you re-label this picture as a cat. To a human eye, this label is completely wrong.
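The construction he is describing can be sketched roughly as: perturb each image toward a target class using gradients from a standardly trained classifier, then relabel the perturbed image with the target class, and train a fresh network on this visibly "mislabeled" data. The following PyTorch-style sketch of the relabeling step is a rough illustration under that reading; the step count and step size are made-up hyperparameters.

```python
# Rough sketch of building a "non-robust features" training example: perturb an
# image toward a chosen target class using a standardly trained classifier,
# then relabel it as the target class. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def make_nonrobust_example(model, image, target_class, steps=20, step_size=0.01):
    """image: 1xCxHxW tensor in [0, 1]; returns (perturbed image, new label)."""
    x = image.clone().detach()
    target = torch.tensor([target_class])
    for _ in range(steps):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), target)        # how far are we from "cat"?
        grad = torch.autograd.grad(loss, x)[0]
        # Step the pixels toward the target class (gradient descent on the loss).
        x = (x - step_size * grad.sign()).clamp(0.0, 1.0).detach()
    # The image still looks like its original class to a human, but is labeled "cat".
    return x, target_class
```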


Yet if you train a neural net on these non-robust features and evaluate it on the original data set, the spoofed neural net is still able to predict the cat. That tells us the non-robust features are actually quite powerful, and if we can separate the robust features from the non-robust features, we will be able to build a more robust system. Also, some MIT researchers recently published a paper on validating an image against all kinds of variation that could possibly happen and seeing whether the classification actually changes; if it doesn't change, that means the picture is probably not being spoofed. On the motion planning side, if we want better guarantees on safety, there are frameworks proposed by industry players like Mobileye and Nvidia that call for a more rigorous mathematical argument about the behavior of the system, so that there is no at-fault collision. The remaining question is where we can improve the handling of uncertainty in the perception and prediction systems.

So finally, some suggestions for the self-driving industry: I think people have to move away from purely empirical validation, because on average, humans drive 1.6 million miles before running into an accident. For this number to be statistically significant, the RAND research states that you would need to drive a billion self-driving miles, with the miles per intervention equal to or greater than the human accident rate. But Waymo's number is only around 11,000 miles per intervention today, so that's still more than 10 times away. And you would need to drive a billion miles to validate that number, which is just not feasible with today's technology. It takes both academia and industry thinking carefully about how we actually validate these systems more efficiently. With that, I'll end my speech here. Thank you, everyone.
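As a quick back-of-the-envelope check on that validation argument, here is the rough arithmetic. The fleet size and average speed below are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope numbers behind the validation argument above.
human_miles_per_accident = 1.6e6       # rough human accident rate cited in the talk
waymo_miles_per_intervention = 11_000  # the disengagement figure quoted in the talk
print(human_miles_per_accident / waymo_miles_per_intervention)  # gap of well over 10x

miles_needed_to_validate = 1e9         # order of magnitude cited from the RAND analysis
# Illustrative assumption: a 100-car fleet averaging 40 mph around the clock.
fleet_miles_per_year = 100 * 40 * 24 * 365
print(miles_needed_to_validate / fleet_miles_per_year)  # decades of driving just to validate
```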


Audience Questions Highlights:


Audience: When you transfer from your simulation environment to the real environment, what kinds of difficulties are there?


Tao Wang:

I think there's no easy answer. It's a real problem if you train a system in simulation and then try to put it in the real world. It's not just that the cars look different; the behavior of the agents is also different. Fundamentally, I don't know whether using huge amounts of data to train the system is actually helpful, at least on the safety front. It's definitely helpful for training the system on human interactions, improving the interaction between the self-driving car and the world, making the car behave a little more humanly, and making the user experience better. But I don't think it gives you any stronger safety guarantees.


Audience: So, is it fair to say that deep reinforcement learning, which ideally should be good for control, has not been successful?


Tao Wang:

I don't think it has been successful in the self-driving industry.


Audience: Okay. Is it because conventional control methods aren't good enough?


Tao Wang:

I think what's holding back self-driving cars is not the controls part. It's really perception, prediction, and planning. Deep reinforcement learning can do planning, but you can only train it in a very constrained and deterministic environment, like for AlphaGo. You know every single rule of Go, you can see every single stone, you can see the entire board. So, it's a completely observable world, and there's no uncertainty.


Audience: So, regarding person detection, there is some new work on using thermal sensors for humans or living objects. Any thoughts on sensor fusion, where you take LiDAR and other sensors but also a thermal sensor that can detect whether something is a living object or not?


Tao Wang:

Yeah, I think that's a great idea. I do think we need multimodal sensors to give us more information. The analogy is: if you use black-and-white images to determine the color of a traffic light, you are not going to be very successful. You can still train a net on top of that and probably infer that if the bottom light is lit up, it is probably green. But if you go to another state and the order changes, then it doesn't work. If you add a new dimension to the data and use color images, that gives you a lot of confidence, and it may let you bypass the entire deep learning training part: you just look at the pixel RGB values, and they tell you whether the light is red or green. So, for humans, if you know something is warm, then you don't hit it.
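His traffic-light example, skipping learning entirely and just looking at the pixel colors, can be illustrated with a few lines of classical image processing. This is a toy sketch; the thresholds are rough placeholders, and a real system would need far more care with exposure, occlusion, and lamp localization.

```python
# Toy illustration of the point above: once you have color, deciding red vs.
# green for a cropped traffic-light bulb can be done with simple pixel
# statistics instead of a trained model. Thresholds are rough placeholders.
import numpy as np

def classify_light(bulb_crop_rgb):
    """bulb_crop_rgb: HxWx3 uint8 crop of the lit bulb. Returns 'red', 'green', or 'unknown'."""
    pixels = bulb_crop_rgb.reshape(-1, 3).astype(float)
    r, g, b = pixels[:, 0], pixels[:, 1], pixels[:, 2]
    red_votes = np.mean((r > 150) & (r > 1.5 * g) & (r > 1.5 * b))
    green_votes = np.mean((g > 150) & (g > 1.2 * r) & (g > 1.2 * b))
    if red_votes > 0.3:
        return "red"
    if green_votes > 0.3:
        return "green"
    return "unknown"

# Example: a mostly red crop should be classified as "red".
fake_red_bulb = np.full((20, 20, 3), (220, 40, 30), dtype=np.uint8)
print(classify_light(fake_red_bulb))
```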


Audience: Are you considering a single agent when you train your model, or multiple agents? That is, are you thinking of just one self-driving car on the road, or an environment filled with multiple self-driving cars?


Tao Wang:

I think most of the industry is using single-agent models. That's because a future with tens of thousands of self-driving cars interacting together is pretty far away. But I think some of the multi-agent or game theory formulations can be used for the interaction between the self-driving car and the surrounding agents, like humans and other cars in the environment.


Audience: You also mentioned AlphaGo Zero. AlphaGo is very predictable because in that environment you understand every single state and action in the system, whereas self-driving cars operate in a realistic environment. I think that's one of the major challenges, so do you think there is any work being done to resolve this particular issue?


Tao Wang:

I think the whole industry is working on it. If your question is whether there is work in reinforcement learning toward helping self-driving cars, there is a lot of great work in academia, but I think the industry still hasn't started to pick that up.


Audience: So, what do you think is different computationally about what humans are doing that allows them to solve this problem? And what's the future for the research that you guys are doing?


Tao Wang:

I think one point that comes to mind right now is that humans probably have two different modes of driving. I'm not a psychologist, so I'm just speaking from my own experience. When I drive, there's the inference part, where I try to infer: okay, is this car going to change lanes? Is this pedestrian going to cross? There's also the reactive part, the reflex part: if I'm driving down an alley with a line of cars parked next to me and someone swings their door open, I'm not going to make a whole inference about whether this person is trying to step out. I just see the obstacle; I don't even know what it is, it just looks like something bad, and I slam on my brake. So, I think these are two different modes of computation, and the industry seems to mix them right now. I know there are companies that predict whether a car door is open using a deep neural net, and I feel this is probably not the best problem for a deep neural net to solve. It doesn't matter whether it's a door or some random object sticking out of the car; you should be able to see that there is a thing there, and you just don't hit it.


Audience: My question is a little more business-focused than technical. Self-driving cars are in some ways parallel to early commercial aviation development, and here we are 100 years later, and we still have highly trained pilots, even though the planes basically fly themselves. So why do we expect that we can take the driver out of autonomous vehicles in the next five years, especially when aviation is a much less constrained environment?


Tao Wang:

Aviation is much less constrained, but I would argue it is far more safety critical than maybe a self-driving car. Also, nowadays planes fly most of the way themselves, but they still have humans handle the takeoff and landing. I know there are some companies working on highway autonomy, where the truck driver only drives the truck onto and off the highway; on the highway it drives on its own and the driver can sleep. I think there's still some business value in that model.
