Updated: Feb 19
This episode is a live recording of our interview with Alexander Toshev at the CVPR 2019 conference. Alexander Toshev is a Research Scientist at Google Robotics. Toshev's research primarily focuses on machine perception and robotics. He received his PhD in Computer and Information Sciences from the University of Pennsylvania in 2010. He has served as a Program Committee member for CVPR, ECCV, and ICCV, and as an Area Chair for CVPR 2017.
Toshev co-organized a workshop titled "Deep Learning for Semantic Visual Navigation" at CVPR 2019. His presentation focused on addressing uncertainty in visual navigation for autonomous systems. During the interview, he shared key takeaways from the workshop he hosted at the conference and discussed current navigation problems in robotics.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Wenli: We have Alexander Toshev here with us. He’s a staff research scientist at Google. And he's also the organizer of the workshop “Deep Learning for Semantic Visual Navigation”. Thank you so much for coming here to join us. Would you like to share something about the workshop with the community that couldn't make it to CVPR this year?
Alexander Toshev: Yeah, this is an exciting area to keep an eye on. Navigation in the new world of learning presents a lot of interesting challenges. The ability to reason about large physical spaces and make decisions based on visual inputs touches the core of many machine learning problems. If we can consolidate these capabilities, we can build more autonomous systems that work in less constrained environments and adapt, which is a big step beyond the classical approaches to these challenges.
Wenli: When you were thinking about hosting this workshop, which co-organizers and speakers did you invite? Does each of them target a specific problem in this area?
Alexander Toshev: The idea came about with one of my colleagues at Google, and we sought out a few of our external collaborators. That was the process. Then the organizers set out to find speakers who could cover all the topics we were interested in and give us different points of view.
Wenli: You also told me earlier that you learned so much from the workshop. What did you learn? Any highlights?
Alexander Toshev: The highlights are that, first of all, people are really looking at the problem of navigation, which is an old, overloaded word, from a different angle. The angle is that it is not just about geometry and planning in a geometric space; it is about visual spatial reasoning. It is about reasoning how to act in a space where you can make many decisions, and there are many aspects to take into account. In particular, when you have little prior knowledge and the space is changing, you have to make complex reasoning decisions. The community is waking up to the fact that navigation can actually be a proxy for more complicated reasoning tasks. That's the new way of looking at it, and that's one of my take-home messages. But some people don't realize this yet, and some don't think about it in this particular way. There isn't actually a consensus on exactly how to define the problems.
Wenli: From your perspective, what are the general applications of deep learning for semantic navigation? I know there are still a lot of open problems.
Alexander Toshev: I think in general, for most of robotics, when it comes to mobility, things work pretty well if you have a static environment and full knowledge of the space. The moment you start removing these constraints, the problems become harder. For example, if you have a really good map of this building, a 3D model, and there are no people in it, a ghost building, it's very easy to have robots go from one place to another. But remove these constraints: you don't know the environment, you don't have a map, people are moving around, and the goal is defined not as "go to a coordinate" but as "go and find coffee." Classical systems cannot achieve this. They don't know how to deal with the uncertainty of not having a map. They don't know how to deal with dynamic objects. They don't have the ability to capture semantics, coffee versus a coordinate. So the application is basically placing autonomous systems in environments with less information and a lot of dynamics, where the system needs to adapt and understand both the semantics and the geometry.
Wenli: Yeah, that's really important for real-world applications. There are so many dynamic objects and changing environments. What are the current challenges, including some of the problems that you haven't mentioned yet?
Alexander Toshev: What the current challenges are, technically, is a very good question, because when we talk about semantic navigation, or visual navigation, people understand various things by it. There are a few well-defined problems, and there are other problems people talk about that are not so well defined. Having a well-defined benchmark to drive the community is one challenge. For example, one problem would be: if I am in an environment I haven't been in before, go and find a bathroom, or find some object. That is one important challenge. So imagine I want to go to the bathroom here, but I haven't been in this building before. Finding objects or locations by their class label in unexplored environments is a problem people should be looking into more.
Wenli: Interesting. Thank you so much for sharing that. We know that you have a paper accepted by CVPR this year called “Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks”. Can you briefly tell us about the paper?
Alexander Toshev: We're looking at navigation problems, specifically the problem of performing navigation in exactly this setup where you don't know anything about your environment and the goal is an object, not a location. In this particular piece of work, we look at one aspect of this problem: we would like to have a controller which takes new input observations and outputs actions that get the robot to the object it needs to find.
There is a class of algorithms which can be used to do this for general robotic problems: you take neural networks and train them with reinforcement learning. For many robotics problems, however, the history of what you see along the way doesn't matter. For example, in manipulation, which is a different robotic problem, you don't really need to memorize what you've done in the past in order to take an action. If I am to move a gripper towards a bottle, as I look at the gripper, I know I need to move it towards the bottle; whether my arm came from one side or the other really doesn't matter when I want to grasp the bottle.
But if I'm navigating, say I'm looking for a bottle, and I first go left, don't find the bottle, and come back, I need to know that I already went left and didn't find it. Otherwise, I might just go left again, which is stupid. For navigation, you need models which can memorize a long history. So that's a research problem, and in this paper we studied it. We designed a neural network with an external memory, then trained this network with external memory using RL for navigation. We show that this memory helps solve a variety of navigation tasks: it improves the performance of the system compared to neural networks with no memory or very small memory.
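The idea of a policy that attends over an explicit memory of past observations can be sketched in a few lines of NumPy. This is not the paper's Scene Memory Transformer, just a minimal illustration of the mechanism; the embedding size, the single attention step, and the linear action head are simplifications made up for this sketch.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SceneMemoryPolicy:
    """Toy policy with an external memory of past observation embeddings.

    At each step, the current observation attends over everything stored
    so far (a stand-in for the transformer attention in the paper), so
    the agent's action can depend on where it has already been.
    """

    def __init__(self, embed_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = []                        # grows with every step
        self.w_out = rng.normal(size=(embed_dim, num_actions)) * 0.1

    def act(self, obs_embedding):
        self.memory.append(obs_embedding)       # write current observation
        mem = np.stack(self.memory)             # (T, D)
        # Attention: current observation is the query, memory holds keys/values.
        scores = softmax(mem @ obs_embedding)   # (T,) attention weights
        context = scores @ mem                  # (D,) memory summary
        logits = context @ self.w_out           # (num_actions,)
        return int(np.argmax(logits))
```

The key point mirrored from the interview is that the memory is unbounded: it keeps every past observation, so revisiting a place produces a different memory state than seeing it the first time.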
Wenli: So this scenario assumes the robot cannot initially see the object?
Alexander Toshev: Yeah. Oftentimes a robot is looking for an object in an environment, and if it doesn't have a memory, it might just loop around: it doesn't know it's been there before and makes the same decisions. But if it has a memory, the moment it completes a loop, it will go somewhere else, and it will eventually find the object.
Wenli: Did you provide a solution or guidance?
Alexander Toshev: There are various ways to train this system. The one we picked was: initially the robot behaves more or less randomly, and the moment it starts getting to the object, it gets a reward, a cookie, so to speak. Over time, there is a way to train the system based on this reward, to give it guidance about which behaviors are good and emphasize those behaviors, and to de-emphasize the ones where it doesn't get a cookie. Eventually, the neural network learns these behaviors.
Because the reward is only given when you succeed, it doesn't tell you how to succeed; the optimization process needs to discover these behaviors.
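As a sketch of what success-only reward looks like in practice, here is a tiny REINFORCE-style loop on a one-dimensional corridor: the agent receives a reward of 1 only on reaching the goal cell, and the policy-gradient update reinforces whichever action sequence led there. The corridor length, learning rate, and episode cap are arbitrary choices for illustration, not details from the paper.

```python
import numpy as np

def train_sparse_reward(episodes=500, seed=0):
    """REINFORCE on a 1-D corridor where reward (the "cookie") comes only
    at the goal cell. Success-only feedback still shapes behavior:
    rewarded action sequences get emphasized, unrewarded ones do not.
    """
    rng = np.random.default_rng(seed)
    n_states, goal = 5, 4
    logits = np.zeros((n_states, 2))           # actions: 0 = left, 1 = right

    for _ in range(episodes):
        s, traj = 0, []
        for _ in range(20):                    # episode step limit
            p = np.exp(logits[s]); p /= p.sum()
            a = rng.choice(2, p=p)             # sample from current policy
            traj.append((s, a))
            s = max(0, min(goal, s + (1 if a == 1 else -1)))
            if s == goal:
                break
        reward = 1.0 if s == goal else 0.0     # sparse: success only
        for st, at in traj:                    # policy-gradient update
            p = np.exp(logits[st]); p /= p.sum()
            grad = -p; grad[at] += 1.0         # grad of log pi(at | st)
            logits[st] += 0.1 * reward * grad
    return logits

logits = train_sparse_reward()
```

Note the asymmetry the interview describes: failed episodes contribute nothing (reward is zero), so the optimizer only learns by stumbling onto success and then amplifying it.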
Wenli: What would be the future on business applications? Are you already working on it?
Alexander Toshev: No, that's purely academic research. I don't think this question can be answered yet, and I don't know the future applications. There are a lot of core problems which need to be solved before one starts thinking about that.
Wenli: You are working at Google as a research scientist. Do you notice differences between working in academia and in an industry research lab? I believe that in industry you're more focused on the product, on what can be applied. Do you feel the same freedom?
Alexander Toshev: I cannot compare with academia firsthand. But in general, if you look at CVPR now, a lot of companies are present, and they're solving fundamental problems in computer vision.
Wenli: So the line is a bit blurred?
Alexander Toshev: Yeah, the line is very blurred. As a society, as a community, we need to solve these fundamental problems, and both universities and industry are operating in this space. This is a pretty exciting time to be in industry. You have the intellectual freedom to pick important problems to work on.
Wenli: What would be the next milestone that you want to deliver next year?
Alexander Toshev: That's a good question. In general, we would like to keep pushing along the lines of this problem and try to make progress in some of its dimensions. But it's really hard to answer this question; things can change very quickly.
Wenli: Even CVPR is changing quickly. The number of papers has increased dramatically, and the number of attendees has increased dramatically. The whole industry is receiving so much attention now. That's why you're saying it is an exciting time to be in this industry and figure out which direction you want to work in. Each direction is full of opportunities and excitement, but there are also lots of problems that need to be solved.
Alexander Toshev: Yeah, that's true. There are a lot of problems to be solved. It's so exciting.
Wenli: You like to solve problems as a scientist, that’s very nice. Thank you so much for coming here.
Alexander Toshev: Thank you very much for having me.