Updated: Jul 30, 2019
This episode is a live recording of our interview with Srinath Sridhar and He Wang at the CVPR 2019 conference. Dr. Srinath Sridhar is a postdoctoral researcher at the Geometric Computing Group at Stanford University and He Wang is a Ph.D. student at Stanford University and was a student researcher at Google.
Their paper "Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation", presented at CVPR 2019, focuses on predicting the pose and size of objects that have never been seen before. During the interview, they shared the paper's main contributions and key applications, as well as the challenges they are currently facing.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Wenli: Nice meeting you both. So good to have you here. Thank you so much for joining Robin.ly. Can you introduce yourselves to our audience?
So my name is Srinath Sridhar. I am a postdoctoral researcher at Stanford University, where I work on problems in computer vision, digitizing human physical skills and human interaction.
My name is He. I am a PhD student at Stanford University. My research focus is mainly on understanding human-object interaction.
Wenli: I know that you both just submitted a recent paper called “Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation”.
Yeah. Let me start with an example. As a human, when you enter a new environment, you probably see a lot of objects that you're pretty familiar with. For example, a lot of mugs, a lot of phones. But the problem is you’re interested in where they are, and what is the orientation of the objects. So knowing that will help you to better manipulate them. And definitely, it will help a lot if the agent is a robot.
And the key thing here is that although you are pretty familiar with the categories of all of these objects, you might not have seen the specific objects before. For example, a mug you see is probably quite unique. You know there is a handle and the body of the mug, but you have never actually seen this particular item before. So a lot of classic algorithms, which rely on the exact model of the mug, cannot be applied here. This is the motivation of our paper: we want to estimate the orientation and the location of objects whose category you are familiar with but whose specific instance you have never seen before. Our paper tackles estimating the pose, which is the orientation and the location of the objects, and also estimating the size of the objects from RGB-D images.
So that's colored images with depth images as well. And we want to do this in three dimensions. So when we go to a living space, we've never seen a particular object before, but we'd like to know the 3D position and the 3D orientation of this particular object.
Wenli: Interesting. I actually never thought about this before talking to you guys. It’s not something you would normally think of in day-to-day life. So among all the citations and all the fields, why did you choose this particular field to study?
In computer vision, this is called “the 6D pose estimation problem”, and it is quite an important problem in many application areas. For example, in autonomous driving, robotics, and augmented and virtual reality, we need to have a good sense of where objects are in an environment. So we need to know exactly where the couches are, where the bottles are, where the cups are, where the cars are, and we need to do this in previously unseen environments. So we need to operate in environments that we've never been in before. For instance, I've never been to this part of Long Beach before. But as a human, I have no difficulty operating in this part. I know exactly where the couch is, I know exactly where the objects are and so on. So we want to endow computers with this capability. And that is exactly why, among many problems, we think that this is a very important problem to solve.
Wenli: So in this problem and also the paper that you just submitted, what are some of the most innovative methodologies that you have proposed in the paper?
I think our main contribution is two-fold. The first is our novel representation, which is called normalized object coordinate space (NOCS). In our task, we try to find where the object is, which is an instance segmentation problem, and also estimate the 3D size and 6D pose. We need a representation that unifies all of these problems so that we can solve them together. So our key ingredient is this novel representation, normalized object coordinate space, or NOCS for short.
To explain what this is, take this phone as an example. When the camera is looking at this phone, in the camera’s reference frame, if you put your eye at the camera center, you will see: oh, this phone is probably two meters away from me and a little bit lower, below the horizon. This is what we call the camera coordinates of this phone. But if you live in the canonical world of this phone, let's say you live on the surface of this phone, you'll find: okay, I am sitting right at the center of my world and this phone is centered at (0,0,0). It is also aligned with all other phones; there is a canonical orientation for this phone. Let's say this direction faces left, and all phones, each in its own canonical or normalized object coordinate space, are aligned to the same direction. So in the normalized object coordinate space, all the different phones have the same orientation and their sizes are normalized to fit inside a unit cube. They are at the same scale, in the same orientation, and zero-centered. This is the object's own coordinate space, as opposed to the camera’s coordinate space. Our task is to find the correspondence between the camera coordinate space and the object's own coordinate space. This information helps us solve the 6D pose and 3D size estimation problem.
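The description above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `to_nocs` and `umeyama` are made up for this example, normalizing by the bounding-box diagonal is one plausible way to fit an object inside a unit cube, and the Umeyama closed-form alignment is one standard way to recover scale, rotation and translation from point correspondences. The interview does not say which solver the paper uses. Here, synthetic correspondences stand in for what a network would predict.

```python
import numpy as np

def to_nocs(points):
    """Center points at the origin and scale so the bounding-box diagonal is 1."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    diagonal = np.linalg.norm(hi - lo)
    return (points - center) / diagonal

def umeyama(src, dst):
    """Least-squares similarity transform such that dst ~ scale * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    scale = (D * np.diag(S)).sum() / var_src
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

# Hypothetical phone-shaped point cloud in metres (made-up dimensions).
rng = np.random.default_rng(0)
model = rng.uniform(-1.0, 1.0, size=(200, 3)) * np.array([0.035, 0.075, 0.005])

# Place it in the camera frame: rotated 30 degrees, two metres away, a bit lower.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.0, -0.2, 2.0])
camera = model @ R_true.T + t_true

# Object-space coordinates of the same points, then recover pose and size.
nocs = to_nocs(model)
scale, R, t = umeyama(nocs, camera)
size = scale * (nocs.max(axis=0) - nocs.min(axis=0))  # metric extents per axis
```

The recovered `scale` is the metric length of the bounding-box diagonal, so multiplying the normalized extents by it gives the object's 3D size, while `R` and `t` give the 6D pose.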
One more innovation to add is, as you know, we use deep learning methods to solve this particular task that He just described. And one of the challenges is, there are no existing data sets, because deep learning runs on supervised data. So we need lots and lots of data to show examples.
Wenli: Where do you get that data?
So we have this new way of collecting data called a mixed reality approach. It's like augmented reality, but it's the inverse way of using augmented reality. We don't use augmented reality as an experience for users, but rather to train the algorithms that we want to make use of. It works as follows. We went to IKEA, and I’m going to explain why we went to IKEA. We took scans of different kinds of tables in the IKEA showroom. Then we rendered synthetic objects, that is, objects rendered using computer graphics techniques, and placed them onto the tables that we scanned at IKEA. The reason we did that was because we wanted to have objects, and we wanted to have labels for each of these objects. We wanted to know exactly where the object was, what the shape of the object is, and so on. And IKEA’s tables don't have any objects on them to begin with. So it makes it easier for us to just go to IKEA and get this stuff.
That’s how we collect the data. And we call this mixed reality data generation. We use real images from IKEA as backgrounds, take synthetic renders made with computer graphics techniques, and merge them together. This gives us a large dataset of over a quarter of a million labeled examples, with labels for each object's position as well as its orientation.
I think the key ingredient here is that if you use synthetic data and you render it, you know exactly where it is, along with its 6D pose and 3D size. So all this ground truth comes for free. However, you don't want your rendered image to look too different from a real image. That's why we adopt a real background. When you combine the two, you get free annotation, and you also get pretty realistic looking [images].
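A toy sketch of that idea, assuming NumPy arrays stand in for the real photograph and the rendered object: the synthetic pixels are pasted over the real background wherever the render mask is set, and the label is simply whatever pose and size were used to render the object. The `composite` helper, the image sizes, and the label values are all hypothetical and are not taken from the authors' pipeline.

```python
import numpy as np

def composite(background, render, mask):
    """Paste rendered pixels over the real background wherever mask is True."""
    return np.where(mask[..., None], render, background)

# Stand-in for a real tabletop photo: a flat light-grey 4x6 RGB image.
background = np.full((4, 6, 3), 200, dtype=np.uint8)

# Stand-in for a rendered synthetic object and its pixel mask.
render = np.zeros((4, 6, 3), dtype=np.uint8)
mask = np.zeros((4, 6), dtype=bool)
render[1:3, 2:5] = (30, 60, 90)
mask[1:3, 2:5] = True

image = composite(background, render, mask)

# The label costs nothing extra: it is the pose and size chosen for rendering.
label = {
    "mask": mask,
    "rotation": np.eye(3),                     # made-up orientation
    "translation": np.array([0.0, 0.0, 1.0]),  # made-up position in metres
    "size": np.array([0.1, 0.1, 0.1]),         # made-up extents in metres
}
```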
Wenli: What are some of the real business applications that you can apply this to? Autonomous driving? Robotics?
Autonomous driving, robotics, augmented reality, 3D scene understanding, all of these are potential applications. We see this as a very broad set of applications, not a narrow one. Just to give you some examples, let's say in the future, we want to have home robots, robots that can go to our kitchen and wash our dishes. This is something very trivial. Every one of us has no difficulty doing this. But our robots still are not capable of doing things like that. One of the problems that we need to solve to achieve these kinds of capabilities is to have robots that can know the position and orientation of objects that they've never seen before. For instance, let's say you want to wash a plate; you've never seen that particular plate in your life before. But as a human, you'd have no difficulty washing the plate. Robots don't have this capability. So one use for the technique that we've developed is applications like this. We can help robots understand where objects are.
And another potential application is in augmented reality. So let's say again, take the scenario where you go to a new environment, and you have objects in this environment. You've never seen these objects before. But you'd like to augment these objects with interesting effects. So let's say, you want to augment this bottle to have a different color. So you need to be able to figure out the 3D position of the bottle as well as the orientation.
From another perspective, you can think of it this way: classic computer vision algorithms tend to know how to handle only the specific objects in their training data. Human beings, however, have the ability to generalize our skills to an object we’ve never seen before, as long as its category is known. Our work, especially our representation, enables this generalization. We can put all the objects of the same category inside the canonical space and learn everything inside this normalized canonical space, so that when we see new objects, we still know this is comparable to something we saw in training, and the skill learned in training can easily propagate to the test case. Basically, manipulating and understanding all the daily objects can be seen as an application of our work.
Wenli: What are some of the challenges that you're facing right now?
So one challenge, as I alluded to earlier, is data. Right now, in this particular work, we showed that our algorithm works on six different object categories. We showed mugs, bottles, laptops, and a few other household tabletop categories. The challenge really is generalizing this to all kinds of object categories: couches, cameras, many different object categories. And in computer vision, people generally say that there are between 10,000 and 30,000 object categories that we deal with on a daily basis. So going from six to a large scale of 10,000 is going to take a lot of work.
Wenli: Is that one of the challenges that the entire computer vision industry is facing?
In computer vision, this is called generalization. Generalization is something that’s a very important part of what we try to do in computer vision, generalization to unseen objects, generalization to unseen categories. So can we do the learning on couches, but get it to work on chairs? So these are really important generalization problems that we deal with in computer vision. I don't think we are there yet. And I think it's going to take us a while to get there. I do see that as one of the big challenges. If you want to solve this using supervised machine learning techniques, data is going to be a big challenge.
The data challenge actually has another aspect. For the task we focus on, we need to supervise using a lot of synthetic objects. It’s really hard to do expensive 3D annotation for all the real objects. That's why we chose synthetic objects. However, this does introduce a domain gap. Basically, the rendered objects still look a little bit different from real objects.
Wenli: How do you close this gap?
In this work, we try to train our network with a small amount of real data and a lot of images from the Microsoft COCO dataset, where we don't have annotations, but from which our network learns the flavor of realistic looking images. But there is still a long way to go to fully close the gap. This is called domain adaptation in the computer vision literature. I think it's also something we need to improve.
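One simple way to picture this kind of mixed training is a batch sampler that draws from several data sources at once and remembers which samples carry pose labels. The sketch below is purely illustrative: the source names, file names, mixing weights, and the `sample_mixed_batch` helper are hypothetical, and the interview does not describe how the actual training batches were built.

```python
import random

def sample_mixed_batch(sources, weights, batch_size, rng):
    """Draw (source_name, item) pairs, picking sources in proportion to weights."""
    names = list(sources)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(name, rng.choice(sources[name])) for name in picks]

# Hypothetical data sources; the file names are placeholders.
sources = {
    "synthetic": [f"synthetic_{i}.png" for i in range(1000)],
    "real": [f"real_{i}.png" for i in range(50)],
    "coco": [f"coco_{i}.jpg" for i in range(500)],
}

# Made-up mixing weights, and which sources come with pose labels.
weights = {"synthetic": 0.6, "real": 0.1, "coco": 0.3}
has_pose_labels = {"synthetic": True, "real": True, "coco": False}

rng = random.Random(0)
batch = sample_mixed_batch(sources, weights, batch_size=8, rng=rng)

# Only samples with pose labels could contribute to a pose-related loss;
# the remaining ones would only expose the network to real-looking images.
pose_supervised = [item for name, item in batch if has_pose_labels[name]]
```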
Wenli: Computer vision has been receiving so much attention lately, because now you have so many successful research [projects] going on. So do you think that the industry is following up?
I mean, as you already pointed out, computer vision is receiving a lot of attention these days. CVPR this year has 10,000 attendees, so it's really huge. I've been attending computer vision conferences for more than seven years now, and I've kind of seen the growth of the community. There's a huge difference in terms of the number of papers submitted, in terms of the number of papers accepted, and in terms of the number of attendees.
Wenli: Do you know the numbers? How many papers were submitted and accepted?
I don't remember exactly for CVPR, it’s probably going to be announced tomorrow. But I think there are over 1000 papers accepted this year, and probably over 5000 submissions. So there's a lot of interest, and a lot of people are putting their energy into getting really good quality of work out there. So it's been receiving a lot of attention. And of course, there's also a lot of commercial value, so you see a lot of industry partners help the community. There're a lot of industry booths here, there's the expo with a lot of industrial companies visiting. So there's a lot of interchange between academia and industry.
Wenli: Yeah. But there are also some problems, just like data if you want to commercialize, right?
Exactly. So data is one problem. I think there are two sides to that problem. If you want to use academic datasets for commercial purposes, there are challenges. And commercial datasets are not usually public, so we can’t use them to improve existing state-of-the-art algorithms. There are of course exceptions to that. But nonetheless, there is some friction between the problems that industry is working on and whether they can make that work public.
Wenli: Are you currently working with any industrial labs?
We do have constant collaboration with many industrial labs at the top tech companies. We have friends at many of these companies and we enjoy collaborating with them. So yes, we do have those collaborations.
Wenli: Well, thank you so much for joining us.
Thank you so much for having us.