Chen Wang is a student at Shanghai Jiao Tong University and was an intern in Fei-Fei Li's group at Stanford. His paper, "DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion," proposed a new framework for estimating the 6D pose of objects from an RGB-D image.
This episode is a live recording of Wang presenting his paper during the CVPR poster session. He discussed his project in detail and explained how his proposed method differs from previous work in the field.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impact in the new tech era.
Subscribe to our newsletter to stay updated for more CVPR interviews and inspiring talks:
Hello everyone, my name is Chen Wang. I'm from Shanghai Jiao Tong University. This is the work I did at Stanford as an intern student.
So basically, in many robotic manipulation scenarios, we want to know where an object is and what its pose is, so that we can do grasping, packing, and all these kinds of manipulation tasks. That is the goal of this work.
And the major contribution of this work compared to previous methods is the following: when we try to recover the 6D pose of objects from RGB-D input, many points may be occluded by other objects, which causes a large performance drop under occlusion. This is mainly because some prior works estimated the 6D object pose from a single global feature, and when occlusion happens, the occluded parts corrupt that global feature. So we proposed our pixel-wise dense fusion pipeline. As you can see here, we process RGB and depth in separate branches: a CNN encodes the RGB image into a pixel-wise color embedding, and a PointNet-style network encodes the depth, converted to a point cloud, into a geometry embedding. Then, using the known correspondence between RGB pixels and depth points, we fuse the two embeddings pixel by pixel, so that we can make the pose prediction from these fused features.
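The pixel-wise fusion step described above can be sketched roughly as follows. This is an illustrative toy version, not the authors' released code; the feature dimensions, weight matrices, and the function name `dense_fusion_predict` are all assumptions made for the sketch:

```python
import numpy as np

def dense_fusion_predict(color_feat, points, W_c, W_g, W_out):
    """Toy sketch of pixel-wise dense fusion for 6D pose prediction.

    color_feat: (N, Dc) image features sampled at the N pixels with valid depth
    points:     (N, 3)  the corresponding 3D points from the depth map
    W_c, W_g, W_out: stand-in weight matrices for the color branch,
                     geometry branch, and per-point pose head
    """
    c = np.maximum(color_feat @ W_c, 0)     # (N, 64) per-pixel color embedding
    g = np.maximum(points @ W_g, 0)         # (N, 64) per-point geometry embedding
    fused = np.concatenate([c, g], axis=1)  # (N, 128) pixel-wise fusion
    out = fused @ W_out                     # (N, 8): quaternion(4) + trans(3) + conf(1)
    # Each point "votes" for a pose; keep the most confident point's vote,
    # so unoccluded points can carry the prediction when occlusion happens.
    best = out[:, 7].argmax()
    return out[best, :4], out[best, 4:7]
```

The key idea the sketch captures is that fusion and prediction happen per point, so an occluded region only spoils the votes of its own points rather than a single global feature.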
So basically, the major difference between our work and prior work is that we can make pixel-wise predictions from the points that are not occluded by other objects. As you can see here, this is one of our results on the YCB-Video dataset: the x-axis shows how heavy the occlusion is, and the y-axis shows the accuracy. As the occlusion gets heavier, our results stay more robust and stable compared to prior works. We also compared against state-of-the-art RGB-D pose estimation methods on these two datasets; for more information, you can check our repository, where we've released all of our code and information.
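Accuracy curves like the one described above are commonly computed from the ADD and ADD-S pose-error metrics on YCB-Video. A minimal sketch of those metrics, assuming the poses are given as rotation matrices and translation vectors (this is a generic textbook formulation, not the paper's evaluation code):

```python
import numpy as np

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between model points under the predicted pose
    and the same points under the ground-truth pose (asymmetric objects)."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_s_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: for symmetric objects, match each predicted point to the
    closest ground-truth point before averaging the distances."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

A pose is then typically counted as correct when the metric falls below a threshold (for example a fraction of the object diameter), and accuracy is the fraction of test frames that pass.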
So this video has two parts. In the first part, we go over the framework, which you can see in this poster, along with our test results on the YCB-Video dataset. These are our results, and these are the results from prior works; as you can see, we get much more stable and accurate pose estimates in heavily occluded scenes. The second part of the video shows how we use the trained model in a real robot grasping experiment. As you can see here, this is the robot's view. We run our DenseFusion pose estimation and then back-project the model points into the image frame; this is the estimation result, and most of the points align with the object's appearance. Once the robot knows where each object is and what its pose is, it can use a pre-defined grasping policy to grasp the objects, as you can see here. The robot then grabs these five objects and puts them into this box. So this is a typical pick-and-place application, for assembly or other factory scenarios.
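The back-projection step mentioned above is a standard pinhole-camera projection: transform the object's model points into the camera frame with the estimated pose, then project them with the camera intrinsics. A small sketch, where the intrinsic values in `K` are placeholder numbers, not the actual camera calibration used in the experiment:

```python
import numpy as np

def backproject(model_pts, R, t, K):
    """Project 3D object model points into the image with pose (R, t).

    model_pts: (N, 3) points on the object model, in the object frame
    R, t:      estimated rotation (3x3) and translation (3,)
    K:         pinhole intrinsic matrix (3x3)
    """
    cam = model_pts @ R.T + t        # (N, 3) points in the camera frame
    uv = cam @ K.T                   # (N, 3) homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]    # (N, 2) pixel coordinates

# Placeholder intrinsics: focal lengths fx = fy = 600, principal point (320, 240).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
```

If the projected points land on the object's silhouette in the image, as in the video, that is a quick visual check that the estimated pose is accurate.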