This episode is a live recording of our interview with Yogesh Rawat at the CVPR 2019 conference. During the interview, he shared the current status of capsule networks for computer vision and some of the problems he is currently facing.
Rawat is a postdoctoral researcher at the University of Central Florida. He presented the tutorial titled "Capsule Networks for Computer Vision" at the conference this year, which focuses on training capsule neural networks using a routing algorithm. Rawat received his PhD in computer science from the School of Computing at the National University of Singapore.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Wenli: We have Yogesh here with us. He's a postdoc researcher at the University of Central Florida. He's also an organizer of the tutorial “Capsule Networks for Computer Vision”. Thank you so much for coming here to join us.
Thank you for inviting me.
Wenli: Tell us a bit about this tutorial that you organized. How did you pick the speakers and co-organizers?
We did some recent work on capsule networks and got really good results in different problem domains. Then my supervisor at UCF, Mubarak Shah, suggested that we organize this tutorial, because this is a really exciting area and it's getting a lot of attention from the research community. We discussed it for a while, it turned out to be a good idea, and we submitted a proposal to the conference.
For co-organizers, we tried to contact the authors who published the breakthrough papers in this area, mainly via email. Most of them were not available, but some of them agreed, so we got positive responses and built a team. One good thing was that we have done a lot of work on capsule networks at UCF, so most of our speakers are from UCF. We also got one collaborator from the University of Toronto in Canada; they did the very initial work in this area, work that is really popular and was a breakthrough, and they agreed to collaborate. That was good.
Once we had the collaborators, we decided on the program: what content we should cover, and which problems or domains we should talk about. That's how we came up with the schedule. So that was the preparation for this tutorial.
Wenli: What is it about?
This is about capsule networks, which are really quite recent. The first research work that talked about this idea came out in 2011, so it's not very old. It was quite new, but there were some issues, so it didn't take off at that time. It took five to six years to refine these ideas; we started seeing some good papers in 2015, and then there was a big breakthrough in 2017. Since then, there has been a lot of research in this area.
The background is this: we saw breakthroughs in standard convolutional neural networks starting from 2010, so the area of deep learning really took off. Capsule networks were also proposed around that time, but training those networks was not easy. So while CNNs are everywhere these days, capsule networks weren't that popular because of that issue; now I think they are picking up. They are different from standard convolutional neural networks, and more intuitive. In a standard convolutional neural network, what we try to do is get activations that tell us whether certain features are present in the data.
But here is the difference. Instead of a single activation, we group activations together to represent entities present in the input data. So instead of just saying whether a visual entity is present in the input, we also describe the different properties of that entity. That's why we group those neurons together: the different activations represent different parts of the object, or different properties. On top of that, we can still tell whether the entity is present or not. That's really different from standard CNNs, and it's quite intuitive.
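To make the grouping idea concrete, here is a minimal NumPy sketch of a single capsule, following the formulation popularized by Sabour et al. (2017); this is an illustration of the general idea, not code from the tutorial. The `squash` nonlinearity scales a capsule's pose vector so that its length can be read as the probability that the entity is present, while its direction encodes the entity's properties:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash nonlinearity (Sabour et al., 2017): shrinks a capsule's
    pose vector so its length lies in (0, 1), interpretable as the
    probability that the entity is present, while the direction keeps
    the entity's properties (pose, deformation, etc.)."""
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

# A capsule is a small group of neurons: here, an 8-dimensional pose vector.
pose = np.array([0.5, -1.0, 0.3, 0.0, 0.8, -0.2, 0.1, 0.4])
v = squash(pose)
print(np.linalg.norm(v))  # length is strictly less than 1
```

A plain CNN unit would emit a single scalar here; the capsule keeps the whole vector, so downstream layers see both "is it there?" (length) and "what does it look like?" (direction).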
Wenli: What type of researchers will benefit from this tutorial? I guess what I'm asking is, how do you help senior researchers? What can you provide them?
I think one difficulty many researchers are facing is how to train these capsule networks. That was another motivation for doing this tutorial. Initially, the capsule network was proposed for images, and the idea was tested on very small, low-resolution images, for example, 20 by 20 black-and-white images. When people tried to scale them to bigger or higher-dimensional data, such as videos or high-resolution images, they didn't get good results. We were the first to apply these capsule networks to videos, which are high-dimensional data. We did it successfully; we were probably lucky, or we chose the right path, or we had the right ideas. We wanted to share that with the community by doing this tutorial.
Wenli: That’s very nice of you, doing something really good for this community. From your perspective, what’s the current status of the capsule networks for computer vision? Where are you at this stage?
From computer vision's point of view, it has been applied to different problem domains, and in some of them we have seen very good success. For tasks like image classification, we have good results when the dataset isn't that big. We have also seen good results in object segmentation and in entity localization in videos.
The reason is that we represent those entities using a group of neurons - we call them capsules - and those capsules represent entities, which is different from standard CNNs. When you represent an entity that way, it's actually easier to segment or track it. These two tasks specifically are showing very good results, and I hope this can expand to other problem domains as well.
Wenli: What are some of the breakthroughs that you and your team have solved?
To me, the biggest breakthrough in this area was the routing algorithm that came out across 2017 and 2018. That is what actually enabled us to train these capsule networks. Before that, it was not easy to do: the idea was there, the intuition was there, but we didn't know how to train such networks. The routing algorithm changed that.
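The routing-by-agreement procedure from Sabour et al. (2017) can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact implementation discussed in the tutorial: there are no bias terms or learned transformation matrices, and the prediction tensor `u_hat` is taken as given.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    # Scale each vector so its length lies in (0, 1).
    norm = np.linalg.norm(s, axis=axis, keepdims=True)
    return (norm**2 / (1 + norm**2)) * (s / (norm + eps))

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement, minimal sketch.
    u_hat: predictions from lower-level capsules for higher-level
           capsules, shape (num_lower, num_higher, dim)."""
    num_lower, num_higher, dim = u_hat.shape
    b = np.zeros((num_lower, num_higher))   # routing logits
    for _ in range(num_iters):
        c = softmax(b, axis=1)              # coupling coefficients
        s = (c[..., None] * u_hat).sum(0)   # weighted sum per higher capsule
        v = squash(s)                       # higher-capsule outputs
        b += (u_hat * v[None]).sum(-1)      # agreement strengthens the route
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 3, 8)))  # 6 lower -> 3 higher capsules
print(v.shape)  # (3, 8)
```

The key point is the agreement update: a lower capsule routes more of its output to the higher capsule whose current output its prediction agrees with, which is what made these networks trainable in practice.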
The next breakthrough, I'd say, is how we scaled that routing algorithm to high-dimensional data. As I mentioned, we did it for videos, which had never been done before, so I would say the biggest breakthrough is scaling these capsule networks. There were basically two main ideas we proposed. One is capsule pooling, where we reduce the number of capsules, because the number of capsules affects the routing algorithm: with too many capsules, routing becomes infeasible, and the network is also hard to train because it cannot fit in memory. The other is sharing the weights of the transformation mechanism between capsules. Together, these bring down the memory consumption and execution time of the algorithm.
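As an illustration of the first idea, here is a hypothetical 1-D sketch of capsule pooling: pose vectors of nearby capsules of the same type are averaged before routing, shrinking the number of capsules the routing step has to handle. The function name and shapes here are assumptions for illustration, not the authors' actual implementation:

```python
import numpy as np

def capsule_pool(capsules, window=2):
    """Hypothetical 1-D capsule pooling: average the pose vectors of
    capsules of the same type within a spatial window, reducing the
    number of capsules that enter the routing step.
    capsules: shape (num_positions, dim)."""
    n, dim = capsules.shape
    n_out = n // window
    # Group consecutive capsules into windows and average each window.
    return capsules[:n_out * window].reshape(n_out, window, dim).mean(axis=1)

caps = np.arange(12, dtype=float).reshape(6, 2)  # 6 capsules, 2-D poses
pooled = capsule_pool(caps, window=2)
print(pooled.shape)  # (3, 2): half as many capsules enter routing
```

Since routing cost grows with the product of lower- and higher-level capsule counts, halving the capsules before routing cuts both memory and execution time substantially.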
Wenli: Awesome. As a researcher, what are the problems that you're focusing on right now? What's your future path in this area?
Right now, I'm mainly looking into video analytics, understanding how we can solve different kinds of problems in the video domain. My main focus right now is activity recognition: given a video stream, can we detect what kind of activities are happening in that video?
One specific problem I'm looking at right now is videos captured from multiple views, because the viewpoint actually affects things a lot. As humans, we can easily tell that two views show the same activity. But for a computer, if we change the viewpoint, the input changes a lot, so it's very challenging. That's one area.
The other area I'm focusing on is semi-supervised learning. For videos, as I mentioned with activity recognition, getting labels is very hard. You can get video-level labels, but getting pixel-level labels is much harder, since there are so many pixels in a video that labeling them all isn't feasible. So what I'm looking into is how we can train our networks using very few labels. We do have a lot of videos that are not labeled, so how can we make use of those and combine the two to perform semi-supervised learning? Those are the challenges. Right now, most of the research is being done in supervised learning, and we are seeing very good results, but I think the next steps will be semi-supervised and unsupervised training.
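As one concrete flavor of semi-supervised learning, here is a sketch of pseudo-labeling, a common recipe for combining a few labeled examples with many unlabeled ones. This is a generic illustration, not the interviewee's specific method; `model_predict`, `toy_predict`, and the confidence threshold are all hypothetical names chosen for this example:

```python
import numpy as np

def pseudo_label_round(model_predict, labeled_x, labeled_y, unlabeled_x,
                       thresh=0.95):
    """One illustrative round of pseudo-labeling: predict on unlabeled
    data and keep only confident predictions as extra training labels.
    model_predict: hypothetical function returning class probabilities."""
    probs = model_predict(unlabeled_x)          # (n, num_classes)
    conf = probs.max(axis=1)
    keep = conf >= thresh                       # confident samples only
    new_x = np.concatenate([labeled_x, unlabeled_x[keep]])
    new_y = np.concatenate([labeled_y, probs[keep].argmax(axis=1)])
    return new_x, new_y

# Toy usage with a stand-in "model" that is confident on every other sample.
def toy_predict(x):
    p = np.full((len(x), 2), 0.5)
    p[::2] = [0.99, 0.01]
    return p

lx, ly = np.zeros((3, 4)), np.array([0, 1, 0])
ux = np.ones((4, 4))
nx, ny = pseudo_label_round(toy_predict, lx, ly, ux)
print(len(nx))  # 3 labeled + 2 confident pseudo-labeled = 5
```

In practice, each round grows the labeled set with the model's most confident guesses on unlabeled video, which is one way to exploit a large pool of unlabeled data alongside a small labeled one.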
Wenli: Alright, thank you so much for coming to our platform to share with us.
Thank you very much for inviting me.