Sudhakar Kumawat is a Research Scholar at the Indian Institute of Technology Gandhinagar, advised by Shanmuganathan Raman. His paper "LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks" proposes a Rectified Local Phase Volume (ReLPV) block, a better alternative to the standard 3D convolutional layer.
This episode is a live recording of Kumawat presenting his paper during the CVPR poster session. He discusses the project in detail and explains how his proposed method achieved state-of-the-art results in his experiments.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Subscribe to our newsletter to stay updated for more CVPR interviews and inspiring talks:
I’m Sudhakar Kumawat. I'm from IIT Gandhinagar and my advisor is Shanmuganathan Raman. This paper is “LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks”. We all know that the standard 3D convolutional layer has some problems: it is computationally very expensive, because there are so many parameters, and it has high memory and compute requirements.
So what we are doing is basically reducing the space-time complexity of 3D CNNs. Let's say this is a 3D input. In the standard 3D convolutional layer, you take a 3D filter and do a 3D convolution in a sliding-window fashion. In our block, in that same sliding-window fashion, we only compute responses at these low-frequency points. After that, we separate them into real and imaginary parts, which gives us 26 of these 3D feature maps. We pass them through an activation function, and then we convolve them along the channel axis, so we learn a linear combination of these channels. That is a compact summary of the architecture.
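The pipeline described above can be sketched in NumPy. This is a hypothetical re-implementation, not the authors' code: it assumes a 3×3×3 window, a single input channel, the 13 lowest non-zero frequency points of the local Fourier transform (whose conjugates supply the other 13 of the 26 maps), ReLU as the activation, and random weights standing in for the learned 1×1×1 channel combination.

```python
import numpy as np
from itertools import product

def relpv_block(volume, f_out, window=3, rng=None):
    """Sketch of a ReLPV-style block (illustrative, single-channel input).

    volume: 3D array (D, H, W). Returns (f_out, D, H, W) feature volumes.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = window
    r = n // 2
    # 13 lowest non-zero frequency points: the lexicographically first half
    # of the 26 non-zero points of {-1,0,1}^3 (the rest are their conjugates).
    pts = [v for v in product((-1, 0, 1), repeat=3) if v != (0, 0, 0)]
    freqs = np.array(pts[:len(pts) // 2])                      # (13, 3)
    # Fixed complex Fourier basis filters over the n x n x n window.
    coords = np.array(list(product(range(-r, r + 1), repeat=3)))  # (n^3, 3)
    basis = np.exp(-2j * np.pi * (coords @ freqs.T) / n)          # (n^3, 13)

    D, H, W = volume.shape
    pad = np.pad(volume, r)
    out = np.zeros((26, D, H, W))
    # Sliding-window local Fourier responses at the 13 frequency points.
    for z in range(D):
        for y in range(H):
            for x in range(W):
                patch = pad[z:z + n, y:y + n, x:x + n].reshape(-1)
                resp = patch @ basis            # 13 complex responses
                out[:13, z, y, x] = resp.real   # real parts  -> 13 maps
                out[13:, z, y, x] = resp.imag   # imag parts  -> 13 maps
    out = np.maximum(out, 0.0)  # activation (ReLU) on the 26 phase maps
    # The only trainable part: a 1x1x1 linear combination of the 26
    # channels. Random weights here are a placeholder for learned ones.
    weights = rng.standard_normal((f_out, 26)) / np.sqrt(26)
    return np.einsum('oc,cdhw->odhw', weights, out)
```

Note that everything before the final `einsum` uses fixed filters; in a real network only the 26-to-`f_out` combination would receive gradients.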
What is interesting about this work is that this whole first part is non-trainable; no training takes place there. The only training happens in the final channel combination. If we look at the number of parameters, where c is the number of input feature maps and f is the number of output feature maps, only the combination of the 26 fixed intermediate channels is learned, so we reduce the number of parameters quite a lot.
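As a rough back-of-the-envelope illustration of that reduction (the function names and the simplifying assumptions here are mine, not from the paper): a standard k×k×k 3D convolution with c input and f output channels learns c·f·k³ weights, while a block whose only trainable part combines 26 fixed maps into f outputs learns 26·f weights.

```python
# Illustrative parameter counts (biases ignored).
def conv3d_params(c, f, k=3):
    # Standard 3D convolutional layer: one k*k*k kernel per (input, output)
    # channel pair.
    return c * f * k ** 3

def relpv_trainable_params(f):
    # ReLPV-style block: the local-phase filters are fixed, so only the
    # 1x1x1 combination of 26 channels into f outputs is learned.
    return 26 * f

c, f = 64, 64
print(conv3d_params(c, f))        # 110592
print(relpv_trainable_params(f))  # 1664
```

Under these toy numbers the trainable-parameter count drops by roughly two orders of magnitude, which matches the spirit of the claim above.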
We evaluate our layer both on 3D CAD model datasets and on action recognition datasets. Here, on ModelNet40 and ModelNet10, we compare against the baselines.
Here, on the action recognition datasets, we compare with the state of the art, and here we compare with the baselines. You can see that we get much better accuracy without increasing the number of parameters of the model, although the number of operations does increase because of this operation. Here we vary the number of feature maps of our network at each layer, and the trends are consistent with the standard 3D convolutional layer. And here we show how our network compares with the baseline network; these charts present that data.