Updated: Jun 16
This episode is a live interview with Deqing Sun, a research scientist at Google, at CVPR 2019. Deqing introduces the tutorial he organized on Deep Learning for Content Creation, the challenges he is currently facing, and two papers accepted by CVPR that he co-authored.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Wenli: We have Deqing Sun here. He is a research scientist from Google. And he's also an organizer of the tutorial “Deep Learning for Content Creation”. Thank you so much for coming here to share with us.
Thank you for inviting me.
Wenli: What is this tutorial about?
Actually, it's a very interesting tutorial: deep learning for content creation. You hear about deep learning a lot, but content creation is maybe a little new to this conference. Content creation is more related to, for example, making movies or creating games.
Wenli: How can deep learning help with that?
That's a very interesting angle in which we are looking at the problem now. So in the past, for example, if you want to create a movie, you need a lot of artists. You have to do a lot of iteration, a lot of manual work. But gradually, especially more recently, people find that you can do a lot of hard jobs using a deep learning-based approach.
One simple example is video interpolation. In the past, if you wanted to make a movie, you had to make every frame from beginning to end. But with a very realistic video interpolation algorithm, you can now make just a few keyframes and let the algorithm interpolate what happens in between.
Wenli: Where does your algorithm get the picture data?
That's a very good question. We can start from high-frame-rate video as training data, because from a high-frame-rate video you can get a low-frame-rate video by dropping frames. Then you can let the algorithm learn the mapping from the standard video to the slow-motion video, and apply the same algorithm to movies or games. The algorithm learns to interpolate. Interpolation is maybe a commonly used technique in movie studios, but now we can do it very well with a deep learning-based approach.
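As an illustration (not from the interview), the idea of deriving training data from high-frame-rate video can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `make_interpolation_samples` and the `stride` parameter are hypothetical, and frames are represented by plain integers standing in for images. Every `stride`-th frame plays the role of the low-frame-rate input, and the frames skipped in between become the targets the model should learn to predict.

```python
def make_interpolation_samples(frames, stride=2):
    """Split a high-frame-rate clip into (first, last, targets) samples.

    `first` and `last` are the two frames a low-frame-rate video would keep;
    `targets` are the in-between frames that were dropped, which serve as
    ground truth for an interpolation model. Hypothetical helper.
    """
    samples = []
    for start in range(0, len(frames) - stride, stride):
        first = frames[start]
        last = frames[start + stride]
        targets = frames[start + 1 : start + stride]
        samples.append((first, last, targets))
    return samples


# Integers stand in for the 9 frames of a high-frame-rate clip.
frames = list(range(9))
samples = make_interpolation_samples(frames, stride=4)
```

With `stride=4`, each sample asks the model to reconstruct three missing frames from two kept frames, which corresponds to a 4x slow-motion task.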
Wenli: That’s so interesting. While you were organizing this tutorial, what kind of speakers and co-organizers did you pick to help you?
For this conference, the co-organizers and the speakers are more technically oriented. For example, we have three authors of the pix2pix and CycleGAN papers, which have become classics in the field. Their work has already made a very big impact on content creation.
Wenli: What are some of the things that the junior researchers can take away? What are some of the things that senior researchers can take away from this tutorial?
For junior researchers, especially for starting graduate students, I think the field is now in a very exciting phase, so there are lots of problems you can work on. Our feeling is that maybe it's better to first train yourself and get yourself well equipped by doing one or two projects. Just train yourself in the state-of-the-art techniques. Then you can begin to look for problems where you can make a unique contribution, instead of following what others are doing. Just pick your unique angle.
I think for senior researchers, there are different challenges. Because the field is making very rapid progress, we are facing not just technical challenges; sometimes the problem is more societal. As the techniques become so advanced, the videos and images being generated are sometimes so realistic that it's hard to tell the real from the fake. That may cause a lot of problems for society if people make bad use of these generated images.
Wenli: About how you use this technology, right?
Yeah, I think that's not just a technical problem. It may rely on the whole society to define what a good solution might be. But as senior people who are working on these problems, I feel it's our responsibility to think about the potential impact, to actively think about solutions, and also to draw the attention of the whole society to these potentially serious problems.
Wenli: I know we just talked about videos and movies. What are some of the other business applications that you can use?
For example, an advertisement. If you want to make an advertisement -
Wenli: The cost would be a lot cheaper, right?
Now you need a lot of artists and designers. If they could speed up the process, the cost would come down and the quality would greatly improve.
Wenli: Any other breakthroughs in the study?
I think the other breakthroughs may be in some interesting areas, like 3D. It’s a very interesting direction. You can generate interesting 3D models. AI is very data-hungry, so if you can generate realistic data together with the ground truth, you’ll be providing training data. For example, for autonomous driving, people have begun to rely on synthetic data to pre-train the models. And if you can make the synthetic data very realistic, you can significantly close the domain gap.
Wenli: Yeah, that’s another way to solve the current problems that we're facing. What are some of your challenges?
I think technically, there are challenges about how to render things more realistically. Another challenge is, now we know some methods that begin to work, but we don't know why they work. So we need to dig deeper into these models and understand why they work so that we can interpret the behavior of the models.
Wenli: Does it just happen to work at this moment?
It may not just happen to work. We may have some intuition that it would work, but we don't have a deeper understanding.
Wenli: What are some of the criteria that you use to evaluate whether this is a good fake video? What are the criteria?
That's a very good question. I think another challenge is how we can set up a good benchmark to evaluate the progress. Right now, because the field is making so much progress, people mostly rely on visual inspection. For example, if the video looks good, we already feel very happy about such progress because we were not able to do that before. But in the future, if we want the field to become more scientific, we will need more benchmarks to evaluate what real progress is. That's a very good point.
Wenli: Two of your papers were accepted by CVPR this year. What are they?
There are two papers addressing different problems. The first is called “Pixel-Adaptive Convolutional Neural Networks”. It's a generic operation that generalizes two things: the standard convolution operation and the classical image-processing technique called bilateral filtering. One drawback we found with standard convolution is that you apply the same filter at every pixel, regardless of the image content. But if you want to do more intelligent processing, for example when you look at a scene, you might want to apply different filters or features to the glass, to the floor, or to the sofa. This is something we found lacking in standard convolutional neural networks.
So we introduced this pixel-adaptive convolution, where the convolution filter weights adapt to the image content. We found this makes the neural network much more flexible. We've also found it can be applied to quite a few interesting applications and improve the performance of standard convolutional neural networks. We think it will have wider applications.
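As an illustration (not from the interview), the contrast between standard and content-adaptive filtering can be sketched in one dimension. This is a minimal sketch under stated assumptions: the function name `pixel_adaptive_conv1d`, the per-pixel `features` input, and the choice of a Gaussian kernel on feature differences are all assumptions made for the example, not a reproduction of the paper's implementation. Each fixed filter weight is multiplied by a factor that shrinks when the neighbor's feature differs from the center pixel's feature.

```python
import math


def pixel_adaptive_conv1d(values, features, weights):
    """1D content-adaptive filtering with zero padding (illustrative only).

    out[i] = sum_j adapt(features[i], features[j]) * weights[j - i] * values[j]
    where adapt is an assumed Gaussian on the feature difference.
    With constant features, adapt == 1 everywhere and this reduces to a
    standard (cross-correlation style) convolution.
    """
    radius = len(weights) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + k - radius
            if 0 <= j < len(values):
                diff = features[i] - features[j]
                adapt = math.exp(-0.5 * diff * diff)  # near 0 across strong edges
                acc += adapt * w * values[j]
        out.append(acc)
    return out


values = [1.0, 1.0, 5.0, 5.0]
box = [1.0, 1.0, 1.0]

# Constant features: same filter at every pixel, i.e. standard convolution.
standard = pixel_adaptive_conv1d(values, [0.0, 0.0, 0.0, 0.0], box)

# Features with a sharp edge between pixels 1 and 2: filtering stops at the edge.
adaptive = pixel_adaptive_conv1d(values, [0.0, 0.0, 100.0, 100.0], box)
```

In the standard case the values on either side of the edge are mixed together, while in the adaptive case each pixel only aggregates neighbors with similar features. Using pixel intensities as the features and a spatial Gaussian as the fixed weights gives behavior similar to an unnormalized bilateral filter.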
The other paper is called “Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation”. We introduced a framework that can learn depth, camera motion, optical flow, and motion segmentation from data in an unsupervised way. The method achieves state-of-the-art results on the four tasks. I should say proudly that it's an amazing achievement, and I feel very happy for the student intern.
What's very interesting is not just the results on these four benchmarks; it’s more the framework, which we call “competitive collaboration”, because it introduces a mechanism to train several related neural networks. During some training phases, the neural networks compete against each other. During other phases, they collaborate. That’s why we call it competitive collaboration. They are playing games: sometimes they collaborate, sometimes they compete. But the goal is always to optimize the overall objective to achieve the best result. This trains several related neural networks, not a single one. We found it can be very effective.
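As an illustration (not from the interview), the alternation between competing and collaborating phases can be sketched with a toy problem. This is a loose analogy under stated assumptions, not the paper's method: two scalar "models" `a` and `b` stand in for the neural networks, a boolean `mask` stands in for a component that decides which model explains each sample, and the function name `competitive_collaboration_toy` is hypothetical. On this toy problem the procedure amounts to a two-cluster k-means.

```python
def competitive_collaboration_toy(data, a=0.0, b=10.0, rounds=5):
    """Alternate a competition phase and a collaboration phase (toy analogy).

    mask[i] is True when sample i is assigned to model `a`, False for `b`.
    """
    mask = [abs(x - a) <= abs(x - b) for x in data]
    for _ in range(rounds):
        # Competition phase: the mask is fixed, and each model fits only
        # the samples it has been assigned, so the models compete for data.
        own_a = [x for x, m in zip(data, mask) if m]
        own_b = [x for x, m in zip(data, mask) if not m]
        if own_a:
            a = sum(own_a) / len(own_a)
        if own_b:
            b = sum(own_b) / len(own_b)
        # Collaboration phase: the models are fixed, and together they
        # update the mask by assigning each sample to the better explainer.
        mask = [abs(x - a) <= abs(x - b) for x in data]
    return a, b, mask


data = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
a, b, mask = competitive_collaboration_toy(data)
```

Each phase holds one part of the system fixed while the other improves, and both phases reduce the same overall error, which mirrors the description of several components being trained toward one shared objective.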
Wenli: What are some of the future applications for which you can use these algorithms?
I think for the pixel-adaptive network, if we don't consider computation cost, we can replace the standard convolution layer with the pixel-adaptive convolution and obtain an improvement. There will be many applications that we can apply it to. And for competitive collaboration, first, the tasks themselves, depth estimation, motion estimation, camera pose estimation, and motion segmentation, are all applications. We can also apply it to unsupervised feature learning. One analogy I would like to draw is infant learning. In the beginning, infants just look around the world and there is not much strong supervision. So they can do unsupervised or self-supervised learning to build some features. Then later, if you tell them that this is a tree, they will quickly learn to recognize the tree from very few examples at that stage.
Wenli: Thank you so much for joining this platform and sharing with us your experience and opinions on this.
Thank you for inviting me. It's such a great pleasure to talk with you.