Updated: Aug 26, 2019
This episode is a live recording of our interview with Jason Corso and Brent Griffin at the CVPR 2019 conference. Jason Corso is a professor at the University of Michigan and a recipient of the NSF CAREER award (2009), the ARO Young Investigator award (2010), and a Google Faculty Research Award (2015), and serves on the DARPA CSSG. He is also the co-founder and CEO of Voxel51, a computer vision startup building a state-of-the-art platform for video- and image-based applications. Brent Griffin is an assistant research scientist in robotics at the University of Michigan.
Their paper "BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames," presented at CVPR 2019, addresses the problem of selecting the best frame for user annotation during video object segmentation. During the interview, they shared the paper's key contributions and applications, and discussed Corso's startup Voxel51.
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Subscribe to our newsletter to stay updated on more CVPR interviews and inspiring talks:
Wenli: We have Jason Corso here. He is a professor at the University of Michigan. And we also have Brent Griffin, an assistant research scientist at the University of Michigan. We're here to talk about their newly submitted and accepted paper, "BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames”. Congratulations, and thank you so much for being here. Let's talk a bit about your paper. What is it about?
Sure. With the introduction of really high-quality datasets like DAVIS (Densely Annotated VIdeo Segmentation), we've seen a lot of new algorithms for video object segmentation appearing at CVPR and other conferences. Among the video object segmentation methods we've seen, semi-supervised algorithms have by far the best performance. The way these algorithms work is that a person provides an annotation frame: they draw the boundary of the object they want segmented in the rest of the video. The semi-supervised algorithm is trained on that annotation frame and then segments the object for the rest of the video. What we've seen is that even though the current paradigm in all these dataset benchmarks is to just use first-frame annotation, the performance actually changes dramatically depending on which annotation frame is used. So given that a user is going to annotate a frame for these algorithms anyway, you should use the best annotation frame possible: you have the same amount of user input, but you get better performance. This work focuses on learning to select that frame automatically from an untouched video, making the best use of a person's time.
Wenli: So I'm guessing that would be the biggest contribution of your paper.
Yeah. For a lot of computer vision algorithms, one thing is identifying the problem; another is figuring out how we can actually learn to solve it. So part of the paper was just identifying that. There are all these different frames you can select. Here's the ceiling on how good performance can be if you select the best annotation frame; here's the floor, how poor segmentation is if you choose the wrong annotation frame. What we show is that first-frame performance is closer to that floor than it is to the ceiling. The next part is figuring out how to train an algorithm to make these selections automatically, and essentially, a lot of the innovations in BubbleNets are about figuring out how to make that work.
And I think if you put it into context, the computer vision community hadn't really asked the question of which frame is best. It was just sort of accepted that we use the first frame because it's available. So this paper also asks: how can we best leverage the available human to train the best model to work on the dataset?
Wenli: Was that what triggered you guys to work on this paper? You realized that there's this area that nobody has been really working on?
I was actually motivated by a very real problem. I'm working on a media forensics project, and we are providing software to a company called Par (Par Government) so that they can generate manipulated videos automatically. Part of how we made this framework work is that you could remove objects: essentially, just from the user's annotated object, we had a process that would segment it throughout the entire video and then go through and inpaint that object out of the video. The motivation for this project is to learn, or see, how well manipulated media can be detected automatically. Michigan was part of the team trying to create videos that were difficult to detect. So this is about high-quality manipulations, where we know exactly how the manipulations occur.
Wenli: How do you set the benchmarks and criteria to define how well manipulations can be detected? Because it's such a new area.
In this particular problem, we started out using the regular paradigm of just using first-frame annotation. But what we found is that when you're trying to remove an object from video, you generally start right when the object is entering the video, and then maybe it walks across the camera, something like that. An annotation of an object just coming into view is actually very poor, and we found that even just using the middle frame works well.
But it was hard for people on Par's side to figure out which annotation frame to select, and they asked us if we could automate the process. At first, the idea that a computer vision algorithm could just select the annotation frame automatically seemed insane. But we were able to leverage the DAVIS dataset and find a way to take the dense ground-truth annotations in that dataset and convert them so that we have 750,000 training examples from 60 original videos, which we can use to train BubbleNets and then make these annotation-frame selections.
That was a hard part of the problem: how to take advantage of this very limited set of annotated videos and generate training examples at scale, and how to do something that actually interprets a video, which involves a lot of parameters that can easily overfit on your training set if you don't have many training examples.
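The core of that scaling trick is combinatorial: a video with per-frame performance scores yields one labeled example per ordered pair of frames, not one per frame. Here is a minimal sketch of the idea (not the exact BubbleNets pipeline, which also conditions each comparison on sampled reference frames); `pairwise_examples` and the scores are hypothetical names for illustration.

```python
from itertools import permutations

def pairwise_examples(frame_scores):
    """Turn per-frame performance scores for one annotated video into
    ordered-pair training examples. Each example asks the relative
    question: is frame i a better annotation frame than frame j?"""
    return [((i, j), frame_scores[i] > frame_scores[j])
            for i, j in permutations(range(len(frame_scores)), 2)]

# A 5-frame toy video yields 5 * 4 = 20 labeled pairs; this quadratic
# growth is how a small set of densely annotated videos can expand
# into hundreds of thousands of training comparisons.
examples = pairwise_examples([0.3, 0.8, 0.5, 0.6, 0.1])
```

The relative formulation also sidesteps calibrating an absolute "frame quality" score, which is much harder to supervise from limited data.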
And I guess that's another way of looking at how BubbleNets works. The naive thing we could have done, which we did first, is: here's one of the frames in the video, which frame is best to annotate, and just do direct regression on the frame quality, if you will. There was just not enough training data for that. So instead, we adapted a very old computer science algorithm, bubble sort. We look at pairs of frames and just make a comparison: which of these two frames is better? By doing these pairwise comparisons in the context of the rest of the video, we got a much larger training dataset, marrying modern deep learning with the very old bubble sort.
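At inference time, the learned pairwise comparisons drive the selection exactly the way bubble sort drives sorting. A minimal sketch, with a stand-in `prefer(a, b)` in place of the learned comparison network (the real network also looks at sampled reference frames from the rest of the video):

```python
def select_annotation_frame(frames, prefer):
    """Pick an annotation frame by bubble-sorting with pairwise
    comparisons. `prefer(a, b)` stands in for the learned network:
    it returns True if frame `a` is predicted to be the better
    annotation frame. Preferred frames bubble toward the front."""
    order = list(range(len(frames)))
    n = len(order)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            # Swap adjacent frames whenever the later one is
            # preferred, moving better frames toward index 0.
            if prefer(frames[order[j + 1]], frames[order[j]]):
                order[j], order[j + 1] = order[j + 1], order[j]
    return order[0]  # index of the predicted best frame

# Toy stand-in: pretend each "frame" is a scalar quality score, so
# the comparison network reduces to a greater-than test.
best = select_annotation_frame([0.2, 0.9, 0.4, 0.7],
                               lambda a, b: a > b)
```

Note that only relative judgments are ever needed, which is why the pairwise training data described above is sufficient to drive the whole selection.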
Wenli: Is that the biggest breakthrough in your paper?
Yeah. In addition to motivating the problem and showing how much performance can change based on which frame you select, it's formulating this in that bubble-sorting framework so that it becomes a meaningful machine learning, or deep learning, problem we can actually solve. And I think this is just the beginning. By no means am I suggesting that BubbleNets is the end of this, or peak performance; there's still performance left to be had. We hope other researchers can get behind this. So it's a great opportunity to, one, let them know: don't just use first-frame annotation, stop doing that. And two, help us figure out how to solve the problem of automatically selecting the annotation frame. BubbleNets is a meaningful first step in that process.
I think it's just a beginning for the video object segmentation community. But it's also a different way of looking at problems in some sense. This is an instance of a research direction we have in our lab called "hybrid intelligence," which is really about marrying humans with computers in a way that leverages the best of both worlds. In BubbleNets, we ask the human to label one frame; in another paper we had a couple of years ago, we instead asked the human to click on one specific key point of a vehicle. We're trying to develop models in which the computer algorithm learns how to best leverage the available human at inference time, which I think has great potential to bring practical AI innovation.
Wenli: Yeah, definitely. Like you said, this is just the first step. But at this point in time, do you see any business applications that are going on or that could use these algorithms?
Object segmentation in general, I think, can be very useful in robotics and also in autonomous vehicles. Even in our lab, we have robotics work in a separate project where we use video object segmentation for visual servo control, and we're able to track objects very quickly, effectively using all this machinery developed in computer vision for robot control applications. That work is with the Toyota Research Institute, and they're very excited about the progress we've been able to make. So that's just one example, even within our own lab. I think these rich annotations that machines can generate very quickly and efficiently will have a lot of applications that maybe haven't been explored in traditional robotics, compared to more 3D, RGB-D-type vision.
And I think we'll see a lot of interest from a more socio-political mindset as well. As Brent said, the original motivating problem was detecting whether or not an image or video has been manipulated, and this is a tool that helps with manipulation. The only way to improve our ability to detect manipulation is to improve our understanding of how to manipulate video. Even on the flight over here, I was reading an article about the potential impact of manipulated video on the upcoming election.
Wenli: Exactly. They're making fake videos of presidents talking.
Yeah, it is intensely relevant.
Wenli: But it’s left to society to decide what's good and bad.
When manipulating video, for example, we leave artifacts behind, because by the nature of the problem we are changing pixels. So I think it's not a given yet that video and images will be able to be manipulated perfectly, nor that we won't be able to detect them when they are. It's an important problem, from both a scientific and a social point of view.
Wenli: Exactly. But when the video is not that clear, say it is highly compressed, will you still be able to identify the fake videos? Is that going to be a challenge?
There are lots of different ways to detect videos or images that have been manipulated. One of those is what Jason's talking about: finding physical artifacts. If we miss a little bit of an object when we're segmenting it and then inpaint it, there might be just a little bit of the sole of a foot left behind as it walks, or a shadow, if that wasn't included in the segmentation all the way through. That's a physical artifact you can see in the video.
But there are also things like double JPEG compression, or cases where the semantic information about the video says it happened on this day at this location, and what we know from other available data is that that can't be possible, based on weather patterns, or GPS information of vehicles, this kind of thing. So actually, there are a lot of ways to disprove whether a video is real, not to mention that if you can get the original raw video and compare it with the manipulated one, you can at least establish there is a dispute.
Wenli: What are the other challenges that you are facing right now in this topic?
We are continuing the direction of hybrid intelligence. There are many questions that could be asked here. A related project we're working on in the lab is the ability to take dash cam videos that were recorded and posted on YouTube. These capture rare traffic events. In autonomous driving, or with ADAS, we need a lot of data to train models that can handle different situations, but a rare traffic event, like a near-miss with a pedestrian or an actual accident, happens only once every tens of thousands of miles. So it's important to have that data.
So we are working on a project that farms that type of data from YouTube, where it is abundantly available. But it's only monocular data, so we can't do a full 3D reconstruction from it directly, and we need that for simulation and for training. So we are working on asking humans to answer questions that help us do that 3D reconstruction on the fly. Can you tell us the make of the vehicle, which we can't determine with an automatic method? Can you draw a line from the front right tire to the back right tire so we can estimate the size of the vehicle? Or, in this video, we automatically draw a box around the two vehicles in the accident, and then see how that works, for example. So there are many questions about how we can best leverage humans in the inference and processing loop.
Wenli: But it's also exciting. This field is both exciting and challenging.
Jason Corso: Yeah, for us challenge is excitement.
Wenli: That's the right mindset. You're a scientist. I know that you started your startup, Voxel51, in December 2016, and you just launched the "AI for Video" platform and the Scoop product at CVPR 2019. Tell us more about this startup.
Sure. We began technology development in 2016, and we envisioned a software platform that would enable both computer vision experts and non-experts to leverage the advances we're seeing in computer vision and machine learning, at a scale that's very hard to do well. We incorporated in October 2018, so we had about two years of technology development before formally becoming a company.
And we are here at CVPR launching the first version of the platform. It has three main use cases. The first is: say you are a computer vision or machine learning expert, and you can build a model to do video object segmentation, for example, but you don't know much about the backend, about how to turn that into a system that can scale to process thousands of hours of video in a day. Instead of learning to use the major cloud providers' platforms, which are hard to use and take a lot of training, you can take Voxel51's SDK, basically drop your model into it, and deploy on the platform within a few hours.
Wenli: You provide a Red Bull to the scientists.
In some sense, Red Bull for video; I love it. So we help them achieve their goals faster. We have one company in the privacy space called EgoVid. They have an analytic that takes video as input and outputs new video with the identity of the face changed: it maintains the facial expression but changes the identity for privacy reasons.
Wenli: Just like one of the “Black Mirror” episodes, kind of scary if we think deeper.
They developed the computer vision technology, but they don't have the resources to do the backend work, so they basically deploy on our platform. That's one use case.
The second use case is when you have a lot of data but you don't necessarily have the resources to pay the human annotation companies, which are abundantly available these days but very expensive; one frame of labeling costs about 30 or 50 cents. Instead, you can deploy your video onto the platform and connect it to what we call our "senses." We have vehicle sense, road sense, and person sense, and these do rich types of labeling and annotation. For vehicles, for example, we can do vehicle make, vehicle type, color, and pose, all with accuracies above 90%.
Wenli: So you will be collaborating with all the tier one companies that need those data.
Jason Corso: Yeah. We’re looking to do that indeed.
And the third use case is that the platform is really a platform for application developers. We have built a first application on the platform, called Scoop, to demonstrate that capability. That's what we are launching today. Scoop lets you take large datasets and gain insights into their contents with no training whatsoever. It's a very easy-to-use interface that lets you quickly do what we call "faceted searches." For example, say you're working on pedestrian safety at intersections and you have thousands of hours of data. Not all of it is useful for training models or doing accuracy assessments for your autonomous vehicle. You could use Scoop to find only the data at intersections that has at least two pedestrians at a certain time of day, or something like that.
Basically, any ontology can be used. If you have your own labels, you can upload them into Scoop and play with them, and so on. We use it internally too. Generally, we find that data scientists and computer vision engineers spend anywhere from three to four days a week manipulating data instead of actually doing the science of model development and experimentation. We've seen that ratio flip: you can spend most of your time actually working with the model instead of manipulating data.
Wenli: So in a startup, there's a lot of excitement going on, but also challenging work that needs to be done.
Yes, it is.
Wenli: Well, best of luck to both of you, especially with your new startup. Thank you so much for coming to our platform to share with us.
Jason Corso & Brent Griffin: Thank you.