Updated: Feb 15
This episode is an interview with Josh Tobin, a former OpenAI researcher, discussing highlights from his paper, Geometry-aware Neural Rendering, which was accepted as an oral presentation at the NeurIPS 2019 conference.
Josh Tobin is a researcher working at the intersection of machine learning and robotics. His research focuses on applying deep reinforcement learning, generative models, and synthetic data to problems in robotic perception and control. He did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel and was a research scientist at OpenAI for 3 years during his PhD.
Interview with Robin.ly:
Robin.ly is a content platform dedicated to helping engineers and researchers develop leadership, entrepreneurship, and AI insights to scale their impacts in the new tech era.
Subscribe to our newsletter to stay updated on more NeurIPS interviews and inspiring AI talks:
Paper At A Glance
Understanding the 3-dimensional structure of the world is a core challenge in computer vision and robotics. Neural rendering approaches learn an implicit 3D model by predicting what a camera would see from an arbitrary viewpoint. We extend existing neural rendering approaches to more complex, higher-dimensional scenes than previously possible. We propose Epipolar Cross Attention (ECA), an attention mechanism that leverages the geometry of the scene to perform efficient non-local operations, requiring only O(n) comparisons per spatial dimension instead of O(n²). We introduce three new simulated datasets inspired by real-world robotics and demonstrate that ECA significantly improves the quantitative and qualitative performance of Generative Query Networks (GQN). [presentation slides]
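The O(n) saving comes from attending only along the epipolar line in the source view, rather than comparing against every position in the source image. Below is a minimal sketch of that idea in NumPy; it is not the paper's actual ECA implementation, and all shapes, names, and the toy "epipolar line" are illustrative assumptions:

```python
import numpy as np

def epipolar_attention(query, source_feats, line_coords):
    """Attend from one query-view feature vector to source-view features
    sampled along its epipolar line.

    query        : (d,)  feature vector at one query-image position
    source_feats : (H, W, d) source-view feature map
    line_coords  : (n, 2) integer (row, col) samples on the epipolar line
    Returns the attended (d,) context vector.
    """
    # Gather only the n features on the epipolar line: O(n) comparisons
    # per query position instead of O(H*W) over the full feature map.
    keys = source_feats[line_coords[:, 0], line_coords[:, 1]]  # (n, d)
    scores = keys @ query / np.sqrt(query.shape[0])            # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                   # softmax
    return weights @ keys                                      # (d,)

# Toy usage: a 16x16 feature map; for simplicity we pretend the
# epipolar line for this query pixel is image row 5 of the source view.
H = W = 16
d = 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((H, W, d))
q = rng.standard_normal(d)
line = np.stack([np.full(W, 5), np.arange(W)], axis=1)  # (16, 2)
ctx = epipolar_attention(q, feats, line)
print(ctx.shape)  # (8,)
```

In the real setting the line coordinates would come from the cameras' relative pose (the epipolar geometry), not from a hard-coded row.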
Wenli: We’re at NeurIPS 2019 with Josh Tobin, a former researcher at UC Berkeley and OpenAI. Nice to meet you, and thank you for joining us here.
Great to meet you as well. Thanks.
Wenli: You're here because you have a paper that recently got accepted. Congratulations!
Thanks so much.
Wenli: The paper is about “Geometry-aware Neural Rendering”. Can you introduce what the paper is about?
The goal of the paper is, we want to help robots understand the scenes in the world that they're interacting with. Typically, the way you do that in robotics is, you have some state representation of the world. It’s things like, where are all the objects, what pose the robot is in, where the robot is, etc. The challenge is that those types of scene representations are really difficult to scale to more complex scenes, where you have a lot of objects and the objects themselves are really complex.
The topic of the paper is doing implicit scene representations. What that means is, you take some observations of a scene - imagine some camera images that capture the scene from different viewpoints - and you want to train a model that internally has some understanding of what's happening in the scene. The way we do that is using a formulation called “neural rendering”. The way that neural rendering works is, you train a neural network that takes as input one or more viewpoints of the scene - the camera looking at the scene from above, from the left, from the right. And the goal of that model is, given some other arbitrary viewpoint, like over here where it's never seen the world before, to accurately render what the world would look like from that viewpoint. If you can do that well, the intuition is that, internally, the model has to have some representation that understands everything that's happening in the world.
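The setup Josh describes - encode a few (image, viewpoint) observations into one implicit scene code, then decode that code from an arbitrary query viewpoint - can be sketched as follows. This is a toy illustration only: random linear maps stand in for trained encoder/decoder networks, and the dimensions are made up, so it shows the data flow rather than the actual GQN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the real model's sizes): a flattened
# "image", a 7-number camera viewpoint, and a scene-code vector.
IMG, VIEW, REP = 64, 7, 32

# Random matrices stand in for the trained encoder and decoder.
W_enc = rng.standard_normal((IMG + VIEW, REP)) * 0.1
W_dec = rng.standard_normal((REP + VIEW, IMG)) * 0.1

def encode(image, viewpoint):
    # Turn one (image, camera-viewpoint) observation into a code vector.
    return np.tanh(np.concatenate([image, viewpoint]) @ W_enc)

def scene_representation(observations):
    # Aggregate the per-observation codes into one implicit scene code.
    return sum(encode(img, vp) for img, vp in observations)

def render(scene_code, query_viewpoint):
    # Predict what the camera would see from a new, unseen viewpoint.
    return np.tanh(np.concatenate([scene_code, query_viewpoint]) @ W_dec)

# Observe the (fake) scene from three viewpoints, then render a fourth.
obs = [(rng.standard_normal(IMG), rng.standard_normal(VIEW))
       for _ in range(3)]
r = scene_representation(obs)
prediction = render(r, rng.standard_normal(VIEW))
print(prediction.shape)  # (64,)
```

Training would compare `prediction` against the ground-truth image from the query viewpoint, which is what forces the scene code `r` to capture the 3D structure.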
Wenli: Besides the research area you're focusing on, what is industry doing right now about this problem?
I don't think that this is a super-explored problem in industry right now.
Wenli: So this is one of the bottlenecks that we can probably solve.
Josh Tobin: Yes. I see this as being a long-term interest in industry. For me, the way that this fits into the broader picture of doing robotics is that one really useful technique for training robots that can work in the real world is to take advantage of simulated training data. Simulated training data is cheap and scalable - you can generate infinite amounts of it. So if you can use that data, then it's really useful.
But one of the challenges is that, it's difficult to construct simulations of the world. The general direction that this research goes in is, can we make it easier to take a small number of observations about the world? For example, I see a couple over here and another couple over there, can you use that information to construct a simulation? So that's the direction this work is going into, and that's how I see it fitting into the broader picture.
Wenli: Is that the main contribution of your paper here?
The main contribution of my work is, there's a previous line of work that I'm building on that addresses this problem. And the main thing that our work contributes is that we add an additional attention mechanism - an additional neural network primitive, an additional part of the structure of the network. That attention mechanism leverages information about the 3D geometry of the scene. It uses some facts from classical 3D computer vision to make the search process - looking for relevant information - much more efficient.
Wenli: You're trying to solve the limitations that we have right now in robotics. Would you mind sharing what limitations the current research is facing?
I talked about this broader goal of using data to construct simulations, and this work is pushing in that direction, but it's a very early step. For example, this work has only been tried in relatively simple scenes. So far it only works in simulation, and it only works for static scenes. So we’re understanding a scene at one slice of time. But in the real world, scenes evolve, robots move objects, objects interact with each other, and physics happens. So I think this work doesn't address any of the complexity that arises in the real world.
Wenli: What's your next step?
I'm not sure yet what I'm going to do next on this research. I think there are a lot of possible directions to go with it. One really natural one is to try to apply this type of research to real-world data. So I think that would be an interesting direction. And another would be to try to apply this line of research to dynamic scenes - scenes where objects move around and interact with each other.
Wenli: That’s really interesting. I'm really excited for you and your future.
Me too. I think it should be really interesting.
Wenli: Thank you so much for joining us here. And if you're interested, please go read Josh's paper or contact Josh directly.
Thank you so much.