Unsupervised Depth Perception

Data Skeptic · Intermediate · 🏗️ Systems Design & Architecture · 9y ago

Key Takeaways

The video discusses unsupervised depth perception: a deep learning architecture that learns depth and camera pose from unlabeled videos, as proposed in the paper 'Unsupervised Learning of Depth and Ego-motion from Video'.

Full Transcript

Data Skeptic is the official podcast of dataskeptic.com, bringing you stories, interviews, and mini-episodes on topics in data science, machine learning, statistics, and artificial intelligence.

Tinghui Zhou received his BS summa cum laude in computer science from the University of Minnesota and then went on to earn his master's degree in robotics from Carnegie Mellon University. Tinghui is currently a fourth-year PhD student at the Berkeley AI Research lab under the supervision of Professor Alexei Efros. His research areas include computer vision, machine learning, and computer graphics, with a primary focus on learning-based 3D visualization, synthesis, and understanding from 2D images. Tinghui, welcome to Data Skeptic.

Hi, thanks for having me.

I think a good starting point for our discussion would be to learn first about the problem you were solving in the paper. Can you describe your data set, which I believe is a collection of videos, and what you wanted to learn from that data?

The goal of this project is to teach the computer to understand the 3D geometry of the scene it's observing, and its ego-motion during exploration of the environment, from watching unlabeled videos. In this particular case we use the KITTI and Cityscapes data sets. These data sets are basically a car driving in urban landscapes and capturing videos as it drives. The motivation of this problem setup is to mimic how humans obtain visual experience for learning about geometry. For humans, our visual experience consists solely of 2D image streams from moving around and observing the 3D world. So our hope is that by letting the computer watch tons of videos and try to come up with a consistent explanation of the visual world, it will become capable of geometric understanding like humans are.

Could we unpack that a little bit more? What do you mean by geometric understanding?

For humans, given a single image of, say, an indoor scene, we're able to infer its geometry, like the depth
ordering of the objects: we know some objects are in front of us and some objects are far from us, even from a single image. This problem is actually mathematically under-constrained if you want to solve it from a single image. Humans are able to do it because we have seen similar scenes and similar objects in our past visual experience, so we have built a consistent model of the 3D world.

Are you familiar with the technique of forced perspective that they use in Hollywood movies sometimes?

Actually, I'm not familiar with that. Can you tell me a little bit about it?

So forced perspective is a kind of special effect. I guess it's one of the earliest ones that were used in filmmaking. They used it a lot in The Lord of the Rings movies to make the hobbits look proportionally smaller than the human characters, even though the actors may have been more or less the same size in reality. You could put two actors on different planes but orient the cameras so that they appear to be in the same 2D space, and then they can act at each other to be convincing despite one person being several feet back. And then of course you dress up the set, and it's a bit like an optical illusion. So this is an effect that can fool us: our vision doesn't correctly assess the scene. But you could also think of another situation, like a golf ball and a basketball being in the same 2D space and having the same apparent diameter. Of course you and I have outside information about the world. We know golf balls are much smaller than basketballs, so if they appear to be the same size, the golf ball must be closer. So we see something that an algorithm wouldn't necessarily see if things like forced perspective are at play.

Right, exactly. That's basically a prior that humans have built from past visual experience.

So it's interesting, then. You and I as human beings bring that prior to the table, and your unsupervised system starts at a disadvantage because it doesn't have the training
that you and I have had in our many years of living. How does that disadvantage work out? Does the system learn geometry in the scenes you've presented it?

Actually, I don't see it as a disadvantage. We also don't have an accurate 3D sensor to know the exact 3D geometry of the scene we're seeing. We might not even have a very accurate ego-motion estimation system in our brain. Even with this kind of setup, we are able to do this job very well, purely by observing 2D images. In our project it's exactly the same setup: the computer is only allowed to see 2D images or 2D videos during training, and it is able to infer 3D geometry during testing.

Yeah, I think one of the more remarkable aspects of the paper is that it's unsupervised. Maybe to start with, can you tell me about the motivation for that design choice?

In our case, the distinction between supervised methods and the unsupervised method is that in supervised methods you typically give the deep neural network supervision, like depth measurements from a laser scanner or Kinect, or ego-motion signals taken from the car's odometry. In our case, we don't let the computer have access to these signals, and it should learn to do these tasks purely from watching 2D videos. The motivation is, first, to mimic how humans learn from moving around and observing the 3D world, and second, that this allows our system to be trained on more flexible data sets, potentially internet videos, where you don't have access to 3D measurements from sensors.

Yeah, it does seem like there's a lot more unlabeled data available in the world than labeled data.

Right, exactly.

So of course the training does take place, though. Can you describe a little bit about the architecture of your network and how it gets feedback?

Our approach is largely inspired by the classical computer
vision problem called image-based rendering, or view synthesis. In view synthesis, you're typically given one or multiple views of a 3D scene, and the goal is to synthesize other views of the same scene from different camera angles. For instance, you're given maybe one view of the Eiffel Tower and you want to be able to synthesize what it looks like from different camera angles. To solve this problem, one first needs to know the scene geometry as well as the relative camera poses between the input views and the target view. Then, based on the geometry and the camera pose, the target view can be synthesized by basically projecting the pixels from the input views to the correct pixel coordinates in the target view. In order to do the view synthesis task well, one has to have correct scene geometry and camera pose inference modules. So in our work we basically use deep networks as inference modules and formulate the entire view synthesis pipeline in an end-to-end differentiable manner. That allows us to train deep networks to do geometry and pose inference without any explicit labels, purely using the task of view synthesis for supervision.

Let's take a break from our show and talk about our sponsor, Periscope Data. Periscope is the dashboarding tool for teams that lets you rapidly create charts from SQL queries. Let's be honest with ourselves here: if you're a data scientist, you've probably got some pretty impressive SQL chops, right? You go straight to the data to get your answer. But now how do you visualize your results? Whether that's a quick histogram or a more elaborate dashboard, it's just a few clicks in Periscope Data. Sharing it is just one more click, and once you've got it done, it's done for good. Your collaborators can check back for an updated version anytime. No more follow-up emails asking for a refresh, no confused people looking at old versions of the data. Periscope Data solves all those problems for you. You can check it out for yourself at
periscopedata.com/skeptics. One more time, that's periscopedata.com/skeptics.

Am I correct in saying that essentially you're trying to predict the next frame, or maybe one or two frames later, and that's where the loss function can measure its accuracy?

Yeah, that's one way to put it. Basically you try to predict what the scene will look like in other frames based on the predicted geometry and camera pose. So if your estimates of the geometry and camera pose are correct, then you will be able to do this frame prediction very well.

I think it's a very novel approach. I like the way the network is set up and the architecture, and we'll actually get into some more of that in a few questions. But I was curious about the nature of some of the training data. I'm a little bit familiar with Cityscapes. I've watched some of the contents of that corpus, and I know that for the average car driving down the street, it is somewhat predictable from frame to frame. But there are also unpredictable things: a bird flies through, or even making a left-hand turn introduces a lot more new information than driving straight. How does the network function in cases where there are moments when the frames are very predictable, but other times when, independent of the learning, the scene is a little bit more dynamic?

You mean when the scene has moving objects?

Yeah.

Yeah, that's a very tough problem to tackle right now. In our paper we have a mechanism for predicting whether a particular pixel belongs to a moving object, or whether the pixel will be occluded in the next frame, things like that. We call it the explainability prediction. So basically we let the network not only predict scene geometry and camera pose, but also predict whether each pixel could be explained in the next frame or the nearby frames purely using the view synthesis formulation. By taking this into account, we can have a more robust system,
where the network can be trained in a more robust way when there are moving objects or objects being occluded or disoccluded in nearby frames.

Just to double check, in case some listeners aren't familiar, can you give a rough definition of what occlusion is and why it's such a problem in tasks like this?

Occlusion basically means that you have some object that you can observe in the current frame, and then when you move to a different camera location, this object might become occluded by another object that is in front of it. For instance, suppose you have two objects, one in front of the other. The second object is currently occluded: many of its pixels are not visible from the current camera viewpoint. But when the agent has moved to the side, the depth ordering between these objects has changed, and those pixels become visible again, or disoccluded.

Yeah, it's definitely an added challenge for a computer vision researcher, for sure.

Well, I guess in our formulation of using view synthesis as the training signal, this is particularly problematic, because the view synthesis formulation assumes that all the pixels in a target frame will also be visible in the nearby frames. We need some mechanism to explicitly model these occlusion or dynamic object effects.

Can you tell me about how that's built into the network architecture?

So basically we have an additional branch of the network that predicts whether each pixel in a target frame could be explained by the view synthesis formulation. Here the explainability encodes a lot of factors, such as dynamic objects, occlusion, or reflections: all the factors that cannot be explained by our view synthesis formulation. By predicting these masks, we can then weight the loss function based on this explainability. Basically, when we train the network, if it believes that some of the current pixels are not explainable by the view synthesis formulation, then it will
not try to get the gradients for training from those pixels. So this is basically a robust way of dealing with those unexplainable factors while still being able to train using this formulation.

Yeah, I thought that was a really novel approach to doing this. I like that aspect of the paper. Every so often we bump into problems on this podcast because it's an audio show and we obviously can't have visuals. So I will definitely direct people to look at your paper, which will be in the show notes. But for those who can't do that because they're driving or whatever, can you describe what the explainability mask looks like when mapped onto an image? I found that the visuals you had in the paper were very helpful for me in understanding what the explainability mask was.

So the Cityscapes data set is a car driving data set, and you see things like pedestrians walking in the streets. In our case, when the explainability mask sees pedestrians walking, it is able to tell that these are dynamic objects that are not explainable by the simple depth-based view synthesis formulation. Another example would be things that are visible in the target frame but become invisible in the nearby frames because the car has moved. Those pixels will be masked as unexplainable as well. But we also see some other factors where we do not quite understand why the network decided to mask them as unexplainable. It's not just handling dynamics and occlusion; it's also learning some other factors which we cannot interpret very well right now.

Yeah, but nonetheless it's an unsupervised approach, so the fact that it works at all is almost amazing to me on some level. And I thought that was a novel introduction, that you would then be able to weight the loss function. You're almost filtering out the things that are too dynamic or chaotic. Is that the way you guys look at it?

Basically, the motivation is to improve the
robustness of our learning pipeline to those factors that cannot be explained by the view synthesis formulation. By weighting the gradients based on the explainability mask, you're able to filter out those noisy signals so the network can learn to do the tasks you're interested in in a more robust way.

I'm curious about how you see this potentially fitting into a driverless car scenario. How would a system like that be able to benefit from the research you've done?

Self-driving cars, for instance, would probably need very accurate and very dense geometry reasoning capabilities. Currently most self-driving cars are basically using LiDAR sensors, and usually they only give you a sparse point cloud. So with this approach you could potentially densify sparse point clouds obtained from a LiDAR sensor. Also, our ego-motion estimation module turns out to be pretty accurate compared to traditional approaches like SLAM. SLAM is short for simultaneous localization and mapping. In addition to the geometry, we also provide an estimate of the camera pose, or the trajectory of the camera. That could also be very useful for self-driving scenarios where you want to know exactly where the car moves.

So as you'd mentioned, the self-driving car has a lot more telemetry available than a typical scenario. It has the LiDAR and maybe some other auxiliary systems, mainly because it's so important that the self-driving car know exactly what's around it in order to provide safety. But there are other applications, as you mentioned, like YouTube videos, where you generally do not have any LiDAR data or any 3D information. Now we're in a scenario where your methods may be applicable, whereas what's developed for self-driving cars could never apply there. I was curious if there are any specific areas or applications you were excited about taking this research.

Being able to do single-view geometry reasoning has lots of
applications. For instance, when you take a selfie, you may want to add depth effects to the portrait you've just taken. You can already see this effect in the iPhone 7 or iPhone 7 Plus. Basically, you could try to focus the image on the foreground portrait instead of the background automatically. With our approach, if it's able to train and work well on general internet videos, then you can refocus your photos based on predicted depth to give a more visually compelling effect. Another application might be to reconstruct in 3D the things you have captured. If the system is able to train and work well on general videos, you will be able to reconstruct different objects and different scenes, and maybe have some sort of VR application where you allow the user to move to different camera angles and explore the scene from many different viewpoints for an immersive experience.

Obviously your technique has some disadvantages compared to the self-driving car example I keep going back to, because they have LiDAR and all these other supervised features available. So if you could even get close to their results, it would be massively impressive. Could you tell me a little bit about the benchmarks you described in the paper?

So we compare to two types of approaches. One is the method that uses laser scanner measurements as supervision to train the depth estimation system. The other is a more recent approach that doesn't assume depth but assumes registered cameras. In this setup there are stereo cameras capturing stereo videos during training, and then you do single-view depth estimation during testing, just like in our case. We are comparable to some of these approaches, but we're definitely not state of the art in this single-view depth estimation task because of the unsupervised nature.
I think one particular paper that shows quite impressive performance is the paper by Clément Godard; the paper title is 'Unsupervised Monocular Depth Estimation with Left-Right Consistency'. They assume stereo cameras during training but also just do single-view depth during testing. I think there's still some improvement to be made to reach laser-level performance, but I'm pretty excited about being able to train on general videos: allowing the system to watch general videos and gain the capability of doing geometry and pose inference.

Maybe to wind up, I'd love to go through some questions about the general architecture of your deep learning network. Can you describe not just a little bit about the convolutional layers and that sort of thing, which of course are interesting, but also how you guys came to the decisions you made about how to architect it?

One critical decision we made about the architecture is to have multi-scale predictions. By multi-scale I mean that when we make a prediction, the network doesn't just output per-pixel depth at the original input image scale; it also predicts the depth for the image resized to a smaller scale. By doing this, you're able to derive gradients from a larger neighborhood of image pixels. Basically, the motivation for using multiple scales is to allow the gradients to be propagated from a larger neighborhood to facilitate the training of the network.

Can you tell me a little bit about the general architecture? How many convolutional layers did you have, how long did it take to train, that sort of thing?

We basically adopt the architecture from another paper. Nowadays it's called a U-Net architecture, where you connect features extracted at lower levels to higher levels when doing the prediction. So the U-Net is basically like the skip architectures people used before. The motivation is to be able to get final predictions by utilizing lower-level
features instead of just higher-level features that might have abstracted away too much detail.

So I wanted to wind up by asking what's next for this line of research. Do you have some further steps, or was this the conclusion to some work?

I'm interested in exploring more in this direction, in particular being able to explicitly make estimates about the dynamics. Currently we just have a mask that couples everything together and outputs a per-pixel prediction of whether it's explainable or not. It would be nice to actually factorize these components into more explicit predictions. We would be interested in having a more explicit estimate of the dynamics of the scene, being able to infer the motion of the pedestrians, where they are moving and how fast, or of the cars on the streets. This could be very useful for self-driving applications where you really want to understand not only the geometry but also the dynamics.

I think this is really interesting. Even though my initial attraction was, oh, let's learn more about how this can apply in self-driving cars, what I was pleasantly surprised to learn is that this technique has much wider and more general use cases available. So I'm very eager to see where the line of work goes.

Yeah, we're excited too.

Is there any place people can follow you online, on Twitter or a blog or anything like that?

There's one thing that we're going to launch very soon, which is an official blog by the Berkeley AI lab. We're going to have blog posts about the research done in our lab, and this paper will be one of the first few.

Well, fantastic. I'm looking forward to seeing that. Be sure to let me know; I'll share it on our site and our mailing list when it comes out.

Yeah, that'll be great. I definitely will keep you posted.

Excellent. Well, Tinghui, thank you so much for taking the time to come on and share some details
about your work with the listening audience. I think it's a really interesting paper and I'm glad we had a chance to chat about it.

Yeah, thank you very much for having me. It's very nice talking to you.

Data Skeptic is a listener-supported program. To support the show, visit dataskeptic.com and click on the membership tab.
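
The view synthesis step described in the interview (projecting pixels between views using the predicted geometry and camera pose) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with known intrinsics `K`, uses plain NumPy rather than a differentiable deep learning framework, and the function name `project_target_to_source` is hypothetical.

```python
import numpy as np

def project_target_to_source(depth, K, T_target_to_source):
    """Map every target-frame pixel to its coordinates in a source frame.

    depth: (H, W) predicted depth for the target frame.
    K: (3, 3) pinhole camera intrinsics (assumed known here).
    T_target_to_source: (4, 4) rigid transform from target to source camera.
    Returns an (H, W, 2) array of (x, y) source-pixel coordinates.
    """
    H, W = depth.shape
    # Homogeneous pixel grid of the target frame, shape (3, H*W).
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=0)
    pixels = pixels.reshape(3, -1).astype(np.float64)
    # Back-project each pixel to a 3D point in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)
    # Move the points into the source camera frame.
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    points_src = (T_target_to_source @ points_h)[:3]
    # Project onto the source image plane and normalize by depth.
    projected = K @ points_src
    coords = projected[:2] / projected[2:3]
    return coords.T.reshape(H, W, 2)
```

In a full pipeline one would then sample the source image at these coordinates (for example with bilinear interpolation) to synthesize the target view, and compare the result with the real target frame to obtain a training signal.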

Original Description

This episode is an interview with Tinghui Zhou.  In the recent paper "Unsupervised Learning of Depth and Ego-motion from Video", Tinghui and collaborators propose a deep learning architecture which is able to learn depth and pose information from unlabeled videos.  We discuss details of this project and its applications.
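
The explainability-weighted loss discussed in the interview can be illustrated with a small NumPy sketch. It is an assumption-laden example rather than the paper's code: the function name `masked_photometric_loss`, the absolute-difference error, and the `reg_weight` value are all hypothetical choices. The regularizer is included because a mask-weighted loss on its own could be driven to zero by predicting a mask of all zeros.

```python
import numpy as np

def masked_photometric_loss(target, synthesized, explainability, reg_weight=0.2):
    """Photometric loss weighted by a per-pixel explainability mask.

    target, synthesized: (H, W, 3) images.
    explainability: (H, W) mask with values in [0, 1]; low values mark
        pixels the view synthesis model is not expected to explain.
    reg_weight: hypothetical weight for the mask regularizer.
    """
    # Mean absolute color difference per pixel.
    per_pixel_error = np.abs(target - synthesized).mean(axis=-1)
    # Pixels flagged as unexplainable contribute less to the loss.
    photometric = (explainability * per_pixel_error).mean()
    # Assumed regularizer: penalize masks far from 1 so that
    # "explain nothing" is not a free solution.
    eps = 1e-7
    regularizer = -np.log(np.clip(explainability, eps, 1.0)).mean()
    return photometric + reg_weight * regularizer
```

With a mask of all ones this reduces to a plain photometric loss; lowering the mask on pixels such as moving pedestrians trades a smaller photometric term for a larger regularization term.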

This video discusses a deep learning architecture that learns single-view depth and camera ego-motion from unlabeled videos, using view synthesis as the training signal. Applications mentioned in the interview include densifying sparse LiDAR point clouds for self-driving cars, depth-based refocusing of photos, and 3D reconstruction of captured scenes for VR-style exploration.

Key Takeaways

Read the paper 'Unsupervised Learning of Depth and Ego-motion from Video'
Implement the proposed deep learning architecture
Test the model on unlabeled videos
Evaluate the performance of the model
Integrate the model into a larger system

💡 Unsupervised learning can be used to learn depth and pose information from unlabeled videos, enabling various applications in computer vision and robotics.
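
The multi-scale prediction idea from the interview (computing the loss at the original resolution and at smaller scales so gradients come from a larger pixel neighborhood) can be sketched as follows. This is a simplified illustration, not the paper's implementation: the helper names are hypothetical, the downsampling uses 2x2 average pooling as an assumed choice, and the per-scale synthesized views are taken as given inputs.

```python
import numpy as np

def downsample2x(image):
    """Halve the height and width of an (H, W, C) image by 2x2 averaging."""
    H, W, C = image.shape
    H2, W2 = H // 2, W // 2
    cropped = image[: H2 * 2, : W2 * 2]
    return cropped.reshape(H2, 2, W2, 2, C).mean(axis=(1, 3))

def multiscale_photometric_loss(target, synthesized_per_scale):
    """Sum the photometric error over a pyramid of scales.

    target: (H, W, C) target frame at full resolution.
    synthesized_per_scale: list of synthesized views, one per scale,
        each half the size of the previous one (full resolution first).
    """
    total = 0.0
    for scale, synthesized in enumerate(synthesized_per_scale):
        if scale > 0:
            # Shrink the target to match the next, coarser prediction.
            target = downsample2x(target)
        total += np.abs(target - synthesized).mean()
    return total
```

At coarser scales each pixel summarizes a larger region of the original image, which is one way to picture why an error there can influence predictions over a wider neighborhood.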
