Video Classification with Deep Learning

Connor Shorten · Advanced ·🧬 Deep Learning ·7y ago

Skills: Modern CV Models 90% · ML Pipelines 70% · Supervised Learning 60%

Key Takeaways

This video explains how to use Deep Convolutional Neural Networks for video classification, covering spatial temporal CNNs, multi-resolution models, and data augmentation.

Full Transcript

[Music]

This video will explain video classification with convolutional neural networks. The overview of the presentation is as follows. First we'll talk about video data: what makes collecting video data so challenging, and why each individual instance is so large in file size. Then we'll talk about the spatial-temporal CNNs, pictured here, that are presented in the paper. Then we'll talk about multi-resolution models and how the authors achieved a four times speedup by using multi-resolution streams. We'll look at how data augmentation is used in video classification, at dataset noise in the video datasets, and then at the overall results of the spatial-temporal multi-resolution model.

In video classification, collecting big datasets is very difficult. This isn't just due to labeling problems but mainly due to the storage size. A video, compared to an image, is a stack of frames. If you have a 200 by 200 resolution with RGB, you have 200 by 200 by 3 values in each frame, and each value needs 8 bits (one byte) to store the 0 to 255 range. But a video is a stack of these frames, so for one second of video you have 30 images at 30 fps.

The datasets that they use are Sports-1M and UCF-101. Sports-1M contains 1 million YouTube videos in 487 classes, and the UCF-101 dataset contains about 13,000 videos in 101 classes. They preprocess the video data by cropping the videos to a fixed length, and this is problematic because in sequence learning you want to be able to deal with variable-length sequences. For example, you want to be able to classify a 30-second video with the same ability as you classify a 45-second video.

This is one of the key ideas in the paper: the spatial-temporal CNN. What they're going to try to do is take advantage of CNNs, convolutional neural networks, across different time scales. The first model is a single-frame model, where they just use a CNN to extract image features from a single frame in the video. Late fusion has very wide spacing between the frames used to aggregate features; in their experiments there are 15 frames between the two frames used in late fusion. Early fusion collects a contiguous chunk of frames and processes it similarly to the single-frame model. The slow fusion model takes overlapping patches of frames, processes them separately, and then combines them later on.

The multi-resolution model is used to reduce computation, and this is something that Elon Musk was talking about with Lex Fridman in their interview. What they do is they have a context stream and a fovea stream. The context stream operates on the downsampled overall clip, whereas the fovea stream focuses on the high-resolution center crop. They choose a center crop due to camera bias. This is an overall diagram showing how the multi-resolution model works: on the top is the center crop from the original high-resolution image, and on the bottom is the downsized original image.

For data augmentation, they resize all video clips to 200 by 200 pixels, randomly sample a 170 by 170 region, and then with 50% probability horizontally flip the images. Additionally, they preprocess the images by subtracting the mean from all pixels.

One other interesting detail is the hierarchical output space of the Sports-1M dataset. Hierarchical output spaces are used in word2vec and in the famous paper on dermatologist-level skin cancer classification. The model doesn't predict the classes as if they were all unrelated. Instead, it predicts a traversal along the tree, and it's penalized based on the leaf and on the hierarchical nodes as well. So if it makes it to the bowling node, even if it gets the type of bowling wrong, it'll receive less of a loss than if it had gone to one of the football parent nodes.

There is some noise in the dataset as well. The way the Sports-1M dataset is constructed is that it's annotated based on a tag prediction algorithm or the uploader-provided description. In addition to this, there's a lot of variation within the frames. The example they give is that a video labeled as soccer might have some clips of people playing soccer and then some of the scoreboard, the bleachers, the sky, or something like that.

They train this model over a period of one month, and this is where they note the multi-resolution architecture speedup: they would be able to do 5 clips per second with the full-frame network, but they achieve 20 clips per second with the multi-resolution network. During this training they see approximately 500 million examples.

These are the results that the different models achieve on the Sports-1M dataset. As shown here, the CNN is able to outperform the feature histograms; however, the different spatial-temporal models don't seem too different from each other. These are some classes where spatial-temporal models outperform single-frame models, and it's kind of interesting to see, because it maybe suggests that the inter-class variance is highly influential on whether the spatial features or the temporal features are useful.

One other interesting thing they do is test the effect of transfer learning from Sports-1M to UCF-101: they fine-tune on the 101 classes from the UCF dataset. This is really interesting, and they achieve really good results doing this.

In conclusion, the CNN is able to outperform the visual bag-of-words features for video classification. The multi-resolution model, probably the most interesting component of this paper, cuts the computation cost by four times. They find success with transfer learning from Sports-1M to UCF-101. The spatial-temporal designs, though, aren't very different from one another: they generally outperform the bag-of-words features, but the difference between late fusion, early fusion, and slow fusion doesn't seem to be very big.

Thanks for watching this video on video classification. The paper link is provided in the description. Please subscribe to Henry AI Labs for more deep learning videos. [Music]
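
The four temporal designs described in the transcript differ mainly in which frames each network sees. The sketch below is only an illustration of that idea, not the authors' code: it lists frame indices for each design. The 15-frame gap for late fusion comes from the video; the clip length of 10 frames and the slow-fusion window of 4 frames with stride 2 are assumptions picked for the example, and the function name `fusion_frame_indices` is hypothetical.

```python
def fusion_frame_indices(start=0, clip_len=10, late_gap=15, window=4, stride=2):
    """Return the frame indices each fusion design would read.

    clip_len, window and stride are illustrative assumptions;
    late_gap=15 follows the spacing mentioned in the video.
    """
    # Single frame: one image, no temporal information.
    single = [start]
    # Late fusion: two widely spaced frames, merged late in the network.
    late = [start, start + late_gap]
    # Early fusion: one contiguous chunk of frames fed in together.
    early = list(range(start, start + clip_len))
    # Slow fusion: overlapping windows processed separately, then combined.
    last_start = start + clip_len - window
    slow = [list(range(s, s + window))
            for s in range(start, last_start + 1, stride)]
    return {"single": single, "late": late, "early": early, "slow": slow}


idx = fusion_frame_indices()
for name, frames in idx.items():
    print(name, frames)
```

With these assumed settings the slow-fusion design reads four windows that each share two frames with their neighbor, which is one way to picture the "overlapping" behavior the speaker mentions.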

Original Description

This video will explain how to use Deep Convolutional Neural Networks to classify Videos. Thanks for watching, please subscribe for more videos on Deep Learning! Paper Link: Large-scale Video Classification with Convolutional Neural Networks: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf

This video teaches how to use Deep Convolutional Neural Networks for video classification, covering key concepts such as spatial temporal CNNs, multi-resolution models, and data augmentation. The video also discusses the challenges of collecting and processing video data and how to overcome them using techniques such as transfer learning.
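
To make the storage challenge concrete, here is a small back-of-the-envelope calculation using the numbers from the video (200 by 200 RGB frames at 30 fps). It assumes raw, uncompressed storage at one byte per channel value; real video files are compressed, so this is an upper bound for illustration only. The helper name `raw_video_bytes` is hypothetical.

```python
def raw_video_bytes(width, height, channels=3, fps=30, seconds=1.0,
                    bytes_per_value=1):
    """Uncompressed size of a clip, assuming one byte per 0-255 value."""
    frames = int(fps * seconds)
    return width * height * channels * bytes_per_value * frames


# One 200x200 RGB frame versus one second of video at 30 fps.
one_frame = raw_video_bytes(200, 200, fps=1, seconds=1)
one_second = raw_video_bytes(200, 200, fps=30, seconds=1)
print(one_frame)   # 120000
print(one_second)  # 3600000
```

Even at this small resolution, a single second of raw video is 30 times the size of one image, which is why large video datasets are expensive to store and process.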

Key Takeaways

Collect and preprocess video data
Design and train a deep learning model using Convolutional Neural Networks
Apply spatial temporal CNNs and multi-resolution models for efficient computation
Use data augmentation to improve model performance
Evaluate model performance using metrics such as accuracy and loss
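
The augmentation step from the list above can be sketched with NumPy, following the recipe described in the video: take a clip already resized to 200 by 200, randomly sample a 170 by 170 region, flip horizontally with 50% probability, and subtract a mean. This is a minimal sketch under stated assumptions, not the authors' pipeline: applying the same crop and flip to every frame of a clip is an assumption, the per-clip mean is a stand-in for a dataset-wide mean, and `augment_clip` is a hypothetical name.

```python
import numpy as np


def augment_clip(clip, crop=170, mean=None, rng=None):
    """Randomly crop, maybe flip, and mean-center a video clip.

    clip: uint8 array of shape (frames, height, width, channels),
          assumed to be already resized to 200x200.
    mean: value to subtract; if None, the clip's own mean is used
          (an assumption standing in for a dataset-wide mean).
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w, _ = clip.shape
    # Pick one crop position and reuse it for every frame in the clip.
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    out = clip[:, top:top + crop, left:left + crop, :].astype(np.float32)
    # Horizontal flip with 50% probability (reverse the width axis).
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]
    mean = out.mean() if mean is None else mean
    return out - mean


rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(4, 200, 200, 3), dtype=np.uint8)
out = augment_clip(clip, rng=rng)
print(out.shape, out.dtype)
```

Passing a seeded `rng` keeps the example reproducible; in training you would normally draw a fresh crop and flip for each clip.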

💡 The multi-resolution model can reduce computation cost by four times, making it a valuable technique for efficient video classification.
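
A rough way to see where the savings come from is to build the two inputs the multi-resolution model uses: a context stream (the whole frame, downsampled) and a fovea stream (the center crop at full resolution). The sketch below only prepares those inputs; it is not the network itself. The 178-pixel frame size, the use of 2x2 average pooling as the downsampler, and the name `multires_streams` are assumptions chosen for illustration.

```python
import numpy as np


def multires_streams(clip):
    """Split a clip into context and fovea inputs at half the side length.

    clip: array of shape (frames, height, width, channels) with even
          height and width.
    """
    f, h, w, c = clip.shape
    x = clip.astype(np.float32)
    # Context stream: the full frame, downsampled by 2x2 average pooling
    # (an assumed downsampler; any resize method could be used instead).
    context = x.reshape(f, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))
    # Fovea stream: the center region at the original resolution.
    top, left = h // 4, w // 4
    fovea = x[:, top:top + h // 2, left:left + w // 2, :]
    return context, fovea


clip = np.zeros((2, 178, 178, 3), dtype=np.uint8)
context, fovea = multires_streams(clip)
print(context.shape, fovea.shape)
```

Each stream here holds a quarter of the original pixels, so the two together carry half of the full-frame input. The fourfold speedup quoted in the video is the reported end-to-end figure, which this input-size sketch does not reproduce or measure.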
