Video Classification with Deep Learning

Connor Shorten · Advanced ·🧬 Deep Learning ·7y ago

Key Takeaways

This video explains how to use Deep Convolutional Neural Networks for video classification, covering spatial temporal CNNs, multi-resolution models, and data augmentation.

Full Transcript

[Music] This video will explain video classification with convolutional neural networks. The overview of the presentation is as follows. First, we'll talk about video data: what makes collecting video data so challenging, and why each individual instance has such a large file size. Then we'll talk about the spatial-temporal CNNs, pictured here, that are presented in the paper. Then we'll talk about multi-resolution models and how the authors achieved a four times speedup by using multi-resolution streams. We'll also look at how data augmentation is used in video classification, dataset noise in the video datasets, and then the overall spatial-temporal, multi-resolution model results.

In video classification, assembling big datasets is very difficult. This isn't just due to labeling problems but mainly due to the storage size. A video, compared to an image, is a stack of frames. If you have a 200 by 200 resolution with RGB, you have 200 by 200 by 3 values in each frame, and each value needs 8 bits to store the 0 to 255 range. But a video is a stack of these frames, so for one second of video you have 30 images at 30 fps.

The datasets that they use are Sports-1M and UCF-101. Sports-1M contains 1 million YouTube videos in 487 classes, and the UCF-101 dataset contains about 13,000 videos in 101 classes. They preprocess the video data by cropping the videos to a fixed length, and this is problematic because in sequence learning you want to be able to deal with variable-length sequences. For example, you want to be able to classify a 30-second video with the same ability as you classify a 45-second video.

This is one of the key ideas in the paper: the spatial-temporal CNN. What they try to do is take advantage of CNNs, convolutional neural networks, across different time scales. The first model is a single-frame model, where they just use a CNN to extract image features from a single frame in the video. The late fusion model has very wide spaces between
frames used to aggregate features; in their experiments they use 15 frames in between the two frames used in late fusion. Early fusion collects a contiguous chunk of frames and processes it similarly to the single-frame model. The slow fusion model takes these interesting overlapping patches, processes them in separate towers, and then combines them later on.

The multi-resolution model is used to reduce computation, and this is something that Elon Musk was talking about with Lex Fridman in their interview. What they do is they have a context stream and a fovea stream. The context stream operates on the downsampled overall clip, whereas the fovea stream focuses on the high-resolution center crop. They choose a center crop due to camera bias. This is an overall diagram showing how the multi-resolution model works: on the top is the center crop from the original high-resolution image, and on the bottom is the downsized original image.

For data augmentation, they resize all video clips to 200 by 200 pixels, randomly sample a 170 by 170 region, and then with 50% probability horizontally flip the images. Additionally, they preprocess the images by subtracting the mean from all pixels.

One other interesting detail is the hierarchical output space of the Sports-1M dataset. Hierarchical output spaces are used in word2vec and in the famous paper on dermatologist-level skin cancer classification. The model doesn't directly predict the classes as if all the classes were unrelated. Instead, it predicts a traversal along the tree, and then it's criticized based on the leaf and the hierarchical nodes as well. So if it makes it to the bowling node, even if it gets the type of bowling incorrect, it'll receive less of a loss than if it had gone to one of the football parent nodes.

There is some noise in the dataset as well. The way the Sports-1M dataset is constructed
is that it's annotated based on a tag prediction algorithm or the uploader-provided description. In addition to this, there's a lot of variation within the frames. What they give as an example is that if a video is labeled as soccer, it might have some clips of people playing soccer and then some of the scoreboard, or bleachers, or sky, or something like that.

They train this model over a period of one month, and this is where they note the multi-resolution architecture speedup: they would be able to do 5 clips per second with the full-frame net, but they achieve 20 clips per second with the multi-resolution network. During this training, they see approximately 500 million examples throughout the training period.

These are the results that the different models achieve on the Sports-1M dataset. As shown here, the CNN is able to outperform the feature histograms. However, the different spatial-temporal models don't really seem too different from each other. These are some classes where spatial-temporal models outperform single-frame models, and it's kind of interesting to see because it maybe suggests that the inter-class variance is highly influential on whether or not the spatial-temporal features are useful.

One other interesting thing they do is test the effect of transfer learning from Sports-1M to UCF-101: they fine-tune on the 101 classes from the UCF dataset. This is really interesting, and they achieve really good results doing this.

In conclusion, the CNN is able to outperform the visual bag-of-words features for video classification. The multi-resolution model, probably the most interesting component of this paper, saves four times the computation cost. They find success with transfer learning from Sports-1M to UCF-101. But the spatial-temporal designs aren't very different from one another: they generally outperform the bag-of-words features, but the difference between late
fusion, early fusion, and slow fusion doesn't seem to be very big. Thanks for watching this video on video classification. The paper link is provided in the description. Please subscribe to Henry AI Labs for more deep learning videos. [Music]
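
The augmentation steps described in the transcript (a random 170 by 170 crop from a 200 by 200 frame, a horizontal flip with 50% probability, and mean subtraction) could be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code: the function names are hypothetical, frames are assumed to be single-channel nested lists already resized to 200 by 200, and applying the same crop and flip to every frame of a clip is an assumption the transcript does not confirm.

```python
import random


def random_crop_and_flip(clip, crop=170, rng=random):
    """Crop a random square region and maybe mirror it, for a whole clip.

    clip: list of frames; each frame is a list of rows; each row is a
    list of pixel values. One crop position and one flip decision are
    drawn per clip and reused for every frame (an assumption).
    """
    height, width = len(clip[0]), len(clip[0][0])
    top = rng.randint(0, height - crop)    # inclusive on both ends
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5              # 50% chance of a horizontal flip
    augmented = []
    for frame in clip:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]
        augmented.append(rows)
    return augmented


def subtract_mean(clip, mean):
    """Subtract a caller-supplied mean value from every pixel."""
    return [[[p - mean for p in row] for row in frame] for frame in clip]


# Example: a toy 3-frame clip of 200 by 200 frames filled with a gradient.
clip = [[[x + y for x in range(200)] for y in range(200)] for _ in range(3)]
cropped = random_crop_and_flip(clip, crop=170, rng=random.Random(0))
centered = subtract_mean(cropped, mean=100)
print(len(centered), len(centered[0]), len(centered[0][0]))  # 3 170 170
```

Passing a seeded `random.Random` instance keeps the crop reproducible, which can help when debugging an input pipeline.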

Original Description

This video will explain how to use Deep Convolutional Neural Networks to classify Videos. Thanks for watching, please subscribe for more videos on Deep Learning! Paper Link: Large-scale Video Classification with Convolutional Neural Networks: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf

This video teaches how to use Deep Convolutional Neural Networks for video classification, covering key concepts such as spatial temporal CNNs, multi-resolution models, and data augmentation. The video also discusses the challenges of collecting and processing video data and how to overcome them using techniques such as transfer learning.
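
To make the storage challenge concrete, here is a back-of-the-envelope calculation for raw, uncompressed video using the figures mentioned in the transcript (200 by 200 RGB frames at 30 fps). It assumes one byte per color value, which is enough for the 0 to 255 range. The function name is hypothetical, and real video files are usually much smaller because they are stored with compression.

```python
def raw_video_bytes(width, height, channels, fps, seconds, bytes_per_value=1):
    """Size in bytes of an uncompressed clip stored as a stack of frames."""
    values_per_frame = width * height * channels
    return values_per_frame * bytes_per_value * fps * seconds


single_frame = raw_video_bytes(200, 200, 3, fps=1, seconds=1)
one_second = raw_video_bytes(200, 200, 3, fps=30, seconds=1)
thirty_seconds = raw_video_bytes(200, 200, 3, fps=30, seconds=30)

print(single_frame)    # 120000 bytes for one RGB frame
print(one_second)      # 3600000 bytes for one second at 30 fps
print(thirty_seconds)  # 108000000 bytes for a 30-second clip
```

Even at this small resolution, one second of raw video takes 30 times the space of a single image, which is why large video datasets are so much harder to store than image datasets.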

Key Takeaways
  1. Collect and preprocess video data
  2. Design and train a deep learning model using Convolutional Neural Networks
  3. Apply spatial temporal CNNs and multi-resolution models for efficient computation
  4. Use data augmentation to improve model performance
  5. Evaluate model performance using metrics such as accuracy and loss
💡 The multi-resolution model speeds up training roughly fourfold (from about 5 to 20 clips per second, as reported in the video), making it a valuable technique for efficient video classification.
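
One way to picture the two inputs of the multi-resolution model is sketched below. The video only says that the context stream sees a downsampled version of the whole frame and the fovea stream sees a high-resolution center crop, so the factor of two, the keep-every-other-pixel downsampling, and the function names are all illustrative assumptions rather than details confirmed by the source.

```python
def context_stream(frame):
    """Whole frame at half resolution, keeping every other pixel."""
    return [row[::2] for row in frame[::2]]


def fovea_stream(frame):
    """Full-resolution center crop, half the height and half the width."""
    height, width = len(frame), len(frame[0])
    top, left = height // 4, width // 4
    return [row[left:left + width // 2] for row in frame[top:top + height // 2]]


# Example on a toy 8 by 8 single-channel frame.
frame = [[y * 8 + x for x in range(8)] for y in range(8)]
context = context_stream(frame)
fovea = fovea_stream(frame)
print(len(context), len(context[0]))  # 4 4
print(len(fovea), len(fovea[0]))      # 4 4
```

Under these assumptions each stream carries a quarter of the original pixels, so the two streams together process half as many input values as a single full-frame network would.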
