Video Classification with Deep Learning
Key Takeaways
This video explains how to use deep convolutional neural networks for video classification, covering spatiotemporal CNNs, multi-resolution models, and data augmentation.
Full Transcript
[Music] This video will explain video classification with convolutional neural networks. The overview of the presentation is as follows. First, we'll talk about video data: what makes collecting video data so challenging, and why each individual instance is so large in file size. Then we'll talk about the spatiotemporal CNNs, pictured here, that are presented in the paper. Then we'll talk about multi-resolution models and how the authors achieve a four times speedup by using multi-resolution streams. We'll look at how data augmentation is used in video classification, dataset noise in the video datasets, and then the overall spatiotemporal multi-resolution model results.

In video classification, assembling big datasets is very difficult. This isn't just due to labeling problems but also, mainly, due to the storage size. A video, compared to an image, is a stack of frames. If you have a 200 by 200 resolution with RGB, you have 200 by 200 by 3 values per frame, and each value needs 8 bits to store the 0 to 255 range. A video is a stack of these frames, so for one second of video at 30 fps you have 30 images.

The datasets they use are Sports-1M and UCF-101. Sports-1M contains 1 million YouTube videos in 487 classes, and the UCF-101 dataset contains about 13,000 videos in 101 classes. They preprocess the video data by cropping the clips to a fixed length, and this is problematic because in sequence learning you want to be able to deal with variable-length sequences. For example, you want to be able to classify a 30-second video with the same ability as you classify a 45-second video.

This is one of the key ideas in the paper: the spatiotemporal CNN. What they're going to try to do is take advantage of convolutional neural networks across different time scales. The first model is a single-frame model, where they just use a CNN to extract image features from a single frame in the video. The late fusion model has very wide spacing between the frames used to aggregate features; in their experiments there are 15 frames in between the two frames used in late fusion. The early fusion model collects a contiguous chunk of frames and processes it similarly to the single-frame model. The slow fusion model takes these overlapping patches, processes them in separate towers, and then combines them later on.

The multi-resolution model is used to reduce computation, and this is something that Elon Musk was talking about with Lex Fridman in their interview. What they do is have a context stream and a fovea stream: the context stream operates on the downsampled overall clip, whereas the fovea stream focuses on the high-resolution center crop. They choose a center crop due to camera bias, since the object of interest tends to sit near the middle of the frame. This is an overall diagram showing how the multi-resolution model works: on the top is the center crop from the original high-resolution image, and on the bottom is the downsized original image.

For data augmentation, they resize all video clips to 200 by 200 pixels, randomly sample a 170 by 170 region, and then with 50% probability horizontally flip the images. Additionally, they preprocess the images by subtracting the mean from all pixels.

One other interesting detail is the hierarchical output space of the Sports-1M dataset. Hierarchical output spaces are used in word2vec and in the famous paper on dermatologist-level skin cancer classification. The model doesn't directly predict the classes as if all the classes were unrelated. Instead, it predicts a traversal along the tree, and it's penalized based on the leaf and on the hierarchical nodes as well. So if it makes it to the bowling node, even if it gets the type of bowling incorrect, it'll receive less of a loss than if it had gone to one of the football parent nodes.

There is some noise in the dataset as well. The way the Sports-1M dataset is constructed is that it's annotated based on a tag prediction algorithm or the uploader-provided description. In addition, there's a lot of variation within the frames. The example they give is a video labeled as soccer: it might have some clips of people playing soccer and then some of the scoreboard, or bleachers, or sky, or something like that.

They train this model over a period of one month, and this is where they note the multi-resolution architecture speedup: they would be able to do five clips per second with the full-frame network, but they achieve 20 clips per second with the multi-resolution network. During this training they see approximately 500 million examples.

These are the results that the different models achieve on the Sports-1M dataset. As shown here, the CNN is able to outperform the feature histograms. However, the different spatiotemporal models don't really seem too different from each other. These are some classes where spatiotemporal models outperform single-frame models, and it's kind of interesting to see because it maybe suggests that the inter-class variance is highly influential on whether or not the spatiotemporal features are useful.

One other interesting thing they do is test the effect of transfer learning from Sports-1M to UCF-101: they fine-tune on the 101 classes from the UCF dataset. This is really interesting, and they achieve really good results doing this.

In conclusion, the CNN is able to outperform the visual bag-of-words features for video classification. The multi-resolution model, probably the most interesting component of this paper, saves four times the computation cost. They find success with transfer learning from Sports-1M to UCF-101. But the spatiotemporal designs aren't very different from one another: they generally outperform the bag-of-words features, but the difference between late fusion, early fusion, and slow fusion doesn't seem to be very big.

Thanks for watching this video on video classification. The paper link is provided in the description. Please subscribe to Henry AI Labs for more deep learning videos. [Music]
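The storage arithmetic, the augmentation recipe, and the two-stream input split described in the transcript can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the function names, the `(T, H, W, C)` clip layout, the 2x2 average-pool downsampling, the half-size fovea crop, and the use of the clip's own mean are all assumptions made for the example.

```python
import numpy as np

def raw_video_bytes(height, width, channels, fps, seconds):
    """Uncompressed size in bytes, assuming one byte (8 bits) per channel value."""
    return height * width * channels * fps * seconds

def center_crop(frames, size):
    """Crop the central size x size window from a clip shaped (T, H, W, C)."""
    _, h, w, _ = frames.shape
    top, left = (h - size) // 2, (w - size) // 2
    return frames[:, top:top + size, left:left + size, :]

def downsample_2x(frames):
    """Halve spatial resolution by averaging each 2x2 block (assumes even H and W)."""
    t, h, w, c = frames.shape
    return frames.reshape(t, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def multires_streams(frames):
    """Return (fovea, context) for a square clip.

    fovea:   center crop at the original resolution (hypothetical half-size crop).
    context: the whole frame downsampled by 2 in each spatial dimension.
    """
    half = frames.shape[1] // 2
    return center_crop(frames, half), downsample_2x(frames)

def augment(frames, rng, crop=170, mean=0.0):
    """Random crop, 50% horizontal flip, and mean subtraction on a (T, H, W, C) clip."""
    _, h, w, _ = frames.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = frames[:, top:top + crop, left:left + crop, :].astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # flip along the width axis
    return out - mean

# Demo on a synthetic 10-frame clip already resized to 200 x 200.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(10, 200, 200, 3), dtype=np.uint8)
aug = augment(clip, rng, crop=170, mean=float(clip.mean()))
fovea, context = multires_streams(aug)

print(raw_video_bytes(200, 200, 3, 30, 1))  # 3600000 bytes per second of raw video
print(aug.shape, fovea.shape, context.shape)
```

With a 170 by 170 input, each stream in this sketch receives an 85 by 85 clip, so each stream sees a quarter of the pixels of the full-resolution frame. The same crop offsets and flip decision are applied to every frame of a clip so that the frames stay spatially aligned.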
Original Description
This video will explain how to use Deep Convolutional Neural Networks to classify Videos.
Thanks for watching, please subscribe for more videos on Deep Learning!
Paper Link: Large-scale Video Classification with Convolutional Neural Networks:
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf
Playlist
Uploads from Connor Shorten · Connor Shorten · 22 of 60