Why Deep Representations? (C1W4L04)
Key Takeaways
The video explains why depth matters in neural networks: deep networks learn complex functions by composing simpler ones layer by layer, for example edges into facial parts into faces for images, or low-level waveform features into phonemes into words for speech. It also covers a result from circuit theory: some functions, such as the parity (XOR) of n input bits, can be computed by a small deep network but need exponentially many hidden units in a shallow one.
Full Transcript
We've all been hearing that deep neural networks work really well for a lot of problems. It's not just that they need to be big neural networks; it's that, specifically, they need to be deep, or to have a lot of hidden layers. So why is that? Let's go through a couple of examples and try to gain some intuition for why deep networks might work well.
So first, what is a deep network computing? If you're building a system for face recognition or face detection, here's what a deep neural network could be doing. Perhaps you input a picture of a face. Then the first layer of the neural network you can think of as maybe being a feature detector or an edge detector. In this example, I'm plotting what a neural network with maybe twenty hidden units might be trying to compute on this image, with the twenty hidden units visualized by these little square boxes. So, for example, this little visualization represents a hidden unit that's trying to figure out where the edges of that orientation are in the image, and maybe this hidden unit is trying to figure out where the horizontal edges are in this image. When we talk about convolutional networks in a later course, this particular visualization will make a bit more sense. But for now, you can think of the first layer of the neural network as looking at the picture and trying to figure out where the edges are in this picture.
Once it has figured out where the edges are, by grouping together pixels to form edges, it can then take the detected edges and group edges together to form parts of faces. So, for example, you might have a neuron trying to see if it's finding an eye, or a different neuron trying to find that part of the nose. By putting together lots of edges, it can start to detect different parts of faces. And then finally, by putting together different parts of faces, like an eye or a nose or an ear or a chin, it can then try to recognize or detect different types of faces. So intuitively, you can think of the earlier layers of a neural network as detecting simpler functions, like edges, and then composing them together in the later layers of the neural network so that it can learn more and more complex functions.
These visualizations will make more sense when we talk about convolutional nets. One technical detail of this visualization: the edge detectors are looking at relatively small areas of an image, maybe very small regions like that, and then the facial detectors can look at maybe much larger areas of the image. But the main intuition to take away from this is just finding simpler things like edges, then building them up, composing them together to detect more complex things like an eye or a nose, and then composing those together to find even more complex things.
This type of simple-to-complex hierarchical representation, or compositional representation, applies to other types of data than images and face recognition as well. For example, if you're trying to build a speech recognition system, it's hard to visualize speech, but if you input an audio clip, then maybe the first level of a neural network might learn to detect low-level audio waveform features, such as: is this tone going up, is it going down, is it white noise or a sibilant sound, and what is the pitch? It can learn to detect low-level waveform features like that. Then, by composing low-level waveform features, maybe it will learn to detect basic units of sound. In linguistics they're called phonemes. For example, in the word "cat", the "cuh" is a phoneme, the "ah" is a phoneme, and the "tuh" is another phoneme. So it learns to find maybe the basic units of sound, and then, composing those together, maybe it learns to recognize words in the audio, and then maybe composes those together in order to recognize entire phrases or sentences.
So a deep neural network with multiple hidden layers might be able to have the earlier layers learn these lower-level, simpler features, and then have the later, deeper layers put together the simpler things it has detected in order to detect more complex things, like recognizing specific words or even the phrases or sentences that you're saying, in order to carry out speech recognition. What we see is that whereas the earlier layers are computing what seem like relatively simple functions of the input, such as where the edges are, by the time you get deep in the network you can actually do surprisingly complex things, such as detect faces or detect words or phrases or sentences.
Some people like to make an analogy between deep neural networks and the human brain, where we believe, or neuroscientists believe, that the human brain also starts off detecting simple things like edges in what your eyes see and then builds those up to detect more complex things like the faces that you see. I think analogies between deep learning and the human brain are sometimes a little bit dangerous, but there is a lot of truth to this being how we think the human brain works: the human brain probably detects simple things like edges first and then puts them together to form more and more complex objects, and so that has served as a loose form of inspiration for some deep learning as well. We'll say a bit more about the human brain, or about the biological brain, in a later video this week.
The other piece of intuition about why deep networks seem to work well is the following. This result comes from circuit theory, which pertains to thinking about what types of functions you can compute with different AND gates, OR gates, and NOT gates, basically logic gates. Informally, there are functions you can compute with a relatively small but deep neural network, and by small I mean the number of hidden units is relatively small, but if you try to compute the same function with a shallow network, so if you aren't allowed enough hidden layers, then you might require exponentially more hidden units to compute it.
Let me just give you one example and illustrate this a bit informally. Let's say you're trying to compute the exclusive or, or the parity, of all your input features. So you're trying to compute x1 XOR x2 XOR x3 XOR ... up to xn, if you have n, or n_x, features. If you build an XOR tree like this, you first compute the XOR of x1 and x2, then take x3 and x4 and compute their XOR. Technically, if you're just using AND, OR, and NOT gates, you might need a couple of layers to compute the XOR function rather than just one layer, but with a relatively small circuit you can compute the XOR, and so on. Then you can build an XOR tree like so, until eventually you have a circuit here that outputs, let's call this y, that outputs y-hat equals y, the exclusive or, the parity, of all of these input bits. So to compute the XOR, the depth of the network will be on the order of log n with this type of XOR tree. The number of nodes, or the number of circuit components, or the number of gates in this network is not that large. You don't need that many gates in order to compute the exclusive or.
But now, if you're not allowed to use a neural network with multiple hidden layers, in this case order log n hidden layers, and you're forced to compute this function with just one hidden layer, so you have all these things going into the hidden units and then these things output y, then in order to compute the parity of x, to compute this XOR function, this hidden layer will need to be exponentially large, because essentially you need to exhaustively enumerate all 2^n possible configurations, so on the order of 2^n possible configurations of the input bits that result in the exclusive or being either one or zero. So you end up needing a hidden layer that is exponentially large in the number of bits. I think technically you could do this with 2^(n-1) hidden units, but that's still on the order of 2^n, so it's going to be exponentially large in the number of bits.
So I hope this gives a sense that there are mathematical functions that are much easier to compute with deep networks than with shallow networks. I have to admit, I personally found the result from circuit theory less useful for gaining intuitions, but this is one of the results that people often cite when explaining the value of having very deep representations.
Now, in addition to these reasons for preferring deep neural networks, to be perfectly honest, I think the other reason the term "deep learning" has taken off is just branding. These things used to be called neural networks with a lot of hidden layers, but the phrase "deep learning" is just a great brand. It just sounds so deep. So I think that once that term caught on, it was really neural networks rebranded, or neural networks with many hidden layers rebranded, and that helped to capture the popular imagination as well.
But regardless of the PR branding, deep networks do work well. Sometimes people go overboard and insist on using tons of hidden layers, but when I'm starting on a new problem, I often really start out with even logistic regression, then try something with one or two hidden layers, and use the number of hidden layers as a parameter or hyperparameter that you tune in order to try to find the right depth for your neural network. But over the last several years, there has been a trend toward people finding that, for some applications, very, very deep neural networks, maybe with many dozens of layers, can sometimes be the best model for a problem.
So that's it for the intuitions for why deep learning seems to work well. Let's now take a look at the mechanics of how to implement not just forward propagation but also back propagation.
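The parity example from the circuit-theory discussion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the lecture: the function names (`xor_tree`, `shallow_parity_units`, `shallow_parity`) are hypothetical, each two-input XOR is counted as a single gate (the lecture notes that it may take a couple of layers of AND/OR/NOT gates), and the one-hidden-layer version is one straightforward construction that memorizes every odd-parity input pattern rather than a proof that no smaller shallow network exists.

```python
from itertools import product


def xor_tree(bits):
    """Compute parity with a balanced tree of two-input XOR gates.

    Returns (parity, depth, gate_count). Assumes at least one input bit.
    """
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        next_level = []
        # Pair up neighbours and XOR each pair (one gate per pair).
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] ^ level[i + 1])
            gates += 1
        # With an odd number of values, the last one passes through unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
        depth += 1
    return level[0], depth, gates


def shallow_parity_units(n):
    """Illustrative single hidden layer: one unit per odd-parity input pattern."""
    return [p for p in product((0, 1), repeat=n) if sum(p) % 2 == 1]


def shallow_parity(bits, units):
    """Each hidden unit fires only on its exact pattern; the output ORs them."""
    bits = tuple(bits)
    return int(any(bits == unit for unit in units))


n = 8
units = shallow_parity_units(n)
_, depth, gates = xor_tree([0] * n)
print(f"deep:    depth={depth}, gates={gates}")      # depth 3, 7 gates for n=8
print(f"shallow: hidden units={len(units)}")         # 2**(n-1) = 128 for n=8
```

For n = 8 the tree needs n - 1 = 7 gates at depth log2(8) = 3, while the pattern-matching hidden layer needs 2^(n-1) = 128 units, and the gap doubles with every extra input bit.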
Original Description
Take the Deep Learning Specialization: http://bit.ly/32Iw01H
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai