The Function That Changed Everything
Key Takeaways
The video discusses the history and development of deep learning, highlighting the importance of activation functions, particularly the Rectified Linear Unit (ReLU), in enabling deep neural networks to solve complex problems. It covers the limitations of earlier activation functions, such as sigmoid and tanh, and how ReLU overcame the vanishing gradient problem.
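The vanishing gradient point in this summary can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the video: the helper names (`sigmoid_grad`, `tanh_grad`, `gradient_product`) are hypothetical, and raising one per-layer gradient to the power of the depth is a deliberate simplification of backpropagation that ignores the weights.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh; its maximum is 1, reached only at x = 0."""
    return 1.0 - math.tanh(x) ** 2


def gradient_product(grad_fn, x, depth):
    """Multiply the same per-layer gradient `depth` times.

    Assumption: a crude stand-in for the chain rule through `depth`
    layers, with every layer seeing the same input and no weights.
    """
    return grad_fn(x) ** depth


# Even in sigmoid's best case (x = 0) the product shrinks quickly with depth.
for depth in (2, 6, 20):
    print(depth, gradient_product(sigmoid_grad, 0.0, depth))

# Away from x = 0, tanh's gradient is already well below 1.
print(tanh_grad(2.0))
```

Under these assumptions, six layers of sigmoid at its best-case gradient already scale the update by 0.25 to the sixth power, which is about 0.00024.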
Full Transcript
[Music]

I dare you to think of anything moving faster than deep learning. But what if I tell you that behind all of it, the complexity, the hardware, the data, behind all of that, there is a simple piece that took us years to find? Let me tell you the story about the unreasonable effectiveness of the function that made deep learning possible.

"The thinking machine." It's 1958, and The New York Times publishes an article about a device that will be able to walk, talk, see, and write. They just interviewed the scientist that's about to change the world. His name is Frank Rosenblatt, an American psychologist who published this report one year before, in 1957. Here Rosenblatt proposes the construction of what many consider the first predecessor of the neural networks we use today: the perceptron.

"Scientists built a few working perceptrons, as these artificial brains were called."

April 1957. This is a receipt from the lab acknowledging the report, and here is what it says about the project: "designing, fabricating, and evaluating an electronic brain model."

We've been trying to figure out neural networks for a long time. Fast forward 50 years, to the mid to late 2000s, and we still couldn't train a neural network with more than a couple of hidden layers. But to understand what's missing, we first need to talk about something we call activation functions.

Regardless of how big neural networks are, by default they can only solve linear problems. Unfortunately, most things that matter are more complex than that. Here is one example: imagine you have two classes, orange and blue, and you want to draw a function that separates them. A neural network cannot solve this problem unless we use non-linear activation functions. Number of layers, number of neurons, none of that matters. Here is what Wikipedia says about activation functions: "only non-linear activation functions allow such networks to compute non-trivial problems." We need these activations to create some sort of bump, a disturbance that will allow networks to solve all sorts
of problems like this one here.

But we've known about activation functions for a long time, and that wasn't enough. Something was not right with these functions. Sigmoid and tanh were by far the two most popular activation functions back then. Look at the blue lines and don't worry about the red lines for now. These functions checked every single box we needed to train neural networks. Well, almost every single box. To tackle more complex problems like image recognition, text generation, and audio translation, we needed deeper networks, but as soon as we tried with more than a few layers, neural networks wouldn't work at all.

Let me show you this. This here is the TensorFlow Playground. Anytime I want to play with neural networks I come here, because I don't have to write any code and it's really easy for me to try any of my wacky theories. This is configured to solve the same problem I showed you before. I'm gonna add a few more hidden layers to this network, I'm gonna change the activation to sigmoid, and then I'm gonna click play. I did this before, and I let it run for a long time before I stopped it: over 5,000 iterations, and the network could not solve the problem.

But wait, that's not necessarily an issue, right? Maybe sigmoid is not good enough to solve this particular problem. Except I ran the same experiment, but instead of using six hidden layers I used just two, and the network solved it. This was the thing preventing deep learning from becoming something: we couldn't train deeper networks because they wouldn't work.

"My feeling is, if you want to understand a really complicated device like a brain, you should build one."

That was Geoffrey Hinton's voice. He played a central role in making deep learning a reality. But to appreciate what happened, we first need to understand why these activation functions didn't work with deep networks.

Time to look at the red lines now. These are the derivatives of the functions. We use these gradients during backpropagation to update the weights of
the network. The deeper the network, the more iterations we need. I'm not gonna get too deep into the math here, but if the gradients are smaller than one and you multiply a bunch of them, the result will get smaller and smaller. Look at the gradients of these two functions. The maximum possible value of sigmoid's gradient is 0.25. That's really small. And for tanh it's one, but that only happens at this particular point; the gradient is very small everywhere else. And that right there is the problem: the deeper the network, the smaller the updates get, until they're so small that the network dies. We call this the vanishing gradient problem, and that's why these functions did not work.

"We should look at biology, and we should try and make systems that work roughly like the brain."

OK, it's 2010, and a paper comes out that proposes an idea so simple that it looks ridiculous. They show how a function they call the rectified linear unit solves the problem they had with the other activation functions. Here is what the paper says: "Rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors." This was the function. That's it. This was the crucial missing piece.

Nair and Hinton wrote the paper, and although they made this function popular, I found references to it from decades before, like in this paper from 1975.
Fukushima doesn't give this function a name, but that's the rectified linear unit in the context of neural networks.

But here is the most surprising part: this function that works so well doesn't even meet one of the most basic requirements of an activation function. This function is not differentiable. So how come the simplest function, one that doesn't even meet the requirements, is the one that makes everything work?

Let's look at the laptop. I have a very simple notebook here to plot the rectified linear unit. For short, we call it ReLU. If I run this cell, we get the chart. Here is the plot of the ReLU function. The x-axis is the input to the function, while the y-axis is the output of it. Notice how ReLU returns 0 for any negative input and doesn't touch positive values. At the point where x equals zero, we cannot compute the derivative of the function, and that's a big problem in theory. It turns out that in practice we can return a specific value for that particular point and everything works fine. Here I'm returning zero, and if I plot the derivative, it is the red line on the chart.

One final thing for you to notice on this plot: the gradient is either zero for any negative values or one for positive values. That means that ReLU doesn't suffer from the vanishing gradient issue. The signal propagating through the network will never disappear. That's a major win, and the main reason this simple function works so well.

By the way, the gradient is only one of ReLU's advantages. The function is very simple, so it's really fast to compute, and it doesn't saturate like sigmoid and tanh do. It just works exceptionally well, and thanks to it we have deep learning today. And by the way, when I think about this story and how well ReLU works despite its simplicity, I can't help but wonder: what else are we missing today?
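The notebook shown in the video is not reproduced in this transcript. As a rough illustration of what it describes, here is a minimal sketch of ReLU and its derivative in plain Python. The function names and the `at_zero` parameter are hypothetical; the choice to return 0 at x = 0 follows the convention mentioned in the transcript, and the sketch prints values instead of plotting them.

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return max(0.0, x)


def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU.

    The derivative is undefined at x == 0, so a value is chosen by
    convention. `at_zero` is a hypothetical parameter for that choice;
    0.0 matches what the transcript describes.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


# Tabulate the function and its gradient on a few sample inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")

# Along a path where every unit is active (x > 0), multiplying the
# per-layer gradients leaves the signal unchanged, whatever the depth.
depth = 20
product = 1.0
for _ in range(depth):
    product *= relu_grad(1.0)
print(product)
```

Keeping the value at zero as a parameter makes the convention explicit: any fixed choice works in practice because an input of exactly zero is rare.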
Original Description
This is a story about the unreasonable effectiveness of the function that made deep learning possible.
Citations: https://gist.github.com/svpino/8c34ecb612f9f66c13f7542a9e5043cc
🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1
📚 My 3 favorite Machine Learning books:
• Deep Learning With Python, Second Edition — https://amzn.to/3xA3bVI
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — https://amzn.to/3BOX3LP
• Machine Learning with PyTorch and Scikit-Learn — https://amzn.to/3f7dAC8
Twitter: https://twitter.com/svpino
Disclaimer: Some of the links included in this description are affiliate links where I'll earn a small commission if you purchase something. There's no cost to you.