The Function That Changed Everything

Underfitted · Beginner · 🧬 Deep Learning · 3y ago

Key Takeaways

The video discusses the history and development of deep learning, highlighting the importance of activation functions, particularly the Rectified Linear Unit (ReLU), in enabling deep neural networks to solve complex problems. It covers the limitations of earlier activation functions, such as sigmoid and tanh, and how ReLU overcame the vanishing gradient problem.

Full Transcript

[Music]

I dare you to think of anything moving faster than deep learning. But what if I tell you that behind all of it, the complexity, the hardware, the data, behind all of that, there is a simple piece that took us years to find? Let me tell you the story about the unreasonable effectiveness of the function that made deep learning possible.

The thinking machine.

It's 1958, and the New York Times publishes an article about a device that will be able to walk, talk, see, and write. They just interviewed the scientist that's about to change the world. His name is Frank Rosenblatt, an American psychologist who published this report one year before, in 1957. Here, Rosenblatt proposes the construction of what many consider the first predecessor of the neural networks we use today: the perceptron.

"Scientists built a few working perceptrons, as these artificial brains were called."

April 1957. This is a receipt from the lab acknowledging the report, and here is what it says about the project: "designing, fabricating, and evaluating an electronic brain model." We've been trying to figure out neural networks for a long time.

Fast forward 50 years, mid to late 2000s, and we still couldn't train a neural network with more than a couple of hidden layers. But to understand what's missing, we first need to talk about something we call activation functions.

Regardless of how big neural networks are, by default they can only solve linear problems. Unfortunately, most things that matter are more complex than that. Here is one example: imagine you have two classes, orange and blue, and you want to draw a function that separates them. A neural network cannot solve this problem unless we use non-linear activation functions. Number of layers, number of neurons, none of that matters. Here is what Wikipedia says about activation functions: "only non-linear activation functions allow such networks to compute non-trivial problems." We need these activations to create some sort of bump, a disturbance that will allow networks to solve all sorts of problems like this one here.

But we've known about activation functions for a long time, and that wasn't enough. Something was not right with these functions. Sigmoid and tanh were by far the two most popular activation functions back then. Look at the blue lines, and don't worry about the red lines for now. These functions checked every single box we needed to train neural networks. Well, almost every single box. To tackle more complex problems like image recognition, text generation, and audio translation, we needed deeper networks, but as soon as we tried with more than a few layers, neural networks wouldn't work at all.

Let me show you this. This here is the TensorFlow Playground. Anytime I want to play with neural networks I come here, because I don't have to write any code and it's really easy for me to try any of my wacky theories. This here is configured to solve the same problem I showed you before. I'm gonna add a few more hidden layers to this network, I'm gonna change the activation to sigmoid, and then I'm gonna click play. I did this before and I let it run for a long time before I stopped it: over 5,000 iterations, and the network could not solve the problem.

But wait, that's not necessarily an issue, right? Maybe sigmoid is not good enough to solve this particular problem. Except I ran the same experiment, but instead of using six hidden layers I used just two, and the network solved it. This was the thing preventing deep learning from becoming something: we couldn't train deeper networks because they wouldn't work.

"My feeling is, if you want to understand a really complicated device like a brain, you should build one."

That was Geoffrey Hinton's voice. He played a central role in making deep learning a reality. But to appreciate what happened, we first need to understand why these activation functions didn't work with deep networks.

Time to look at the red lines now. These are the derivatives of the functions. We use these gradients during backpropagation to update the weights of the network. The deeper the network, the more iterations we need. I'm not gonna get too deep into the math here, but if the gradients are smaller than one and you multiply a bunch of them, the results will get smaller and smaller. Look at the gradients of these two functions. The maximum possible value of sigmoid's gradient is 0.25. That's really small. And for tanh it's one, but that only happens at this particular point; the gradient is very small everywhere else. And that right there is the problem: the deeper the network, the smaller the updates get, until they're so small that the network dies. We call this the vanishing gradient problem, and that's why these functions did not work.

"We should look at biology, and we should try and make systems that work roughly like the brain."

OK, it's 2010, and a paper comes out that proposes an idea so simple that it looks ridiculous. They show how a function they call the Rectified Linear Unit solves the problem they had with the other activation functions. Here is what the paper says: "Rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors." This was the function. That's it. This was the crucial missing piece.

Nair and Hinton wrote the paper, and although they made this function popular, I found references to it from decades before, like in this paper from 1975. Fukushima doesn't give this function a name, but that's the rectified linear unit in the context of neural networks.

But here is the most surprising part: this function that works so well doesn't even meet one of the most basic requirements of an activation function. This function is not differentiable. So how come the simplest function, one that doesn't even meet the requirements, is the one that makes everything work?

Let's look at the laptop. I have a very simple notebook here to plot the Rectified Linear Unit. For short, we call it ReLU. If I run this cell, we get the chart. Here is the plot of the ReLU function. The x-axis is the input to the function, while the y-axis is the output of it. Notice how ReLU returns 0 for any negative input and it doesn't touch positive values. At the point where x equals zero, we cannot compute the derivative of the function, and that's a big problem in theory. It turns out that in practice we can return a specific value for that particular point and everything works fine. Here I'm returning zero, and if I plot the derivative, it is the red line on the chart.

One final thing for you to notice on this plot: the gradient is either zero for any negative values or one for positive values. That means that ReLU doesn't suffer from the vanishing gradient issue. The signal propagating through the network will never disappear. That's a major win, and the main reason this simple function works so well. By the way, the gradient is only one of ReLU's advantages. The function is very simple, so it's really fast to compute, and it doesn't saturate like sigmoid and tanh do. It just works exceptionally well, and thanks to it we have deep learning today.

And by the way, when I think about this story and how well ReLU works despite its simplicity, I can't help but wonder: what else are we missing today?
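
The notebook shown in the video is not reproduced in the transcript. As a rough sketch of what it describes, ReLU and its derivative could be written in plain Python as below. The names `relu`, `relu_grad`, and `grad_at_zero` are assumptions made for illustration, not the author's code, and the plotting step is replaced by a printed table.

```python
def relu(x: float) -> float:
    """Return x unchanged for positive inputs and 0 for everything else."""
    return x if x > 0 else 0.0


def relu_grad(x: float, grad_at_zero: float = 0.0) -> float:
    """Derivative of ReLU with a conventional value at x == 0.

    ReLU has no derivative at exactly zero, so a value is picked by
    convention. The video returns 0 at that point, which is the default
    here; `grad_at_zero` is a hypothetical parameter for experimenting.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero


if __name__ == "__main__":
    # Print a few (input, output, gradient) rows instead of plotting.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```

The gradient takes only two values away from zero: 0 for negative inputs and 1 for positive ones, which is the property the transcript highlights.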

Original Description

This is a story about the unreasonable effectiveness of the function that made deep learning possible.

Citations: https://gist.github.com/svpino/8c34ecb612f9f66c13f7542a9e5043cc

🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1

📚 My 3 favorite Machine Learning books:
• Deep Learning With Python, Second Edition: https://amzn.to/3xA3bVI
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: https://amzn.to/3BOX3LP
• Machine Learning with PyTorch and Scikit-Learn: https://amzn.to/3f7dAC8

Twitter: https://twitter.com/svpino

Disclaimer: Some of the links included in this description are affiliate links where I'll earn a small commission if you purchase something. There's no cost to you.


The video tells the story of how the Rectified Linear Unit (ReLU) activation function enabled deep learning to solve complex problems. It covers the history of deep learning, the limitations of earlier activation functions, and how ReLU overcame the vanishing gradient problem. By the end of the video, viewers will understand the importance of ReLU in deep learning and how to implement it in a neural network.
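
The transcript's point that, without a non-linear activation, adding layers does not help can be checked with a small sketch. The 2x2 weights below are arbitrary values chosen for illustration, biases are left out for brevity, and the helper names are assumptions.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Arbitrary illustrative weights for two layers (not from the video).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [3.0, 2.0]]


def two_linear_layers(x):
    """Two layers with no activation in between."""
    return matvec(W2, matvec(W1, x))


def one_linear_layer(x):
    """A single layer whose weights are the product W2 * W1."""
    return matvec(matmul(W2, W1), x)


def two_layers_with_relu(x):
    """Same two layers, with ReLU applied to the hidden values."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


if __name__ == "__main__":
    x = [1.0, 2.0]
    print(two_linear_layers(x))     # same as the single layer below
    print(one_linear_layer(x))
    print(two_layers_with_relu(x))  # differs once ReLU is inserted
```

For this input, the two stacked linear layers give exactly the same output as one layer with merged weights, while inserting ReLU between them changes the result. That is the sense in which depth alone adds nothing without a non-linearity.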

Key Takeaways
  1. Understand the concept of activation functions in neural networks
  2. Learn about the limitations of earlier activation functions, such as sigmoid and tanh
  3. Implement ReLU in a neural network
  4. Train a neural network using ReLU to solve a complex problem
  5. Understand the vanishing gradient problem and how ReLU overcomes it
💡 The Rectified Linear Unit (ReLU) activation function is a simple yet powerful function that enables deep neural networks to solve complex problems by overcoming the vanishing gradient problem.
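
To make the vanishing gradient idea concrete, here is a simplified sketch that multiplies the same local gradient once per layer. It deliberately ignores the weight matrices that real backpropagation also multiplies by, so it only illustrates the trend; `chained_gradient` is a hypothetical helper, and the depths of 2 and 6 echo the playground experiment in the transcript.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid; its maximum is 0.25, reached at x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh; its maximum is 1, reached only at x == 0."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x: float) -> float:
    """Derivative of ReLU, using 0 at x == 0 as in the video."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, x: float, depth: int) -> float:
    """Product of `depth` identical local gradients (weights ignored)."""
    return grad_fn(x) ** depth


if __name__ == "__main__":
    for depth in (2, 6, 20):
        print(
            f"depth={depth:2d}  "
            f"sigmoid(best case)={chained_gradient(sigmoid_grad, 0.0, depth):.3e}  "
            f"tanh(x=1)={chained_gradient(tanh_grad, 1.0, depth):.3e}  "
            f"relu(x=1)={chained_gradient(relu_grad, 1.0, depth):.1f}"
        )
```

Even in sigmoid's best case the product shrinks by a factor of four per layer, while ReLU's stays at 1 for positive inputs. For negative inputs ReLU's gradient is 0, so the sketch covers only the active path.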
