The Function That Changed Everything

Underfitted · Beginner · 🧬 Deep Learning · 3y ago

Skills: Neural Network Basics 90% · ML Maths Basics 80% · Supervised Learning 60%

Key Takeaways

The video discusses the history and development of deep learning, highlighting the importance of activation functions, particularly the Rectified Linear Unit (ReLU), in enabling deep neural networks to solve complex problems. It covers the limitations of earlier activation functions, such as sigmoid and tanh, and how ReLU overcame the vanishing gradient problem.

Full Transcript

[Music]

I dare you to think of anything moving faster than deep learning. But what if I tell you that behind all of it, the complexity, the hardware, the data, behind all of that there is a simple piece that took us years to find? Let me tell you the story about the unreasonable effectiveness of the function that made deep learning possible.

The thinking machine. It's 1958, and The New York Times publishes an article about a device that will be able to walk, talk, see, and write. They just interviewed the scientist that's about to change the world. His name is Frank Rosenblatt, an American psychologist who published this report one year before, in 1957. Here Rosenblatt proposes the construction of what many consider the first predecessor of the neural networks we use today: the perceptron. Scientists built a few working perceptrons, as these artificial brains were called. April 1957: this is a receipt from the lab acknowledging the report, and here is what it says about the project: "designing, fabricating, and evaluating an electronic brain model."

We've been trying to figure out neural networks for a long time. Fast forward 50 years, mid to late 2000s, and we still couldn't train a neural network with more than a couple of hidden layers. But to understand what's missing, we first need to talk about something we call activation functions.

Regardless of how big neural networks are, by default they can only solve linear problems. Unfortunately, most things that matter are more complex than that. Here is one example: imagine you have two classes, orange and blue, and you want to draw a function that separates them. A neural network cannot solve this problem unless we use non-linear activation functions. Number of layers, number of neurons: none of that matters. Here is what Wikipedia says about activation functions: "only non-linear activation functions allow such networks to compute non-trivial problems." We need these activations to create some sort of bump, a disturbance that will allow networks to solve all sorts
of problems like this one here.

But we've known about activation functions for a long time, and that wasn't enough. Something was not right with these functions. Sigmoid and tanh were by far the two most popular activation functions back then. Look at the blue lines, and don't worry about the red lines for now. These functions checked every single box we needed to train neural networks. Well, almost every single box. To tackle more complex problems like image recognition, text generation, and audio translation, we needed deeper networks, but as soon as we tried with more than a few layers, neural networks wouldn't work at all.

Let me show you this. This here is the TensorFlow Playground. Anytime I want to play with neural networks I come here, because I don't have to write any code and it's really easy for me to try any of my wacky theories. Here, this is configured to solve the same problem I showed you before. I'm gonna add a few more hidden layers to this network, I'm gonna change the activation to sigmoid, and then I'm gonna click play. I did this before and I let it run for a long time before I stopped it: over 5,000 iterations, and the network could not solve the problem.

But wait, that's not necessarily an issue, right? Maybe sigmoid is not good enough to solve this particular problem. Except I ran the same experiment, but instead of using six hidden layers I used just two, and the network solved it. This was the thing preventing deep learning from becoming something: we couldn't train deeper networks because they wouldn't work.

"My feeling is, if you want to understand a really complicated device like a brain, you should build one."

That was Geoffrey Hinton's voice. He played a central role in making deep learning a reality. But to appreciate what happened, we first need to understand why these activation functions didn't work with deep networks.

Time to look at the red lines now. These are the derivatives of the functions. We use these gradients during backpropagation to update the weights of
the network. The deeper the network, the more iterations we need. I'm not gonna get too deep into the math here, but if the gradients are smaller than one and you multiply a bunch of them, the results will get smaller and smaller. Look at the gradients of these two functions. The maximum possible value of sigmoid's gradient is 0.25. That's really small. And for tanh it's one, but that only happens at this particular point. The gradient is very small everywhere else. And that right there is the problem: the deeper the network, the smaller the updates get, until they're so small that the network dies. We call this the vanishing gradient problem, and that's why these functions did not work.

"We should look at biology, and we should try and make systems that work roughly like the brain."

OK, it's 2010, and a paper comes out that proposes an idea so simple that it looks ridiculous. They show how a function they call the rectified linear unit solves the problem they had with the other activation functions. Here is what the paper says: "Rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors." This was the function. That's it. This was the crucial missing piece.

Nair and Hinton wrote the paper, and although they made this function popular, I found references to it from decades before, like in this paper from 1975.
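The vanishing-gradient arithmetic described in the transcript above can be checked with a few lines of Python. This is a minimal sketch, not code from the video: the function names are made up for illustration, and it assumes the best case where every layer sits exactly at the peak of its activation's derivative and the weights are ignored. Real networks do worse than this, since most units sit away from the peak.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s(x) * (1 - s(x)). Peaks at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2. Peaks at x = 0."""
    return 1.0 - math.tanh(x) ** 2


def chained_gradient(local_grad: float, depth: int) -> float:
    """Product of `depth` identical local gradients.

    Simplifying assumption: weights are ignored, so this only shows
    the contribution of the activation derivatives to the chain rule.
    """
    return local_grad ** depth


max_sigmoid_grad = sigmoid_grad(0.0)  # the 0.25 maximum mentioned in the transcript
max_tanh_grad = tanh_grad(0.0)        # 1.0, but only at this single point

for depth in (2, 6, 10):
    print(f"depth={depth:2d}  sigmoid chain={chained_gradient(max_sigmoid_grad, depth):.2e}")

# Away from zero, tanh's gradient drops quickly as well.
print(f"tanh gradient at x=2: {tanh_grad(2.0):.3f}")
```

Even in this best case, six sigmoid layers shrink the signal to roughly a quarter of a thousandth of its original size, which is consistent with the two-layer versus six-layer experiment described earlier.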
Fukushima doesn't give this function a name, but that's the rectified linear unit in the context of neural networks.

But here is the most surprising part: this function that works so well doesn't even meet one of the most basic requirements of an activation function. This function is not differentiable. So how come the simplest function, one that doesn't even meet the requirements, is the one that makes everything work?

Let's look at the laptop. I have a very simple notebook here to plot the rectified linear unit. For short, we call it ReLU. If I run this cell, we get the chart. Here is the plot of the ReLU function. The x-axis is the input to the function, while the y-axis is the output of it. Notice how ReLU returns 0 for any negative input, and it doesn't touch positive values. At the point where x equals zero, we cannot compute the derivative of the function, and that's a big problem in theory. It turns out that in practice we can return a specific value for that particular point and everything works fine. Here I'm returning zero, and if I plot the derivative, it's the red line on the chart.

One final thing for you to notice on this plot: the gradient is either zero for any negative values or one for positive values. That means that ReLU doesn't suffer from the vanishing gradient issue. The signal propagating through the network will never disappear. That's a major win, and the main reason this simple function works so well.

By the way, the gradient is only one of ReLU's advantages. The function is very simple, so it's really fast to compute, and it doesn't saturate like sigmoid and tanh do. It just works exceptionally well, and thanks to it we have deep learning today. And by the way, when I think about this story and how well ReLU works despite its simplicity, I can't help but wonder: what else are we missing today?
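The ReLU function and the derivative convention described in the transcript can be sketched as follows. This is not the author's notebook (which plots the curves); it is a plain-Python illustration with hypothetical function names, and it adopts the same choice mentioned in the video of returning zero for the derivative at x = 0.

```python
def relu(x: float) -> float:
    """Rectified linear unit: 0 for negative inputs, identity for positive ones."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU.

    ReLU is not differentiable at x == 0, so a value has to be picked
    there. Following the transcript, this sketch returns 0 at that point.
    """
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

for x, y, g in zip(inputs, outputs, grads):
    print(f"x={x:5.1f}  relu={y:4.1f}  grad={g:3.1f}")

# For positive inputs the local gradient is exactly 1, so multiplying
# many of them together does not shrink the result.
ten_layer_chain = relu_grad(1.0) ** 10
print("chain of ten positive-side gradients:", ten_layer_chain)
```

The last line is the contrast with sigmoid: a chain of ten gradients equal to one is still one, whereas a chain of ten gradients capped at 0.25 is close to zero.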

Original Description

This is a story about the unreasonable effectiveness of the function that made deep learning possible.

Citations: https://gist.github.com/svpino/8c34ecb612f9f66c13f7542a9e5043cc

🔔 Subscribe for more stories: https://www.youtube.com/@underfitted?sub_confirmation=1

📚 My 3 favorite Machine Learning books:
• Deep Learning With Python, Second Edition: https://amzn.to/3xA3bVI
• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: https://amzn.to/3BOX3LP
• Machine Learning with PyTorch and Scikit-Learn: https://amzn.to/3f7dAC8

Twitter: https://twitter.com/svpino

Disclaimer: Some of the links included in this description are affiliate links where I'll earn a small commission if you purchase something. There's no cost to you.


Playlist

Uploads from Underfitted · Underfitted · 21 of 60


The video tells the story of how the Rectified Linear Unit (ReLU) activation function enabled deep learning to solve complex problems. It covers the history of deep learning, the limitations of earlier activation functions, and how ReLU overcame the vanishing gradient problem. By the end of the video, viewers will understand the importance of ReLU in deep learning and how to implement it in a neural network.

Key Takeaways

Understand the concept of activation functions in neural networks
Learn about the limitations of earlier activation functions, such as sigmoid and tanh
Implement ReLU in a neural network
Train a neural network using ReLU to solve a complex problem
Understand the vanishing gradient problem and how ReLU overcomes it
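To make the first takeaway concrete, here is a small sketch of why the transcript says that layer count does not matter without a non-linear activation: two stacked linear layers collapse into a single linear layer, while placing ReLU between them breaks that equivalence. The weights and input are made-up values for illustration, biases are omitted for brevity, and the helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both as lists of rows)."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def relu_vec(v):
    """Apply ReLU element-wise."""
    return [max(0.0, vi) for vi in v]


# Illustrative weights for two layers and one input point (assumed values).
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [1.0, -1.0]

# Two linear layers applied in sequence...
two_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer whose weights are the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# With ReLU in between, the network is no longer a single linear map.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print("two linear layers:", two_layers)
print("single merged layer:", one_layer)
print("with ReLU between:", with_relu)
```

The same collapse applies to any number of linear layers, which is why depth alone cannot separate classes that are not linearly separable.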

💡 The Rectified Linear Unit (ReLU) activation function is a simple yet powerful function that enables deep neural networks to solve complex problems by overcoming the vanishing gradient problem.
