Activation Functions

Data Skeptic · Beginner ·🧬 Deep Learning ·9y ago

Key Takeaways

The video discusses the concept of activation functions in neural networks, including the hyperbolic tangent function, sigmoid function, arc tangent, step function, and rectified linear unit (ReLU), and their applications in introducing non-linearity and optimizing neural networks.

Full Transcript

[Music]

Data Skeptic is the official podcast of dataskeptic.com, bringing you stories, interviews, and mini episodes on topics in data science, machine learning, statistics, and artificial intelligence.

[Music]

A quick correction before we get started. When describing activation functions in this episode, I say arctan a few times when I meant to say hyperbolic tangent function, or tanh. When dealing with real numbers they look almost identical, and I think they might perform about the same. Maybe there's some computational reason to use one over the other; I have to think about that. That's a good question. But hyperbolic tangent is what you see in the literature more often, so please do an audible find and replace during this episode, replacing arctan with tanh. If that sounds pedantic to you, or if that sounds like gibberish, don't worry about it. Lean back and just enjoy the episode.

Mm, I've been baking. Yesterday I baked carrot cake and garbanzo bean banana bread, which is quite good even though it doesn't sound good. Did you try any?

Oh wait, I'm thinking of something else you made with garbanzo beans. What was it?

Chocolate chip garbanzo beans.

What about the pavlova? What is a pavlova, by the way? Most people don't know.

It's like a meringue, in that it's beaten egg whites with sugar, except a pavlova is hard on the outside and soft on the inside, and a meringue is hard and dry all the way through. So a pavlova is just harder to make and requires more skill, i.e. watching the oven and cook. I just looked for paleo recipes, and then I've been using fake sugar to see how it's turned out.

All right, first, why paleo? What's your interest there?

Because I want to cook with whole grains, and I want to cook with less sugar and more protein and more fiber and just more fruits and veggies.

Now talk to me more about this alternative sweetener stuff you get yourself into.

Well, I pronounce it erythritol. I don't know if that's really how it's pronounced; I've never heard anyone say it. It is a sugar alcohol that has zero
impact on your calories and blood glucose level.

And how do you know how much to put in?

On the package it says one-for-one measurement of sugar, but I think it's less sweet than sugar, so I just keep tasting it and adding it as I feel.

So tell me about that process. How do you go about doing that?

Well, I stir it in, and then I either get a spoon out or just use my finger, and I lick it.

All right, so anyone who ever gets offered a sweet by Linda, just be aware: finger was in there.

I wash my hands. Very sterile.

If it's not sweet enough, do you just double the amount of erythritol?

No, I just keep adding a little bit at a time.

Why not just dump the whole box in and see what happens?

Well, I've definitely added too much salt and too much sugar to things, so you always want to be careful and add it slowly. You can't take it out.

Yeah, so once you get kind of in the close range, then you want to move more slowly, right? Let's say the correct amount to include in the recipe is two cups, and at first you try half a cup. You might not even taste the sweetness. Now you've got to dump a bunch more in, right?

Well, I just like to go slow the whole way.

Really? You would just put, like, one grain of sweetness in at a time?

No one can measure one grain, unless you're an ant.

Yeah. So you've got this range of values you have to explore, you know, anywhere between minus infinity and positive infinity, but there's definitely a little range that's most relevant to you, right? So there's kind of a midway point that is your best guess, and then you want to explore around that midway point. That is slightly analogous to the idea of an activation function. So here's how these things work, primarily in neural networks. You have some input signal. So maybe in our network the inputs it takes are the amounts of each ingredient: how much sugar, how much salt, how much flour, et cetera. And the output we want to predict is how good it tastes on a scale of 0 to 1, something like that, where 1 is a perfect taste and 0 is completely
unpalatable. Have you ever had any zeros? Anything you cooked ever completely unpalatable?

I definitely think there have been things I cooked and I was like, yeah, I'm going to eat that, and didn't touch it, so I eventually threw it away.

I don't remember you botching anything. Maybe there were things that we ate slower than other things.

I think we just didn't like it.

I can't say you ever just, like, ruined the ingredients. But if you had a process that might help you correct for that a little bit, learn faster, it would be to apply an activation function to how you explore the space. You remember the terms domain and range from your early math classes?

Domain? What is a domain? Domain is the possible x values. Range, I feel like range is just a set of numbers that is in between. What do you mean by range?

Range is the set of possible outputs of your function, or maybe think of it as the y values. You remember the sine wave, right? The sine wave has an infinite domain, because you can put any value in, but the sine function always gives you a number between 0 and 1 [strictly, sine ranges from -1 to 1]. It has this nice feature that no matter what the inputs are, it maps to that range. That's a useful kind of property, right? So whatever somebody gives you, you know it's confined to some range. We have something similar that we use in activation functions: the sigmoid function, which we talked about before related to logistic regression. The sigmoid (and there's going to be pictures of these in the show notes if anyone wants to check it out), as it goes to the more extreme values, like negative infinity or positive infinity, changing the x doesn't change the y very much; it stays the same. But around zero, changes in x are, in places, a little bit amplified, or at least they have a big impact. So if you were trying to tune something and it was going through a sigmoid function, then most of the effectiveness of your tuning is going to happen right around the zero point, where the bias is for that function. So the activation function does a couple
of things for us. For any input we have, you take in all those inputs and it's like a pass-through. It maps the data from one set of numbers to another, usually in a bounded range. In a neural network, when you're training it, when you're getting it to learn things, everything is kind of in a state of flux, because as you're optimizing it you're potentially changing all the neurons every time, and you need some amount of consistency. You don't want to just say, oh, double the sugar over there, oh, now we have to double the flour to compensate, and the recipe gets all crazy, and maybe you're just jumping around different proportions of the same recipe, you know, like double the whole thing, triple the whole thing. You'd rather have one canonical recipe and optimize to that. One of the steps in finding a way to do that, making the neural network learn very well, is to confine all the outputs to some nice range like 0 to 1. That way you don't have these unbounded values where something keeps getting bigger and bigger and bigger, and maybe in the next step it has to go more and more negative to compensate, or something like that. These activation functions we commonly use also have nice properties. I mentioned sigmoid maps between 0 and 1. Can you guess why that might be useful?

Well, it's always the same range, so you could compare it.

That's true, and it's a little bit like a probability, right? You could say maybe the output represents the degree to which it had the right amount of salt or something, and 1 would be the perfect amount, 0 would be completely the wrong amount, and as you explore the space you get closer and closer to the best optimized value. Now there are other ones, like the arctangent. That one is a similar shape, but it varies between minus 1 and 1. Now that one's interesting, because now you have not only a gradient from 0 to 1 of possible values, how important something is, but
you also have the negative space, so from negative 1 to 1. So you could kind of say, oh, this recipe has too much sugar, I need to turn that down. You can't represent, you know, taking something backwards if you only have positive numbers. That's kind of how we use a negative number. So in cases where maybe you want to penalize for something, you would use something like an arctangent, where it can be minus 1 to 1, and that would tell you, oh, you want to do less of one particular thing if it was a negative value. Similarly, there are things like the step function, which would take whatever the input is, so it looks at all the incoming data and says, all right, if you're above some threshold we'll give you the value of 1; if you're below that threshold you get the value of 0. And there's no real in-between. It's a very on-off switch, kind of a binary thing. So that might be useful in the recipe scenario if you're deciding, hey, maybe I should add a mystery ingredient. Like, should I put cherries in my pavlova?

In it? Most people don't. They put it on top.

So then in that case, if you tried a recipe where the cherries were inside, or at least you offered that as a possible input, maybe that network would learn with a step function: nope, set that to zero. Whenever there are cherries present, I'm going to say I don't like this, or at least that neuron is going to output a zero to kind of represent the fact that no added value came from putting the cherries in. Another popular one, just to mention (we'll probably do a whole episode on this one in particular and why it's good), is ReLU, R-E-L-U, which is this weird abbreviation for rectified linear unit. That one, if you're at or below zero, it sets the value to zero; otherwise it leaves it alone. So it has this kind of nice gating property where it can kind of ignore certain inputs: if they become zero or negative it turns them off; anything positive, it leaves it alone. So these activation functions are there to serve different purposes
and help your network optimize in the right direction. I think sigmoid, arctangent, ReLU, step, these are some of the popular ones, but it seems like there's tons of these out there that people are exploring all the time, and they're really good for helping you kind of confine the inputs you're given into a useful range and allow the optimization of the network to go a little bit more smoothly.

When do you use them?

I use these... well, technically some of these ideas are buried in a lot of different machine learning algorithms, but the places where I'm consciously thinking about activation functions are when I'm doing deep learning applications. So I'll put different types of activation functions on different layers of the network to try and get certain effects. For example, in language, I found I was starting out doing some stuff with language using all sigmoids, because that's what I was used to, and it would kind of say if something was present or not, or how important a word was to a sentence or to a document. But then I discovered that some things actually kind of need to negate the value of the document, and so when I started putting arctangent in there, the network was able to capture some of those properties. You know, the word "not" is especially tricky in language. "I do not like your pavlova" is a very different statement from "I like your pavlova." They differ by only one word, but the presence of that word contributes this negative sentiment, if you will. So you need something that can capture that negative kind of association that you'd have with an input, and that's where I started using arctangent to good success.

So I just need to know: what do you think of my latest carrot cake?

Oh, it's delicious.

Have you tried it?

Yeah, yeah, I had a piece yesterday and a piece this morning.

The carrot cake?

Yeah.

You like it? Is it not sweet enough?

Well, listen, let's define here what we're talking about. What's your objective function?

I'm just
asking a personal opinion question.

Well, we can be optimizing for different things. "Do I like it" is a different question from "should we serve it when guests come over."

Well, what's your answer to B?

Let's take those off the air.

I must know. You could edit it out, but you have to answer.

No, it's good, it's good. I don't know if I'd serve it for guests, but I like it.

But if I put icing on it?

Well, you could put icing on just about anything and serve it. That's no big deal.

Then I could serve it to guests?

Yeah, that's no problem. Then you're just eating icing. Who cares?

That's what I was thinking. I think I should make icing for it.

So you could trick people into finishing it off, because you made a giant pan of it?

I just thought it would taste better with icing.

Yeah, a little bit of icing might help. Now, if you had made a whole ton of these, and made them all in different ways, and used the neural network to learn which was the best recipe, maybe the step function would be appropriate here: should I add icing, yes or no? Or maybe it should be the sigmoid, which would kind of measure the amount of icing that should be involved. But by just doing it randomly, you never build the underlying mathematical model of how to make a delicious whatever it is you're making.

How does one build that?

Well, what are your control inputs? They are all the quantities of ingredients, and then a few things like the temperature you cook it at and the time you cook it for, and those all have a range of options, right? Actually, this is a great example of where you use an activation function. So one thing you have to consider is how long do I bake it for, and your choice is between zero minutes and infinity minutes, because of course, as far as I can tell, the physicists have yet to figure out how to negative-bake something. So it starts at zero and goes to infinity. But of course, baking it for 8 hours versus 80 hours is probably going to have the same effect, I would imagine, right?

Mhm.

So if you've used a sigmoid, you can help to
kind of isolate the useful range to explore, and where your optimization is likely to look, or look more readily, to find a good value that's descriptive, that your optimization can use when it's searching for the ideal bake time. So if you're going to use machine learning in some way, you would just set all those inputs up, provide a bunch of examples of things you've done in the past, and then score each of them, say how delicious it was, and it can try to learn the right combination of ingredients and other inputs, like cooking time and temperature, that produces the most delicious version of the recipe.

I think by that time I would have gotten tired of cooking it and you would have gotten tired of eating it.

Yeah, that may very well be true. It does take a lot of data to do machine learning in most cases, but eventually we'll do a mini on one-shot learning, which tries to balance out this major thirst for data that most algorithms have. Well, thanks as always for joining me, Linda.

Thank you, Kyle.

Data Skeptic is a listener-supported program. To support the show, visit dataskeptic.com and click on the membership tab.

[Music]
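
The functions named in the episode can be written down in a few lines. What follows is a minimal sketch in plain Python, not code from the show: the function names, the default threshold of 0, and the choice to return 0 exactly at the threshold are assumptions made for illustration. It also makes the opening correction concrete: tanh stays within (-1, 1), while arctan stays within (-pi/2, pi/2).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into the range (0, 1)."""
    # Branch on the sign so math.exp is never called on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    """Hyperbolic tangent: S-shaped like the sigmoid, but outputs lie in (-1, 1)."""
    return math.tanh(x)


def arctan(x: float) -> float:
    """Arctangent: also S-shaped, but outputs lie in (-pi/2, pi/2), not (-1, 1)."""
    return math.atan(x)


def step(x: float, threshold: float = 0.0) -> float:
    """On/off switch: 1 above the threshold, otherwise 0.

    Returning 0 exactly at the threshold is an assumption made for this sketch.
    """
    return 1.0 if x > threshold else 0.0


def relu(x: float) -> float:
    """Rectified linear unit: 0 for inputs at or below zero, unchanged otherwise."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print each function side by side for a few sample inputs.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(
            f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={tanh(x):+.3f}  "
            f"arctan={arctan(x):+.3f}  step={step(x):.0f}  relu={relu(x):.1f}"
        )
```

Running the script prints the five functions side by side, which shows the bounded ranges of sigmoid, tanh, and arctan next to the unbounded positive side of ReLU.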

Original Description

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation, which can only scale the data. However, other transformations, like a step function, allow for non-linear properties to be introduced. Activation functions can also help to standardize your data between layers. Some functions, such as the sigmoid, have the effect of "focusing" the area of interest on data. Extreme values are placed close together, while values near its point of inflection change more quickly with respect to small changes in the input. Similarly, these functions can take any real number and map all of them to a finite range such as [0, 1], which can have many advantages for downstream calculation. In this episode, we overview the concept and discuss a few reasons why you might select one function versus another.
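
The "focusing" effect described above can be checked numerically. The sketch below is an illustration under stated assumptions, not something taken from the episode's show notes: `squash_bake_time`, its `midpoint` of 40 minutes, and its `scale` of 10 minutes are hypothetical values picked to mirror the bake-time example. It uses the standard identity that the sigmoid's slope is s(x) * (1 - s(x)), which peaks at 0.25 at the point of inflection and falls toward zero at the extremes.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into the range (0, 1)."""
    # Branch on the sign so math.exp is never called on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_slope(x: float) -> float:
    """Derivative of the sigmoid, s(x) * (1 - s(x)); it is largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def squash_bake_time(minutes: float, midpoint: float = 40.0, scale: float = 10.0) -> float:
    """Map a bake time in [0, inf) into (0, 1), centred on a best-guess midpoint.

    midpoint and scale are hypothetical values chosen only for illustration.
    """
    return sigmoid((minutes - midpoint) / scale)


if __name__ == "__main__":
    # Slope at the point of inflection versus far out in the tail.
    print(sigmoid_slope(0.0), sigmoid_slope(10.0))

    # Ten minutes of difference near the midpoint moves the output a lot...
    near = squash_bake_time(45.0) - squash_bake_time(35.0)
    # ...while 8 hours versus 80 hours is almost indistinguishable.
    far = squash_bake_time(80 * 60.0) - squash_bake_time(8 * 60.0)
    print(near, far)
```

Under these assumed settings, the useful range of bake times sits where the curve is steep, and very long bake times collapse onto nearly the same output value.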

This video teaches the basics of activation functions in neural networks, including their types, applications, and importance in introducing non-linearity and optimizing neural networks. The video provides examples and use cases for different activation functions, making it a useful resource for beginners in machine learning.

Key Takeaways

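Use sigmoid to map values into the range 0 to 1, which makes outputs comparable and lets them be read a little like probabilities
Use hyperbolic tangent (tanh, which the episode calls arc tangent by mistake) when outputs need both negative and positive values, between -1 and 1
Use a step function for on/off decisions: 1 above a threshold, 0 below it
Use ReLU to set zero and negative values to zero while leaving positive values unchanged
Apply different activation functions on different layers of a neural network to get different effects
Recall that the sigmoid also appears in logistic regression, a common model for classification problems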

💡 Activation functions are crucial in introducing non-linearity in neural networks, and different functions have different applications and use cases.
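
One way to see why the non-linearity matters is to stack two linear layers with and without an activation between them. This is a small sketch with arbitrary, made-up weights and biases (2.0, 1.0, -3.0, 0.5) and hypothetical function names. Without an activation the two layers collapse into a single linear function; with ReLU in between they do not.

```python
def relu(x: float) -> float:
    """Rectified linear unit: 0 for inputs at or below zero, unchanged otherwise."""
    return max(0.0, x)


def layer1(x: float) -> float:
    """First linear layer; the weight 2.0 and bias 1.0 are arbitrary."""
    return 2.0 * x + 1.0


def layer2(x: float) -> float:
    """Second linear layer; the weight -3.0 and bias 0.5 are arbitrary."""
    return -3.0 * x + 0.5


def stacked_linear(x: float) -> float:
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))


def collapsed(x: float) -> float:
    """-3 * (2x + 1) + 0.5 simplifies to the single linear function -6x - 2.5."""
    return -6.0 * x - 2.5


def stacked_with_relu(x: float) -> float:
    """The same two layers with ReLU applied between them."""
    return layer2(relu(layer1(x)))


def passes_midpoint_test(f, a: float, b: float) -> bool:
    """Check f(a) + f(b) == 2 * f(midpoint), which every function w * x + c satisfies.

    Failing the check proves f is not linear; passing it for one pair of
    points is only consistent with linearity, not a proof of it.
    """
    return abs(f(a) + f(b) - 2.0 * f((a + b) / 2.0)) < 1e-9


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.5):
        print(x, stacked_linear(x), collapsed(x), stacked_with_relu(x))
    print(passes_midpoint_test(stacked_linear, -1.0, 1.0))
    print(passes_midpoint_test(stacked_with_relu, -1.0, 1.0))
```

The stacked linear layers agree with the collapsed function at every input, so the second layer adds no expressive power on its own; inserting ReLU breaks that equivalence.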
