Probability for Machine Learning!

CodeEmporium · Advanced ·🔢 Mathematical Foundations ·3y ago

Key Takeaways

The video covers probability theory for machine learning, including linear regression, random variables, and maximum likelihood estimation, using house listings collected from Zillow.com as the running example of training data.

Full Transcript

Hello everyone, and welcome to another episode of Code Emporium, where we're going to walk through some more math. In this video we're going to relate mathematical concepts like random variables and probability distribution functions to machine learning. There's going to be a lot of math, so let's get started.

Let's start with the machine learning piece and say that we want to construct a model that takes in some inputs and produces an output. This model is going to predict the price of a house given some information about it: let's say the number of bedrooms, the square footage of the house, and the age of the house in years. Given all of this information, the model predicts a price at which the house should sell.

Now let's say that this model is a linear regression model. Because of that, we can mathematically write out what the linear regression hypothesis looks like. I'm going to call the parameters of this linear regression model theta. The price, which is what we want to predict, is equal to theta 3 times the number of bedrooms in the house, plus theta 2 times the square footage, plus theta 1 times the age of the house. We'll also add a constant intercept term, theta 0, and an error term, epsilon, which is some irreducible error. This entire equation is called the linear regression hypothesis. The goal of this entire model construction is to find the values of theta 0, theta 1, theta 2, and theta 3.
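The hypothesis described above can be sketched in a few lines of Python. This is only an illustration: the function name `predict_price` and the parameter values in `theta_example` are assumptions made up for the example, not values from the video, and the error term epsilon is left out because it is not something the model computes.

```python
def predict_price(theta, bedrooms, square_feet, age_years):
    """Deterministic part of the linear regression hypothesis.

    theta = (theta0, theta1, theta2, theta3), following the transcript:
    theta0 is the intercept, theta1 multiplies age, theta2 multiplies
    square footage, and theta3 multiplies the number of bedrooms.
    """
    theta0, theta1, theta2, theta3 = theta
    return theta0 + theta1 * age_years + theta2 * square_feet + theta3 * bedrooms


# Hypothetical parameter values, chosen only for illustration.
# Prices are expressed in thousands of dollars.
theta_example = (50.0, -2.0, 0.2, 30.0)

# House 1 from the transcript: 3 bedrooms, 3,025 square feet, 10 years old.
price = predict_price(theta_example, bedrooms=3, square_feet=3025, age_years=10)
print(price)
```

With these made-up parameters the prediction will not match the observed price of house 1; finding parameters that do fit the data well is exactly what the rest of the video is about.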
We need to find these values given the bedrooms, square footage, age, and price of multiple houses that constitute our training data. So let's talk about how we construct our training data mathematically.

To construct our training data, let's conduct an experiment. In this experiment we go to zillow.com and collect our training data by looking at an individual house, documenting all of its features and the corresponding label, and then moving on to the next house and repeating the process. For example, we go to zillow.com and randomly pick a house. This is house number one. Let's say that this house has three bedrooms, the square footage is 3,025 square feet, and it is a 10-year-old house with a price of $757,000. Once again we randomly go to another house on zillow.com. This is house number two. Let's say that this house has four bedrooms and 3,200 square feet of space, it is an eight-year-old house, and its price is $800,000. Let's say that we document these features and prices for about 10,000 houses. Now we have constructed our training set.

Now that we have a bunch of observations in our training set, we can start asking questions about these houses, so let's write "questions" as a heading. One question that I would like to answer is: how many bedrooms were there in house number two? For this question we are going to construct a random variable. First of all, the outcome: we know the answer to this question, and I'm going to call the outcome omega 2,1, because it is related to house number two and to the first feature of house number two. The outcome is four bedrooms. The random variable here, let's call it X 2,1.
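As a rough sketch, the two houses documented above could be recorded like this. The `House` class, its field names, and the `feature_row` helper are hypothetical names introduced for the example; only the numbers come from the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class House:
    """One observation in the training set (field names are illustrative)."""
    bedrooms: int
    square_feet: float
    age_years: float
    price_thousands: float  # the label, in thousands of dollars


# The two houses described in the transcript.
training_set = [
    House(bedrooms=3, square_feet=3025.0, age_years=10.0, price_thousands=757.0),
    House(bedrooms=4, square_feet=3200.0, age_years=8.0, price_thousands=800.0),
]


def feature_row(house):
    """Features in the order used by the hypothesis: 1 (intercept), age, square feet, bedrooms."""
    return [1.0, house.age_years, house.square_feet, house.bedrooms]


for index, house in enumerate(training_set, start=1):
    print(index, feature_row(house), house.price_thousands)
```

A real training set would hold about 10,000 such records; two are enough to show the shape of the data.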
Random variables are functions that map the outcome of an experiment to some measurable quantity. So X 2,1 takes omega 2,1 as input and outputs just a number, which in this case is 4.

Let's ask another question about this data. Now that we've got the number of bedrooms in house two, what was the size of house number two? In other words, what was the square footage of house two? Similarly, here we are going to use omega 2,2, for the second house's second feature. Omega 2,2 is the outcome, and we observed it as being 3,200 square feet. The random variable we're going to call X 2,2. It is a function that takes omega 2,2 and produces a number, a measurable quantity, which in this case is 3,200, representing the square feet.

Now let's create a third random variable by asking another question: what was the age of house number two? We are going to call this outcome omega 2,3, and it is what we observed, which is eight years old. The random variable, which we're going to call X 2,3, takes this outcome omega 2,3 and maps it to the number 8.

A final question that we want to ask for now is: what is the price of house number two? I'm going to call this outcome simply omega 2, since we haven't used that as an outcome name yet. The outcome is eight hundred thousand dollars, so 800k. The corresponding random variable, let's call it capital Y 2, takes this omega 2 and maps it to 800.
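The idea that a random variable is a function from an outcome to a number can be sketched as follows. The string encodings of the outcomes and the helper names (`x_21`, `x_22`, `x_23`, `y_2`, `WORD_TO_NUMBER`) are assumptions made for this illustration; the video only describes the mapping verbally.

```python
# Outcomes for house 2, written the way an observer might note them down.
omega_21 = "four bedrooms"
omega_22 = "3200 square feet"
omega_23 = "eight years old"
omega_2 = "$800,000"

# Small lookup table used to turn number words into numbers (illustrative only).
WORD_TO_NUMBER = {"three": 3, "four": 4, "eight": 8, "ten": 10}


def x_21(outcome):
    """Random variable X 2,1: maps the bedroom outcome to a count."""
    return WORD_TO_NUMBER[outcome.split()[0]]


def x_22(outcome):
    """Random variable X 2,2: maps the size outcome to square feet."""
    return float(outcome.split()[0])


def x_23(outcome):
    """Random variable X 2,3: maps the age outcome to years."""
    return float(WORD_TO_NUMBER[outcome.split()[0]])


def y_2(outcome):
    """Random variable Y 2: maps the price outcome to thousands of dollars."""
    dollars = float(outcome.strip("$").replace(",", ""))
    return dollars / 1000.0


bedrooms_2 = x_21(omega_21)
square_feet_2 = x_22(omega_22)
age_2 = x_23(omega_23)
price_2 = y_2(omega_2)
print(bedrooms_2, square_feet_2, age_2, price_2)
```

Once the outcomes are numbers, ordinary arithmetic and probability calculations can be applied to them, which is the point of introducing random variables here.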
I'm just going to leave out the thousands of dollars, because house prices in listings are typically in increments of a thousand dollars anyway. So we have now taken an experiment, which is the construction of our training data set, and we're able to ask questions and get quantifiable numbers from it with the help of random variables. With random variables we can perform any kind of mathematics we want, because we have numbers.

Before moving forward with the actual probability theory, and how it's used to estimate parameters, let's classify each of the four random variables that we created as continuous or discrete. In the first case, the number of bedrooms, X 2,1, is a count, and because it is a count it is a discrete random variable. X 2,2 is the square footage. Square footage is a measurement that can take decimal values, so this is a continuous random variable. The third question represents the age of house number two, which is also a measurement and so can be treated as a continuous random variable. And as for the fourth, the price of house number two, it is also a measurement and so it is a continuous random variable. Moving forward, we're interested in predicting the price of the house, so we're going to be especially interested in this last term, Y 2, the price of house number two.

I created four random variables that each correspond to some observation we made in our training data set. In the same way, we can ask four questions and create four random variables for every single house that we observe in the training set. So if we observe 10,000 houses, we can create 40,000 random variables in much the same notation. With all these random variables we have 40,000 numbers on which we can perform some mathematics to achieve our final goal: determining the parameters of the linear regression model that we're training. So let's move to the next step of linking these numbers to probabilities.

All right, so the goal now is to estimate the parameters theta 0, theta 1, theta 2, and theta 3. These parameters are going to be the values that maximize the probability of seeing our training data. In the last section we constructed the training data, and we want to compute the values of the thetas that maximize the probability of seeing every sample. So theta 0, 1, 2, and 3 are given by an arg max: they are the values that maximize the probability of seeing our data set. That is the probability that the random variable Y 1 equals the value we observed in our training data, which was $757,000, that Y 2 equals the value we observed, which was $800,000, and so on until Y 10,000, because that's the size of our training data set and the number of observations we have. And all of this is given the values of theta 0, theta 1, theta 2, and theta 3.
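Written out, the statement above might look like the following. This is a transcription of the spoken description, not a copy of what was drawn on screen; the hats on the thetas are an added convention to mark the estimated values, and the observed prices are written generically as small y, with y_1 = 757 and y_2 = 800 in thousands of dollars.

```latex
\[
(\hat{\theta}_0, \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3)
  = \operatorname*{arg\,max}_{\theta_0,\, \theta_1,\, \theta_2,\, \theta_3}
    P\bigl(Y_1 = y_1,\; Y_2 = y_2,\; \ldots,\; Y_{10000} = y_{10000}
    \,\big|\, \theta_0, \theta_1, \theta_2, \theta_3\bigr)
\]
```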
Now, for the sake of completeness of notation, I'm going to write theta 0, theta 1, theta 2, and theta 3 under the arg max too. This is clearly very cumbersome notation, writing out all of these thetas, so I'm going to simplify by collecting them into a vector that I represent as just capital Theta. I'm going to denote the result as theta hat MLE: MLE stands for maximum likelihood estimation, and the hat means it is an estimate, the prediction that we make of theta.

Something to note here is that this capital P is a probability distribution function, but it's a probability distribution function over continuous random variables, specifically Y 1, Y 2, and so on up to capital Y 10,000. Since these are continuous random variables, this probability distribution function is more specifically called a probability density function. In the next section we're going to replace this notation to represent a probability density function, and also look into the concept of independently and identically distributed samples and what it means for this equation mathematically.

First, samples are independently distributed. This means that for every house sample we have taken, the price of the house is independent of the price of all the other houses in our data set, which is a very reasonable assumption to make. Because of this assumption, the joint probability can be represented as a product of individual probabilities. I'm going to write triple dots here, since it's a product of 10,000 terms, all the way up to the probability that capital Y 10,000 is equal to small y 10,000, given the value of theta.

All right, so now we're also going to replace this more generic notation of P, the probability distribution function, with the probability density function, since we know that each of these is a continuous random variable. To represent the density of a continuous random variable we typically use the notation f followed by a subscript naming that random variable, Y i. Let's see how that looks. Theta hat MLE is equal to the arg max over Theta of a product where the first term is f with the subscript Y 1, since that's the random variable to which this probability density function applies, and it is a function of the value that it can take, which is small y 1, given Theta as usual. We write this for all 10,000 terms.

Now that we have an understanding of what independent distribution means, let's look at the second part, which is identical distributions. In this notation we see that there are 10,000 different probability density functions that we're taking a product of, but in reality all of these probability density functions can be assumed to be the same. What this means practically is that the probability that the price of house number one is between $700,000 and $800,000 is equal to the probability that the price of house number two is between $700,000 and $800,000, and this is the same for every single house. Because of that, we can write all of these individual probability density terms with the same representation, f subscript Y. So I'm going to do just that: samples are identically distributed. From a mathematical notation standpoint, it is exactly the same as the previous statement, but we change the subscripts so that they all match each other. The second term also gets the subscript capital Y, evaluated at small y 2, and even for the 10,000th term it's the same capital Y. I hope it's clear why the independent and identically distributed assumption that we typically see in machine learning is very useful in simplifying the mathematics itself.

Let's write this in a much more concise form with product notation. Theta hat MLE is equal to the arg max over Theta, and we introduce the product notation, which looks like a huge pi symbol. The iterator is i, which goes from 1 to 10,000 since that's how many training samples we have, and the term being iterated is f subscript Y of y i, given the value of Theta. This is a very good generic endpoint. Depending on the machine learning model and algorithm we are considering, we would simplify it further, because different models make different assumptions about this probability density function.

In this case we have a linear regression model, and for linear regression we typically assume this to be a normal distribution. So let me write that out: for linear regression, the f Y term is assumed to be normally distributed, where the mean is the prediction of the model itself, which we represent as y i hat, and the standard deviation is sigma, so we write sigma squared for the variance. If you substitute this and expand it as the probability density function of the normal distribution, you will see that the value of Theta is the value that minimizes the residual sum of squares.

Let me write that out more concretely. What this means is that theta hat MLE is equal to the arg max over Theta of negative one times the sum, written with a capital sigma, where i ranges from 1 to 10,000, of y i, the actual value, minus y i hat, the predicted value, the whole thing squared. This term here is the residual sum of squares: the residual is the error y i minus y i hat, each residual is squared, and the squares are summed.

How exactly you get from the product term to this term I'll leave to you as an assignment for now, but the hint is that you just need to replace the density term with the normal distribution, taking the mean as the prediction and the variance as some constant sigma squared, and then take logarithms. Because the logarithm is an increasing function, the value that maximizes the logarithm is also the value that maximizes the function itself. You will end up with this expression, and we can equally write it as the arg min of the sum of squared residuals. What this implies is that the value of Theta that is optimal for our machine learning model is the value that minimizes the residual sum of squares. I hope this also explains why you see the residual sum of squares so often, especially associated with linear regression and with a lot of other algorithms that rely on the normal distribution.

Thank you all so much for watching. I have a related blog post, a written version of everything we've covered here and in the last five videos, posted on Medium; the link should be in the description below. Please do follow me on Medium for updates, as well as on my channel here. I appreciate the support. Thank you all so much for watching, and I will see you in another one. Bye!
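The step left as an assignment in the transcript, going from the product of normal densities to the residual sum of squares, can be sketched as follows. This is one standard way to fill it in, under the transcript's assumptions that each density is normal with mean \hat{y}_i (the model's prediction, which depends on Theta) and a constant variance \sigma^2 that does not depend on Theta.

```latex
\begin{align*}
\hat{\Theta}_{\mathrm{MLE}}
  &= \operatorname*{arg\,max}_{\Theta} \prod_{i=1}^{10000} f_Y(y_i \mid \Theta)
  && \text{(i.i.d. samples)} \\
  &= \operatorname*{arg\,max}_{\Theta} \prod_{i=1}^{10000}
     \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
  && \text{(normal density, mean } \hat{y}_i\text{)} \\
  &= \operatorname*{arg\,max}_{\Theta} \sum_{i=1}^{10000}
     \left[-\tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)
     - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right]
  && \text{(take logarithms)} \\
  &= \operatorname*{arg\,max}_{\Theta} \; -\sum_{i=1}^{10000} (y_i - \hat{y}_i)^2
  && \text{(drop terms and positive factors free of } \Theta\text{)} \\
  &= \operatorname*{arg\,min}_{\Theta} \; \sum_{i=1}^{10000} (y_i - \hat{y}_i)^2 .
\end{align*}
```

The third line uses the fact that the logarithm is increasing, so it does not move the location of the maximum; the fourth uses the fact that adding a constant or multiplying by the positive constant 1/(2\sigma^2) does not move it either.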

Original Description

Here is all the probability theory you need for machine learning

⭐ Playlist for this probability in machine learning series (this was the 6 / 6th video): https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V

MEDIUM
⭐ Blog post on probability fundamentals in Machine Learning: https://towardsdatascience.com/probability-for-machine-learning-b4150953df09
📕 Maximum Likelihood Estimation: https://towardsdatascience.com/likelihood-probability-and-the-math-you-should-know-9bf66db5241b

CHAPTERS
0:00 Linear Regression + Machine Learning
3:44 How Random Variables fit in
12:22 Maximum Likelihood + Probability Density Functions
16:18 Math derivation (with iid assumption, notation and more)

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow


This video teaches the probability theory needed for machine learning, covering linear regression, random variables, and maximum likelihood estimation. It works through one example, predicting house prices, to show why maximizing the likelihood under a normal distribution amounts to minimizing the residual sum of squares.

Key Takeaways
  1. Collect training data using tools like Zillow.com
  2. Construct a linear regression model to predict outcomes
  3. Define random variables to represent outcomes of experiments
  4. Use maximum likelihood estimation to estimate model parameters
  5. Assume independent and identically distributed data
  6. Use product notation to simplify mathematics
  7. Substitute the normal distribution for the probability density function
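💡 When the price is assumed to be normally distributed around the model's prediction, the maximum likelihood estimate of the parameters is the one that minimizes the residual sum of squares, which is why that quantity appears so often with linear regression and other algorithms that rely on the normal distribution.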
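The takeaways above can be checked numerically with a small sketch. Everything here is an assumption made for illustration: the data is synthetic rather than collected from Zillow.com, only one feature (square footage in thousands) plus an intercept is used to keep the closed-form fit short, and the names `least_squares_fit`, `residual_sum_of_squares`, `gaussian_log_likelihood`, and `SIGMA` are made up for the example.

```python
import math
import random

SIGMA = 25.0  # assumed constant standard deviation of the noise (thousands of dollars)


def least_squares_fit(xs, ys):
    """Closed-form intercept and slope that minimize the residual sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_sum_of_squares(intercept, slope, xs, ys):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def gaussian_log_likelihood(intercept, slope, xs, ys, sigma):
    """Log of the product of normal densities with mean equal to the prediction."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (intercept + slope * x)
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) - residual ** 2 / (2 * sigma ** 2)
    return total


# Synthetic training data: price = 100 + 200 * size + normal noise.
rng = random.Random(0)
xs = [rng.uniform(1.0, 4.0) for _ in range(200)]
ys = [100.0 + 200.0 * x + rng.gauss(0.0, SIGMA) for x in xs]

intercept_hat, slope_hat = least_squares_fit(xs, ys)
best_rss = residual_sum_of_squares(intercept_hat, slope_hat, xs, ys)
best_ll = gaussian_log_likelihood(intercept_hat, slope_hat, xs, ys, SIGMA)

# Moving away from the least-squares parameters should raise the residual
# sum of squares and lower the likelihood at the same time.
comparisons = []
for d_intercept, d_slope in [(5.0, 0.0), (0.0, 2.0), (-3.0, -1.0)]:
    other_rss = residual_sum_of_squares(intercept_hat + d_intercept, slope_hat + d_slope, xs, ys)
    other_ll = gaussian_log_likelihood(intercept_hat + d_intercept, slope_hat + d_slope, xs, ys, SIGMA)
    comparisons.append(other_rss > best_rss and other_ll < best_ll)

print(all(comparisons))
```

The log-likelihood is a constant minus the residual sum of squares divided by 2 * SIGMA squared, so the two quantities always move in opposite directions; the perturbations are just a spot check of that relationship.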

Chapters (4)

0:00 Linear Regression + Machine Learning
3:44 How Random Variables fit in
12:22 Maximum Likelihood + Probability Density Functions
16:18 Math derivation (with iid assumption, notation and more)