Probability for Machine Learning!
Key Takeaways
The video covers probability theory for machine learning, including linear regression, random variables, and maximum likelihood estimation, using tools like Zillow.com for data collection.
Full Transcript
Hello everyone, and welcome to another episode of Code Emporium, where we're going to walk through some more math. In this video we're going to relate mathematical concepts like random variables and probability distribution functions to machine learning. There's going to be a lot of math, so let's get started.
Let's start with the machine learning piece. Say we want to construct a model that takes in some inputs and produces an output: it predicts the price of a house given some information about it. Let's say that information is the number of bedrooms, the square footage of the house, and the age of the house in years. Given all of this information, the model predicts the price at which the house should sell.
Now let's say this is a linear regression model. Because of that, we can write out mathematically what the linear regression hypothesis looks like. I'm going to call the parameters of the model theta. The price, which is what we want to predict, is equal to theta 3 times the number of bedrooms, plus theta 2 times the square footage, plus theta 1 times the age of the house, plus a constant intercept term, theta naught, plus an error term epsilon, which is some irreducible error. This entire equation is called the linear regression hypothesis. The goal of this model construction is to find the values of theta 0, theta 1, theta 2, and theta 3.
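A minimal sketch of the linear regression hypothesis described above, leaving out the irreducible error term epsilon. The function name `predict_price` and the parameter values are hypothetical and chosen only for illustration; they are not fitted to any data. Prices are in thousands of dollars.

```python
def predict_price(bedrooms, sqft, age, theta):
    """Linear regression hypothesis without the error term.

    theta is (theta0, theta1, theta2, theta3), matching the transcript:
    theta3 multiplies bedrooms, theta2 square footage, theta1 age,
    and theta0 is the intercept.
    """
    theta0, theta1, theta2, theta3 = theta
    return theta3 * bedrooms + theta2 * sqft + theta1 * age + theta0


# Hypothetical parameter values (an assumption, not learned from data).
theta = (50.0, -2.0, 0.2, 30.0)

# House number one from the transcript: 3 bedrooms, 3,025 sq ft, 10 years old.
predicted = predict_price(bedrooms=3, sqft=3025, age=10, theta=theta)
print(f"predicted price: {predicted:.1f} thousand dollars")
```

Training the model means replacing the made-up `theta` with values estimated from the data, which is what the rest of the video builds toward.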
We need to find these values given the number of bedrooms, square footage, age, and price of multiple houses that constitute our training data. So let's talk about how we construct our training data mathematically.
To construct our training data, let's conduct an experiment. In this experiment we go to zillow.com and collect our training data by looking at an individual house, documenting all of its features and the corresponding label, and then moving on to the next house and repeating the process. For example, we go to zillow.com and randomly pick a house. This is the first house: let's say it has three bedrooms, its square footage is 3,025 square feet, and it is a 10-year-old house priced at $757,000. Once again we randomly go to another house on zillow.com. This is house number two: let's say it has four bedrooms and 3,200 square feet of space, it is an eight-year-old house, and its price is $800,000. Let's say we document these features and prices for about 10,000 houses. Now we have constructed our training set.
Now that we have a bunch of observations in our training set, we can start asking questions about these houses. One question I would like to answer is: how many bedrooms were there in house number two? For this question we are going to construct a random variable. First, the outcome, which we already know. I'm going to call the outcome of this specific question omega_{2,1}, because it relates to house number two and to the first feature of house number two, and it is four bedrooms. The random variable here, let's call it X_{2,1}.
Random variables are functions that map the outcome of an experiment to some measurable quantity. So X_{2,1} takes omega_{2,1} as input and outputs just a number, which in this case is 4.
Now let's ask another question about this data. Now that we've got the number of bedrooms in house two, what was the size of house number two? That is, what was the square footage of house two? Similarly, here we use omega_{2,2}, for the second house's second feature. Omega_{2,2} is the outcome, and we observed it as being 3,200 square feet. The random variable we're going to call X_{2,2}: it is a function that takes omega_{2,2} and produces a number, a measurable quantity, which in this case is 3,200, representing the square feet.
Now let's create a third random variable by asking another question: what was the age of house number two? Here we call the outcome omega_{2,3}, and it is what we observed, which is eight years old. The random variable X_{2,3} takes this outcome omega_{2,3} and maps it to the number 8.
A final question we want to ask for now, the fourth, is: what is the price of house number two? This one I'm going to call just omega_2, since we don't have that as an outcome name yet. It is $800,000, so 800k. The corresponding random variable, let's call it capital Y_2, takes omega_2 and maps it to 800.
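The idea of a random variable as a function from an outcome to a number can be sketched in code. This is only an illustration: the dictionaries `features` and `labels` and the functions `X` and `Y` are hypothetical names, and representing an outcome as a `(value, unit)` pair is an assumption made here to keep the outcome distinct from the number it maps to.

```python
# Outcomes omega_{i,j}: what was observed for feature j of house i.
features = {
    (1, 1): (3, "bedrooms"),
    (1, 2): (3025.0, "sqft"),
    (1, 3): (10.0, "years"),
    (2, 1): (4, "bedrooms"),
    (2, 2): (3200.0, "sqft"),
    (2, 3): (8.0, "years"),
}

# Outcomes omega_i: the observed listing price of house i.
labels = {1: (757_000, "USD"), 2: (800_000, "USD")}


def X(omega):
    """Random variable for a feature: map the outcome to a plain number."""
    value, _unit = omega
    return value


def Y(omega):
    """Random variable for the price, expressed in thousands of dollars."""
    dollars, _unit = omega
    return dollars / 1000


x_21 = X(features[(2, 1)])  # number of bedrooms in house two
x_22 = X(features[(2, 2)])  # square footage of house two
x_23 = X(features[(2, 3)])  # age of house two in years
y_2 = Y(labels[2])          # price of house two in thousands of dollars
print(x_21, x_22, x_23, y_2)
```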
I'm just going to leave out the thousands of dollars, because house prices in listings are typically in increments of a thousand dollars anyway. So we have taken an experiment, which produced our training data set, and we are now able to ask questions and get quantifiable numbers from it with the help of random variables. And with random variables we can perform any kind of mathematics we want, because we have numbers.
Before moving forward with the actual probability theory, how this relates to it, and how it is used to estimate parameters, let's classify each of the four random variables we created as continuous or discrete. In the first case, the number of bedrooms, X_{2,1}, is a count, and because it is a count it is a discrete random variable. X_{2,2} is the square footage. Square footage is a measurement, and because it is a measurement this is a continuous random variable; square footage can take decimal values. The third question concerns the age of house number two. Age is also a measurement, so X_{2,3} can be treated as a continuous random variable. And as for the fourth, the price of house number two, this is also a measurement, so Y_2 is a continuous random variable. Moving forward, we're interested in predicting the price of the house, so we're going to be especially interested in this term, the price of house number two.
I created four random variables that each correspond to some observation we made in our training data set. In the same way, we can ask four questions and create four random variables for every single house in the training set. So if we observe 10,000 houses, we can create 40,000 random variables in much the same notation. With all of these random variables we have 40,000 numbers on which we can perform some mathematics to achieve our final goal of determining the parameters of the linear regression model we're training. So let's move to the next step of linking these numbers to probabilities.
All right, so the goal now is to estimate the parameters theta 0, theta 1, theta 2, and theta 3. These parameters are going to be the values that maximize the probability of seeing our training data. In the last section we saw the training data; we want to maximize the probability of seeing every sample, and hence compute the values of the thetas that maximize that probability. So theta 0, theta 1, theta 2, and theta 3 are given by an arg max: they are the values that maximize the probability of seeing our data set. That is the probability of the random variable Y_1 being equal to the value we observed in our training data, which I think was $757,000, and Y_2 being equal to another value, which we saw was $800,000, and so on up to Y_10,000, because that is the size of our training data set, the number of observations we have. And all of this is conditioned on the values of theta 0, theta 1, theta 2, and theta 3.
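The objective described in words above can be written out as follows. This is a reconstruction from the spoken description rather than a copy of what appears on screen; the hats on the left-hand side, marking estimated values, are notation added here.

```latex
\left(\hat{\theta}_0, \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3\right)
  = \arg\max_{\theta_0, \theta_1, \theta_2, \theta_3}
    P\left(Y_1 = y_1,\; Y_2 = y_2,\; \dots,\; Y_{10000} = y_{10000}
      \,\middle|\, \theta_0, \theta_1, \theta_2, \theta_3\right)
```

Here each small y_i is the price observed for house i, for example the prices quoted for houses one and two.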
For completeness of notation, I'm also going to write theta 0, theta 1, theta 2, and theta 3 under the arg max. This is clearly very cumbersome notation, writing out all of these thetas, so I'm going to simplify by collecting them into a vector that I represent as just capital Theta. I'm going to call the result theta hat MLE: MLE stands for maximum likelihood estimation, and the hat means it is an estimate of theta.
Something to note here is that this capital P is a probability distribution function, but it is a probability distribution function over continuous random variables, specifically capital Y_1, Y_2, and so on up to capital Y_10,000. Since these are continuous random variables, this probability distribution function is more specifically called a probability density function. In the next section we're going to replace this notation to represent a probability density function, and also look into the concept of independently and identically distributed samples and what it means for this equation mathematically.
First, samples are independently distributed. This means that for every house sample we have taken, the price of the house is independent of the prices of all the other houses in our data set, which is a very reasonable assumption to make. Because of this assumption, the joint probability can be represented as a product of probability distributions: the probability that Y_1 equals small y_1 given theta, times the probability that Y_2 equals small y_2 given theta, and so on. I'm going to write triple dots here, since it's a product of 10,000 terms, all the way up to the probability that Y_10,000 equals small y_10,000 given the value of theta.
All right, so now we're also going to replace this more generic notation P, the probability distribution function, with the probability density function, since we know each of these is a continuous random variable. To represent the density of a continuous random variable we typically use the notation f followed by a subscript naming that random variable, Y_i. Let's see how that looks. Theta hat MLE is equal to the arg max over theta of a product in which the first term is f with subscript Y_1, since that is the random variable over which this probability density function applies, as a function of the value it can take, which is small y_1, given theta as usual. And we write this for all 10,000 terms.
Now that we have an understanding of what independently distributed means, let's look at the second part, identically distributed. In this notation we see 10,000 different probability density functions that we're taking a product of, but in reality all of these can be assumed to be the same. What this means practically is that the probability that the price of house number one is between $700,000 and $800,000 is equal to the probability that the price of house number two is between $700,000 and $800,000, and this is the same for every single house. Because of that, we can write all of these individual probability density terms with the same representation, f subscript Y. So I'm going to do just that: samples are identically distributed. In terms of mathematical notation it is exactly the same as the previous statement, but we change the subscripts so that they all match each other. So the second term also has subscript Y, but is evaluated at y_2, and even for the 10,000th term it is the same subscript Y. I hope it's clear why the independent and identically distributed assumption that we typically see in machine learning is very useful in simplifying the mathematics.
Let's write this in a much more concise form with product notation. Theta hat MLE is equal to the arg max over theta, and we introduce product notation, which looks like a huge pi symbol, where the iterator i goes from 1 to 10,000, since that is how many training samples we have, of f subscript Y evaluated at y_i, given the value of theta. This is actually a very good generic endpoint. Depending on the machine learning model and algorithm we are considering, we would simplify this further, because each requires different assumptions about this probability density function.
In this case we have a linear regression model, and for linear regression we typically assume this to be a normal distribution. So let me write that out: for linear regression, the f_Y term is assumed to be normal, where the mean is the prediction of the model itself, which we represent as y_i hat, and the standard deviation I'll let be sigma, so we write sigma squared for the variance. If you substitute this in and expand it as the probability density function of the normal distribution, you will see that theta hat MLE is the value of theta that minimizes the residual sum of squares. Let me write that out more concretely. Theta hat MLE is equal to the arg max over theta of negative one times the sum (the capital sigma summation sign), where i ranges from 1 to 10,000, of y_i, the actual value, minus y_i hat, the predicted value, the whole squared. This term is the residual sum of squares: the residual is y_i minus y_i hat, we square those residuals, and we sum the squares.
How exactly you get from the product of densities to this term I'll leave to you as an assignment for now, but the hint is that you replace the density with the normal distribution, taking the mean as the prediction and the variance as some constant sigma squared, and then take logarithms, since the value that maximizes the logarithm of a function is also the value that maximizes the function itself. You will end up with this expression, and we can also write it as the arg min of the residual sum of squares. What this implies is that the value of theta that is optimal for our machine learning model is the value that minimizes the residual sum of squares. I hope this also explains why you see the residual sum of squares so often, especially with linear regression and with many other algorithms that rely on the normal distribution.
Thank you all so much for watching. I have a related blog post covering in written form everything we've covered here and in the last five videos; the link is in the description below, posted on Medium. Please do follow me for updates on Medium as well as on my channel here. I appreciate the support. Thank you all so much for watching, and I will see you in another one. Bye.
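The claim that maximizing the normal likelihood is the same as minimizing the residual sum of squares can be checked numerically. Under the normal assumption, the log of the product of densities is -(n/2) log(2 pi sigma^2) - RSS(theta) / (2 sigma^2), and only the RSS term depends on theta. The sketch below uses synthetic data, not Zillow listings: the sample size, the "true" parameters, and the noise level are all made-up assumptions, and `rss`, `log_likelihood`, and `theta_ls` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # synthetic sample size (the video uses 10,000 real listings)

# Columns ordered so that theta[k] matches theta_k in the transcript:
# intercept, age (years), square footage, number of bedrooms.
X = np.column_stack([
    np.ones(n),
    rng.uniform(0, 50, n),
    rng.uniform(800, 4000, n),
    rng.integers(1, 6, n),
])
true_theta = np.array([50.0, -2.0, 0.2, 30.0])  # made-up values
sigma = 25.0                                    # made-up noise level
y = X @ true_theta + rng.normal(0.0, sigma, n)  # prices in thousands


def rss(theta):
    """Residual sum of squares: sum of (y_i - y_hat_i)^2."""
    residuals = y - X @ theta
    return float(residuals @ residuals)


def log_likelihood(theta):
    """Log of the product of normal densities with mean y_hat_i, variance sigma^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - rss(theta) / (2 * sigma**2)


# Least-squares estimate: the theta that minimizes RSS.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any move away from the least-squares solution raises RSS
# and therefore lowers the log-likelihood.
bump = np.array([0.0, 0.01, 0.0, 0.0])
print(rss(theta_ls) < rss(theta_ls + bump))
print(log_likelihood(theta_ls) > log_likelihood(theta_ls + bump))
```

Because the first term of the log-likelihood and the factor 1 / (2 sigma^2) do not depend on theta, the arg max of the log-likelihood and the arg min of RSS coincide for any fixed sigma.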
Original Description
Here is all the probability theory you need for machine learning
⭐ Playlist for this probability in machine learning series (this was the 6 / 6th video): https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
MEDIUM
⭐ Blog post on probability fundamentals in Machine Learning: https://towardsdatascience.com/probability-for-machine-learning-b4150953df09
📕 Maximum Likelihood Estimation: https://towardsdatascience.com/likelihood-probability-and-the-math-you-should-know-9bf66db5241b
CHAPTERS
0:00 Linear Regression + Machine Learning
3:44 How Random Variables fit in
12:22 Maximum Likelihood + Probability Density Functions
16:18 Math derivation (with iid assumption, notation and more)
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow