How Bayes Theorem works

Brandon Rohrer · Beginner ·📰 AI News & Updates ·9y ago

Skills: ML Maths Basics90%

Key Takeaways

The video explains how Bayes Theorem works, providing examples of Bayesian inference and its application to real-world problems, using concepts such as conditional probabilities, joint probabilities, and marginal probabilities.

Full Transcript

beigan inference is a way to make guesses about what your data mean based on sometimes very little data the way it works is tricky but it's not magic it's definitely something that you can wrap your head around and it's not impossible to do so my goal is that by the time we're done talking you'll have a pretty crisp picture of how it works beian inference is just guessing in the style of Thomas baze who was a non-conformist presbyterian Minister he wrote a couple of books one about religion and one about probability so a beian inference is just guessing in the style of bay so to illustrate it imagine that you're at the movies and someone drops a ticket you pick it up and you can see them from behind you know they have long hair but you don't know whether they're a man or a woman so you have to make a guess based on what you know about the attendees at your movie theater you might say excuse me ma'am is this your ticket now imagine instead that this person is standing in line for the men's restroom knowing this extra piece of information you might make a different guess ban inference is a way to capture this Common Sense knowledge about the situation and help you to make better guesses so to put numbers to this dilemma at the movie theater let's assume out of 100 women at the movies 50 have short hair 50 have long and out of 100 men at the movies 96 have short hair and four have long in this case we can see that there are definitely more women with long hair than men with long hair so it's a safe bet to assume this person is a woman now we just made a subtle assumption that there are about the same number of men and women at the movies this assumption no longer holds when we move to the men's restroom line here let's say there are uh two women out of every hundred people and 98 men maybe uh women keeping their Partners company there's still one with short hair and one with long hair still half and half long and short hair but now there are four times as many men with long hair than women with long hair in this group now the safe money is to bet that this person is a man so to draw this a little differently out of a 100 people at the movies overall will make this assumption explicit that 50 of them are women 50 of them are men so this is how the different categories break down in the line for the men's restroom then they break down a little differently so to translate this to math the probability that a person person is a woman is the total number of women divided by the total number of people 50% similarly for men moving to the men's restroom line the probability that someone as a woman is 2% 98% for men now Bay's theorem is a little bit tricky so to be very precise we're going to have to talk math so if you bear with me for just three probability Concepts we'll let L the foundation for presenting Bas theem the first one is conditional probabilities if I know that a person is a woman that's the condition what's the probability that that person has long hair so it's written as probability of long hair given that a person's a woman so to get this we just divide the number of women with long hair by the total number of women 50% and this doesn't change whether there's 50 women in uh in their group or two women in the group still if we know that a person is a woman the probability that they have long hair is 50% we can do the same thing with men probability that someone has long hair given that they're a man is 4% so conditional probabilities if I know that b is the case what's the probability that a is also the case this is not you can't reverse B and a and have this be true so for instance if I know that the thing I'm holding is a puppy what's the probability that it's cute the probability is very high if I know that the thing I'm holding is cute what's the probability that it's a puppy well it might be a puppy might be a kitten it might be a hedgehog it might be a small human there's lots of things that it could be so the probability there is less moderate so these things are not interchangeable in conditional probabilities Now concept two joint probabilities so what's the probability that a person is both a woman and has short hair uh so to calculate a joint probability you first find their conditional probability well if I know that they're a woman what's the probability that they have short hair and then you multiply that by the probability that there are a woman so in this case .5 * 5 we get a 025 which is exactly what we knew it was going to be and the same is true for all of our different conditions so we can do this uh for the men's restroom too the probability that someone is a man and has long hair 4% someone is a woman and has long hair 1% joint probabilities are different than conditional probabilities here the probability that A and B is the case is the same that the probability that b and a is the case so the probability that I'm having a jelly donut with my milk is the same as the probability that I'm having a milk with my jelly donut these two conditions these two situations are identical so we can swap the order and finally concept three marginal probabilities if I wanted to say figure out the probability that someone has long hair I just add up all of the different ways that someone can have long hair they can be a woman with long hair or a man with long hair in the men's restroom line that's a 1% probability plus a 4% probability or a 5% probability overall you can do the same thing for short hair 95% now this last concept finishes our foundation we can get to what we really care about we know that this person has long hair what's the probability that they are a man or a woman this is a conditional probability but it's the reverse of the one that we know and we don't know how to answer this yet so this is where Thomas baz comes in he noticed something really cool you can calculate the joint probability that someone is a man and has long hair using the formula we saw before and you can also calculate the joint probability that someone has long hair and is a man now these are different formulas but remember joint probabilities don't care about the order so these two things are equal which means the stuff that they're equal to on the other side are also equal to each other and we can do a little algebraic slide of hand and now we have a formula for what we want to know if someone has long hair what's the probability that they're a man and we have this expression to the right we can uh genericize that with A's and B's and now we have Bay theorem one of the top 10 most popular math tattoos of all time so using this formula we can go back to the movie theater and plug in what we know we know that the probability that someone is a man we know the probability that if they're a man they have long hair and we know the conditional probability or sorry the um marginal probability that someone has long hair which is just the probability that someone's a woman with long hair plus probability that someone's a man with long hair and we plug all that in and we can say if someone has long hair at the movie theaters there is a 7% chance that they are a man similarly 93% chance that they are a woman now if they're in line for the men's restroom because some of those probabilities change that conditional probability changes someone's in line for the men's restroom and has long hair there's an 80% chance that they are a man and this is consistent with what we saw before we know that there are four men and one woman for every hundred people in a line for the men's restroom that have long hair so four out of five long haired people are men 80% it all adds up so this example shows uh the mechanics of how to get Bas theorem and how it works in practice it's usually used a little differently so to show this we'll have to do a little bit of a detour and first talk about probability distributions you could think of probability like a pot with just one cup of coffee in it you can fill up if you just have one cup to fill up you can fill it all the way to the top but if you have more than one cup you have to share it around or distribute it and you can share it in any proportion you want so for instance if we're representing the number of men and women at the movies we could share it 50/50 but it'll always add up to 100% we could even share it further into different categories so here we see the joint probabilities of all of our four different categories that we were working with and you can see that this is just another representation of the uh distribution representation that we were looking at before now usually when we look at this they're side by side uh probability instead of percentage and uh shown in a histogram like this it can be really helpful to think of these as beliefs so for instance if I flip a coin and hide the result from you you might half believe its heads and half believe its tails until I tell you what it is if I roll a die and hide the result from you you may believe about one sixth that it's a one or a two or a three or four or five or six until I show you the result so these are what you believe probabilities can represent what you believe about something before you measure it similarly for Powerball tickets and even for things that are more complicated like let's say the height of adults in the United States in centimeters you might believe that there's a very small chance that they'll be over 210 cm and a smallish chance that they're less than 150 cm and then assign various amounts of this belief to all of the ranges in between it still adds up to one it's like all a bunch of cups of coffee lined up in a row and you put a little bit in each one the cups in the middle have more um and it shows how your belief about some individual is distributed before you've actually measured them now you can take these bins and chop them more and more finely again and again and if you keep doing this you can get to something that's actually perfectly smooth so it's as if you had uh an infinite number of very tiny cups and you put a tiny bit infinite decimal amount of coffee in each one at this point it's probably no longer helpful to think of it in that terms but just thinking of it as a continuous distribution showing for all these Heights where am I placing my bets what do I believe and how much so once you have your beliefs you can use it to answer questions about typical Heights the average the median value the most common value or the mode now we'll use this in weighing my dog um I have a shih tzu named rain of terror um she's a little mischievous and when we go to the veterinarian rain squirms on the scale so every time we weigh her we get a different weight this last time we got 13.9 lbs 17 1/2 lb and 14.1 lb what we want to know is how much rain weighs and this will be the basis for a decision this is important if her weight has gone up her food intake will have to go down and for her this is a matter of life and death so we don't want to make the wrong assumption and draw the wrong conclusion so if you've ever taken a statistics class before you know you can take these measurements add them up get the average 15 2 lb you can calculate the standard deviation of these three measurements and also the standard error and come up with a 1.16 lb standard error which when you show it graphically this red curve now shows the belief that results from those three measurements the distribution the peak of that Hill is at 15.2 lbs and one standard deviation on that curve is our standard error of 1.2 lb so you can see looking at this that yes it's most likely that she's 15.2 lb but there's a lot of that curve that sits outside of the range of 14 to 16 so yeah she's probably between 14 and 16 lb most likely between 13 and 17 lb but she might even be lower than 12 and higher than 18 that is a we really why range and it's not a great basis for making a decision now you can see the three measurements there those three white vertical lines and you can see why our belief is so distri uh dispersed because those three measurements are pretty dispersed it's hard to capture all that in one distribution so let's try it again with Bay theorem so the way we'll do this is instead of A and B we'll substitute in W for her actual weight and M for the measurements that we took now this term over here the probability distribution of the actual weight is our prior this is what we believe about her weight before we put her on the scale the probability given a weight of getting certain measurements are the likelihood associated with those measurements and then the posterior is what we believe about her weight given those measurements so you can think of this as we start with a belief we take some measurements and we update it and then we have a new belief when we're done this term on the bottom we're going to ignore for the most part it'll be a constant but it's called the marginal likelihood so we're going to start by not assuming anything about her weight could be one PB 10 lb 20b 100 lb we're going to let this be uniform and we're going to let the data speak so now our formula looks like this we can further simplify it and so we want to calculate this we want to calculate the probability of our measurements occurring given a we and we want to do this for all of the possible weights and then we'll end up with a new distribution which is our belief what's the probability of each of those weights occurring given the measurements so these two things are identical so let's start for instance by assuming what if she weighed 17 lbs in reality now because our measurement process is very noisy as we saw if she weighed 17 lb we would expect those measurements to be distributed as shown here some would be up way above 18 lb some would be down around 14 lb where we actually measured but not very many of them would be so to calculate now the probability of our measurements occurring given this weight we find what the probability of each individual weight is of occurring and we multiply that times that times that now these two are pretty small when you multiply two small things together they make something very small so the probability of her being at 17 pounds is is pretty small we shift our belief over and say well what if she was 162 lb what if she was 16 lb and we recalculate it each time multiplying all of those actual probabilities together and then by the time we're done this is what we've measured at each of those weights this is the likelihood of each of those occurring and you can see that the maximum likelihood occurs at 15.2 lb um and this is commonly called the maximum likelihood estimate where you start with a uniform assumptions on your weight um and it just so happens that the standard error on this is exactly what we calculated before a very cool thing connection here when you take the average and calculate standard deviation and standard error it gives you the likelihood that you would get by doing Bay method and assuming a uniform prior not assuming anything about what the results going to be so we've already established though that that's a really broad result and not helpful so we need to start over now and let's start with what we know so some background information rain was 14.2 lb the last time we went into the vet and she doesn't seem noticeably more heavy to me my arm is not that well calibrated but let's I'm going to assume that she's within about a pound of where she was before so I take that prior and this is the form that it takes a normal distribution centered on 14.2 lb and you can see that most of that bulk is within plus or minus a pound and it extends a little bit further out I allow for the possibility that she's actually gained a lot or or lost a lot of weight but probably she's pretty close this is what I believe before I even put her on the scale this is the probability the prior the probability of her being a given weight so this time we're not neglecting the prior term we're not setting it constant we're going to use it and the way this plays out now is we assume okay what if she were 17 lb like well we need to multiply that now by the probability of our prior showing that she's 17 lb which actually makes that quite small now we calculate and multiply the three probabilities of our measurements occurring so now we have something small times something very small times something very small so we get a very small result uh probability that she will that she actually weighs 17 lb and now we repeat this process at 16 1 12 lb and 16 lb and 152 lb and 15 lb all the way through and then by the time we're done we tally up all of those and we get this new posterior distribution um it's normally distributed at about 14.1 lbs and it has a standard error of less than a pound you'll notice it's even narrower than our original uh prior so we've taken our original belief and we've been able to sharpen it up just a bit and so incidentally the peak of this curve is called the maximum of posteriori result if we had to choose one value to represent our belief that's not a bad one to choose and now we compare this with our original estimate it's labeled non beian here but more accurately it could be Bean with a uniform prior you can see that it is much broader and also the peak of that curve is in an entirely different place so the answer that we got it's more confident because it's more centered and it's probably based on what we know closer to being correct so this is how Bay theorem is used most often in data science or in analysis it's a prior that you then update based on your measurements to sharpen up and um get get a revised set of beliefs so there's a lot of times that it makes sense to use beigan inference um sometimes we just know things like if we're measuring age we know that everyone is more than Z years old and so we can take that information and build it in and we can get sharper estimates with fewer measurements now so it should if you're paying attention make you a little bit nervous um we humans are actually not always aware of what we believe and putting it into a mathematical distribution can be very tricky more importantly the reason we're measuring something is because we want to learn about it we want to be able to be surprised by our data so if we believe something that's not true it can make it hard or impossible to learn from our data I like how Mark Twain phrased this he says it ain't what you don't know that gets you into trouble it's what you know for sure that just ain't so so the way to avoid this pit fall is to always believe things that we think are impossible at least just a little bit so by leaving this room for something to be impossible we can do like uh Sherlock Holmes says and once you've excluded The Impossible then whatever remains however improbable must be the truth we don't want to exclude the improbable out of hand because then we're left with nothing Alice in a conversation with the Red Queen summed it up well too there's no use in trying she said one can't believe impossible things I dare say you haven't had much practice said the queen when I was younger I always did it for half an hour a day why sometimes I've believed as many as six impossible things before breakfast so the secret to us using beian inference well is to keep believing impossible things thanks for your attention here's how you can get in touch with me if you'd like to carry on the conversation I look forward to talking with you again soon

Original Description

Part of the End-to-End Machine Learning School Course 191, Selected Models and Methods at https://e2eml.school/191 A walk through a couple of Bayesian inference examples. The blog: http://brohrer.github.io/how_bayesian_inference_works.html The slides: https://docs.google.com/presentation/d/1325yenZP_VdHoVj-tU0AnbQUxFwb8Fl1VdyAAUxEzfg/edit?usp=sharing Follow me for announcements: https://twitter.com/_brohrer_

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Brandon Rohrer · Brandon Rohrer · 23 of 60

← Previous Next →

Robot Learning with a Biologically-Inspired Brain (BECCA)

Robot Learning with a Biologically-Inspired Brain (BECCA)

BECCA talk at AGI 2011

BECCA talk at AGI 2011

Robot Learning with a Biologically-Inspired Brain (BECCA), The Sequel

Robot Learning with a Biologically-Inspired Brain (BECCA), The Sequel

BECCA listens to The Hobbit

BECCA listens to The Hobbit

Learning the building blocks of speech: BECCA extracts a hierarchy of audio features

Learning the building blocks of speech: BECCA extracts a hierarchy of audio features

BECCA listens for sound effects in The Hobbit

BECCA listens for sound effects in The Hobbit

BECCA finds movie trailers while watching the Big Bang Theory

BECCA finds movie trailers while watching the Big Bang Theory

Listening for unexpected sounds: BECCA detects anomalies in audio data

Listening for unexpected sounds: BECCA detects anomalies in audio data

Learning the building blocks of vision: BECCA extracts a spatio-temporal hierarchy of features

Learning the building blocks of vision: BECCA extracts a spatio-temporal hierarchy of features

Watching for the unexpected: BECCA detects anomalies in video data

Watching for the unexpected: BECCA detects anomalies in video data

BECCA finds a stationary target

BECCA finds a stationary target

BECCA finds a stationary target at 3X speed

BECCA finds a stationary target at 3X speed

BECCA watches the X-men and Bruce Lee

BECCA watches the X-men and Bruce Lee

BECCA plays Quidditch

BECCA plays Quidditch

BECCA chases a ball

BECCA chases a ball

BECCA chases a ball, part 2

BECCA chases a ball, part 2

Becca chases a ball, part 3

Becca chases a ball, part 3

BECCA creates features from MNIST

BECCA creates features from MNIST

How reinforcement learning works in Becca 7

How reinforcement learning works in Becca 7

Deep Learning Demystified

Deep Learning Demystified

How Data Science Works

How Data Science Works

How Convolutional Neural Networks work

How Convolutional Neural Networks work

How Bayes Theorem works

How Bayes Theorem works

How Deep Neural Networks Work

How Deep Neural Networks Work

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

How Support Vector Machines work / How to open a black box

How Support Vector Machines work / How to open a black box

How autocorrelation works

How autocorrelation works

Getting closer to human intelligence through robotics

Getting closer to human intelligence through robotics

A minimalist's guide to slicing and indexing pandas DataFrames

A minimalist's guide to slicing and indexing pandas DataFrames

How decision trees work

How decision trees work

Data scientist archetypes

Data scientist archetypes

How to use python's datetime package

How to use python's datetime package

How optimization for machine learning works, part 1

How optimization for machine learning works, part 1

How optimization for machine learning works, part 2

How optimization for machine learning works, part 2

How optimization for machine learning works, part 3

How optimization for machine learning works, part 3

How optimization for machine learning works, part 4

How optimization for machine learning works, part 4

How convolutional neural networks work, in depth

How convolutional neural networks work, in depth

How to pick a machine learning model 4: Splitting the data

How to pick a machine learning model 4: Splitting the data

How to pick a machine learning model 3: Choosing a loss function

How to pick a machine learning model 3: Choosing a loss function

How to pick a machine learning model 2: Separating signal from noise

How to pick a machine learning model 2: Separating signal from noise

How to pick a machine learning model 1: Choosing between models

How to pick a machine learning model 1: Choosing between models

How to pick a machine learning model 5: Navigating assumptions

How to pick a machine learning model 5: Navigating assumptions

What do neural networks learn?

What do neural networks learn?

Interview with iRobot's Director of Data Science Angela Bassa

Interview with iRobot's Director of Data Science Angela Bassa

How Backpropagation Works

How Backpropagation Works

Evolutionary Powell's method: A discrete optimizer for hyperparameter optimization

Evolutionary Powell's method: A discrete optimizer for hyperparameter optimization

1D convolution for neural networks, part 1: Sliding dot product

1D convolution for neural networks, part 1: Sliding dot product

1D convolution for neural networks, part 2: Convolution copies the kernel

1D convolution for neural networks, part 2: Convolution copies the kernel

1D convolution for neural networks, part 3: Sliding dot product equations longhand

1D convolution for neural networks, part 3: Sliding dot product equations longhand

1D convolution for neural networks, part 4: Convolution equation

1D convolution for neural networks, part 4: Convolution equation

1D convolution for neural networks, part 5: Backpropagation

1D convolution for neural networks, part 5: Backpropagation

1D convolution for neural networks, part 6: Input gradient

1D convolution for neural networks, part 6: Input gradient

1D convolution for neural networks, part 7: Weight gradient

1D convolution for neural networks, part 7: Weight gradient

1D convolution for neural networks, part 8: Padding

1D convolution for neural networks, part 8: Padding

1D convolution for neural networks, part 9: Stride

1D convolution for neural networks, part 9: Stride

The Four Grand Challenges of Robots in the Home

The Four Grand Challenges of Robots in the Home

How Convolution Works

How Convolution Works

The Softmax neural network layer

The Softmax neural network layer

Batch normalization

Batch normalization

Getting ready to learn Python, Mac edition #1: Files and directories

Getting ready to learn Python, Mac edition #1: Files and directories

This video teaches the basics of Bayes Theorem and its application to real-world problems, providing a comprehensive understanding of Bayesian inference and its importance in machine learning.

Key Takeaways

Make a guess based on what you know about the situation
Capture Common Sense knowledge about the situation and help you to make better guesses
Calculate joint probability by multiplying conditional probability and probability of first event
Calculate marginal probability by adding up probabilities of all possible combinations of events
Use Bayes' Theorem to calculate probability of an event given another event and conditional probability of first event given second event
Update the prior belief with new measurements to obtain a posterior distribution
Use a uniform prior distribution and calculate the likelihood of each measurement given a certain weight

💡 Believing things that are not true can make it hard or impossible to learn from data, and the way to avoid this pitfall is to always believe things that we think are impossible at least just a little bit

🔒 Pro feature: Ask AI to explain this lesson →

More on: ML Maths Basics

View skill →

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

ChethanAIChronicles

“Hello, world” from scratch on a 6502 — Part 1

“Hello, world” from scratch on a 6502 — Part 1

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

ROC and AUC in R

ROC and AUC in R

StatQuest with Josh Starmer

Data Science Fundamentals: Data Cleaning in Python

Data Science Fundamentals: Data Cleaning in Python

Related AI Lessons

[PoV] When Everyone Is Smart, No One Is

In a world where AI makes everyone smart, the value of intelligence decreases, and new challenges arise

The Honeymoon Is Over: AI Music Has Entered Its Institutional Era

AI music has transitioned from proving its functionality to proving its value and deservingness of existence

Critical thinking in the AI Era

Develop critical thinking skills to navigate the AI era effectively and make informed decisions

Medium · Data Science

Anthropic Just Passed OpenAI Among Business Users. Here’s What That Means for Your Stack.

Anthropic surpasses OpenAI in business user adoption, impacting the AI stack for enterprises

Channels Television