How Bayes Theorem works

Brandon Rohrer · Beginner ·📰 AI News & Updates ·9y ago

Key Takeaways

The video explains how Bayes Theorem works, providing examples of Bayesian inference and its application to real-world problems, using concepts such as conditional probabilities, joint probabilities, and marginal probabilities.

Full Transcript

beigan inference is a way to make guesses about what your data mean based on sometimes very little data the way it works is tricky but it's not magic it's definitely something that you can wrap your head around and it's not impossible to do so my goal is that by the time we're done talking you'll have a pretty crisp picture of how it works beian inference is just guessing in the style of Thomas baze who was a non-conformist presbyterian Minister he wrote a couple of books one about religion and one about probability so a beian inference is just guessing in the style of bay so to illustrate it imagine that you're at the movies and someone drops a ticket you pick it up and you can see them from behind you know they have long hair but you don't know whether they're a man or a woman so you have to make a guess based on what you know about the attendees at your movie theater you might say excuse me ma'am is this your ticket now imagine instead that this person is standing in line for the men's restroom knowing this extra piece of information you might make a different guess ban inference is a way to capture this Common Sense knowledge about the situation and help you to make better guesses so to put numbers to this dilemma at the movie theater let's assume out of 100 women at the movies 50 have short hair 50 have long and out of 100 men at the movies 96 have short hair and four have long in this case we can see that there are definitely more women with long hair than men with long hair so it's a safe bet to assume this person is a woman now we just made a subtle assumption that there are about the same number of men and women at the movies this assumption no longer holds when we move to the men's restroom line here let's say there are uh two women out of every hundred people and 98 men maybe uh women keeping their Partners company there's still one with short hair and one with long hair still half and half long and short hair but now there are four times as many men with long hair than women with long hair in this group now the safe money is to bet that this person is a man so to draw this a little differently out of a 100 people at the movies overall will make this assumption explicit that 50 of them are women 50 of them are men so this is how the different categories break down in the line for the men's restroom then they break down a little differently so to translate this to math the probability that a person person is a woman is the total number of women divided by the total number of people 50% similarly for men moving to the men's restroom line the probability that someone as a woman is 2% 98% for men now Bay's theorem is a little bit tricky so to be very precise we're going to have to talk math so if you bear with me for just three probability Concepts we'll let L the foundation for presenting Bas theem the first one is conditional probabilities if I know that a person is a woman that's the condition what's the probability that that person has long hair so it's written as probability of long hair given that a person's a woman so to get this we just divide the number of women with long hair by the total number of women 50% and this doesn't change whether there's 50 women in uh in their group or two women in the group still if we know that a person is a woman the probability that they have long hair is 50% we can do the same thing with men probability that someone has long hair given that they're a man is 4% so conditional probabilities if I know that b is the case what's the probability that a is also the case this is not you can't reverse B and a and have this be true so for instance if I know that the thing I'm holding is a puppy what's the probability that it's cute the probability is very high if I know that the thing I'm holding is cute what's the probability that it's a puppy well it might be a puppy might be a kitten it might be a hedgehog it might be a small human there's lots of things that it could be so the probability there is less moderate so these things are not interchangeable in conditional probabilities Now concept two joint probabilities so what's the probability that a person is both a woman and has short hair uh so to calculate a joint probability you first find their conditional probability well if I know that they're a woman what's the probability that they have short hair and then you multiply that by the probability that there are a woman so in this case .5 * 5 we get a 025 which is exactly what we knew it was going to be and the same is true for all of our different conditions so we can do this uh for the men's restroom too the probability that someone is a man and has long hair 4% someone is a woman and has long hair 1% joint probabilities are different than conditional probabilities here the probability that A and B is the case is the same that the probability that b and a is the case so the probability that I'm having a jelly donut with my milk is the same as the probability that I'm having a milk with my jelly donut these two conditions these two situations are identical so we can swap the order and finally concept three marginal probabilities if I wanted to say figure out the probability that someone has long hair I just add up all of the different ways that someone can have long hair they can be a woman with long hair or a man with long hair in the men's restroom line that's a 1% probability plus a 4% probability or a 5% probability overall you can do the same thing for short hair 95% now this last concept finishes our foundation we can get to what we really care about we know that this person has long hair what's the probability that they are a man or a woman this is a conditional probability but it's the reverse of the one that we know and we don't know how to answer this yet so this is where Thomas baz comes in he noticed something really cool you can calculate the joint probability that someone is a man and has long hair using the formula we saw before and you can also calculate the joint probability that someone has long hair and is a man now these are different formulas but remember joint probabilities don't care about the order so these two things are equal which means the stuff that they're equal to on the other side are also equal to each other and we can do a little algebraic slide of hand and now we have a formula for what we want to know if someone has long hair what's the probability that they're a man and we have this expression to the right we can uh genericize that with A's and B's and now we have Bay theorem one of the top 10 most popular math tattoos of all time so using this formula we can go back to the movie theater and plug in what we know we know that the probability that someone is a man we know the probability that if they're a man they have long hair and we know the conditional probability or sorry the um marginal probability that someone has long hair which is just the probability that someone's a woman with long hair plus probability that someone's a man with long hair and we plug all that in and we can say if someone has long hair at the movie theaters there is a 7% chance that they are a man similarly 93% chance that they are a woman now if they're in line for the men's restroom because some of those probabilities change that conditional probability changes someone's in line for the men's restroom and has long hair there's an 80% chance that they are a man and this is consistent with what we saw before we know that there are four men and one woman for every hundred people in a line for the men's restroom that have long hair so four out of five long haired people are men 80% it all adds up so this example shows uh the mechanics of how to get Bas theorem and how it works in practice it's usually used a little differently so to show this we'll have to do a little bit of a detour and first talk about probability distributions you could think of probability like a pot with just one cup of coffee in it you can fill up if you just have one cup to fill up you can fill it all the way to the top but if you have more than one cup you have to share it around or distribute it and you can share it in any proportion you want so for instance if we're representing the number of men and women at the movies we could share it 50/50 but it'll always add up to 100% we could even share it further into different categories so here we see the joint probabilities of all of our four different categories that we were working with and you can see that this is just another representation of the uh distribution representation that we were looking at before now usually when we look at this they're side by side uh probability instead of percentage and uh shown in a histogram like this it can be really helpful to think of these as beliefs so for instance if I flip a coin and hide the result from you you might half believe its heads and half believe its tails until I tell you what it is if I roll a die and hide the result from you you may believe about one sixth that it's a one or a two or a three or four or five or six until I show you the result so these are what you believe probabilities can represent what you believe about something before you measure it similarly for Powerball tickets and even for things that are more complicated like let's say the height of adults in the United States in centimeters you might believe that there's a very small chance that they'll be over 210 cm and a smallish chance that they're less than 150 cm and then assign various amounts of this belief to all of the ranges in between it still adds up to one it's like all a bunch of cups of coffee lined up in a row and you put a little bit in each one the cups in the middle have more um and it shows how your belief about some individual is distributed before you've actually measured them now you can take these bins and chop them more and more finely again and again and if you keep doing this you can get to something that's actually perfectly smooth so it's as if you had uh an infinite number of very tiny cups and you put a tiny bit infinite decimal amount of coffee in each one at this point it's probably no longer helpful to think of it in that terms but just thinking of it as a continuous distribution showing for all these Heights where am I placing my bets what do I believe and how much so once you have your beliefs you can use it to answer questions about typical Heights the average the median value the most common value or the mode now we'll use this in weighing my dog um I have a shih tzu named rain of terror um she's a little mischievous and when we go to the veterinarian rain squirms on the scale so every time we weigh her we get a different weight this last time we got 13.9 lbs 17 1/2 lb and 14.1 lb what we want to know is how much rain weighs and this will be the basis for a decision this is important if her weight has gone up her food intake will have to go down and for her this is a matter of life and death so we don't want to make the wrong assumption and draw the wrong conclusion so if you've ever taken a statistics class before you know you can take these measurements add them up get the average 15 2 lb you can calculate the standard deviation of these three measurements and also the standard error and come up with a 1.16 lb standard error which when you show it graphically this red curve now shows the belief that results from those three measurements the distribution the peak of that Hill is at 15.2 lbs and one standard deviation on that curve is our standard error of 1.2 lb so you can see looking at this that yes it's most likely that she's 15.2 lb but there's a lot of that curve that sits outside of the range of 14 to 16 so yeah she's probably between 14 and 16 lb most likely between 13 and 17 lb but she might even be lower than 12 and higher than 18 that is a we really why range and it's not a great basis for making a decision now you can see the three measurements there those three white vertical lines and you can see why our belief is so distri uh dispersed because those three measurements are pretty dispersed it's hard to capture all that in one distribution so let's try it again with Bay theorem so the way we'll do this is instead of A and B we'll substitute in W for her actual weight and M for the measurements that we took now this term over here the probability distribution of the actual weight is our prior this is what we believe about her weight before we put her on the scale the probability given a weight of getting certain measurements are the likelihood associated with those measurements and then the posterior is what we believe about her weight given those measurements so you can think of this as we start with a belief we take some measurements and we update it and then we have a new belief when we're done this term on the bottom we're going to ignore for the most part it'll be a constant but it's called the marginal likelihood so we're going to start by not assuming anything about her weight could be one PB 10 lb 20b 100 lb we're going to let this be uniform and we're going to let the data speak so now our formula looks like this we can further simplify it and so we want to calculate this we want to calculate the probability of our measurements occurring given a we and we want to do this for all of the possible weights and then we'll end up with a new distribution which is our belief what's the probability of each of those weights occurring given the measurements so these two things are identical so let's start for instance by assuming what if she weighed 17 lbs in reality now because our measurement process is very noisy as we saw if she weighed 17 lb we would expect those measurements to be distributed as shown here some would be up way above 18 lb some would be down around 14 lb where we actually measured but not very many of them would be so to calculate now the probability of our measurements occurring given this weight we find what the probability of each individual weight is of occurring and we multiply that times that times that now these two are pretty small when you multiply two small things together they make something very small so the probability of her being at 17 pounds is is pretty small we shift our belief over and say well what if she was 162 lb what if she was 16 lb and we recalculate it each time multiplying all of those actual probabilities together and then by the time we're done this is what we've measured at each of those weights this is the likelihood of each of those occurring and you can see that the maximum likelihood occurs at 15.2 lb um and this is commonly called the maximum likelihood estimate where you start with a uniform assumptions on your weight um and it just so happens that the standard error on this is exactly what we calculated before a very cool thing connection here when you take the average and calculate standard deviation and standard error it gives you the likelihood that you would get by doing Bay method and assuming a uniform prior not assuming anything about what the results going to be so we've already established though that that's a really broad result and not helpful so we need to start over now and let's start with what we know so some background information rain was 14.2 lb the last time we went into the vet and she doesn't seem noticeably more heavy to me my arm is not that well calibrated but let's I'm going to assume that she's within about a pound of where she was before so I take that prior and this is the form that it takes a normal distribution centered on 14.2 lb and you can see that most of that bulk is within plus or minus a pound and it extends a little bit further out I allow for the possibility that she's actually gained a lot or or lost a lot of weight but probably she's pretty close this is what I believe before I even put her on the scale this is the probability the prior the probability of her being a given weight so this time we're not neglecting the prior term we're not setting it constant we're going to use it and the way this plays out now is we assume okay what if she were 17 lb like well we need to multiply that now by the probability of our prior showing that she's 17 lb which actually makes that quite small now we calculate and multiply the three probabilities of our measurements occurring so now we have something small times something very small times something very small so we get a very small result uh probability that she will that she actually weighs 17 lb and now we repeat this process at 16 1 12 lb and 16 lb and 152 lb and 15 lb all the way through and then by the time we're done we tally up all of those and we get this new posterior distribution um it's normally distributed at about 14.1 lbs and it has a standard error of less than a pound you'll notice it's even narrower than our original uh prior so we've taken our original belief and we've been able to sharpen it up just a bit and so incidentally the peak of this curve is called the maximum of posteriori result if we had to choose one value to represent our belief that's not a bad one to choose and now we compare this with our original estimate it's labeled non beian here but more accurately it could be Bean with a uniform prior you can see that it is much broader and also the peak of that curve is in an entirely different place so the answer that we got it's more confident because it's more centered and it's probably based on what we know closer to being correct so this is how Bay theorem is used most often in data science or in analysis it's a prior that you then update based on your measurements to sharpen up and um get get a revised set of beliefs so there's a lot of times that it makes sense to use beigan inference um sometimes we just know things like if we're measuring age we know that everyone is more than Z years old and so we can take that information and build it in and we can get sharper estimates with fewer measurements now so it should if you're paying attention make you a little bit nervous um we humans are actually not always aware of what we believe and putting it into a mathematical distribution can be very tricky more importantly the reason we're measuring something is because we want to learn about it we want to be able to be surprised by our data so if we believe something that's not true it can make it hard or impossible to learn from our data I like how Mark Twain phrased this he says it ain't what you don't know that gets you into trouble it's what you know for sure that just ain't so so the way to avoid this pit fall is to always believe things that we think are impossible at least just a little bit so by leaving this room for something to be impossible we can do like uh Sherlock Holmes says and once you've excluded The Impossible then whatever remains however improbable must be the truth we don't want to exclude the improbable out of hand because then we're left with nothing Alice in a conversation with the Red Queen summed it up well too there's no use in trying she said one can't believe impossible things I dare say you haven't had much practice said the queen when I was younger I always did it for half an hour a day why sometimes I've believed as many as six impossible things before breakfast so the secret to us using beian inference well is to keep believing impossible things thanks for your attention here's how you can get in touch with me if you'd like to carry on the conversation I look forward to talking with you again soon

Original Description

Part of the End-to-End Machine Learning School Course 191, Selected Models and Methods at https://e2eml.school/191 A walk through a couple of Bayesian inference examples. The blog: http://brohrer.github.io/how_bayesian_inference_works.html The slides: https://docs.google.com/presentation/d/1325yenZP_VdHoVj-tU0AnbQUxFwb8Fl1VdyAAUxEzfg/edit?usp=sharing Follow me for announcements: https://twitter.com/_brohrer_
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Brandon Rohrer · Brandon Rohrer · 23 of 60

1 Robot Learning with a Biologically-Inspired Brain (BECCA)
Robot Learning with a Biologically-Inspired Brain (BECCA)
Brandon Rohrer
2 BECCA talk at AGI 2011
BECCA talk at AGI 2011
Brandon Rohrer
3 Robot Learning with a Biologically-Inspired Brain (BECCA), The Sequel
Robot Learning with a Biologically-Inspired Brain (BECCA), The Sequel
Brandon Rohrer
4 BECCA listens to The Hobbit
BECCA listens to The Hobbit
Brandon Rohrer
5 Learning the building blocks of speech: BECCA extracts a hierarchy of audio features
Learning the building blocks of speech: BECCA extracts a hierarchy of audio features
Brandon Rohrer
6 BECCA listens for sound effects in The Hobbit
BECCA listens for sound effects in The Hobbit
Brandon Rohrer
7 BECCA finds movie trailers while watching the Big Bang Theory
BECCA finds movie trailers while watching the Big Bang Theory
Brandon Rohrer
8 Listening for unexpected sounds: BECCA detects anomalies in audio data
Listening for unexpected sounds: BECCA detects anomalies in audio data
Brandon Rohrer
9 Learning the building blocks of vision: BECCA extracts a spatio-temporal hierarchy of features
Learning the building blocks of vision: BECCA extracts a spatio-temporal hierarchy of features
Brandon Rohrer
10 Watching for the unexpected: BECCA detects anomalies in video data
Watching for the unexpected: BECCA detects anomalies in video data
Brandon Rohrer
11 BECCA finds a stationary target
BECCA finds a stationary target
Brandon Rohrer
12 BECCA finds a stationary target at 3X speed
BECCA finds a stationary target at 3X speed
Brandon Rohrer
13 BECCA watches the X-men and Bruce Lee
BECCA watches the X-men and Bruce Lee
Brandon Rohrer
14 BECCA plays Quidditch
BECCA plays Quidditch
Brandon Rohrer
15 BECCA chases a ball
BECCA chases a ball
Brandon Rohrer
16 BECCA chases a ball, part 2
BECCA chases a ball, part 2
Brandon Rohrer
17 Becca chases a ball, part 3
Becca chases a ball, part 3
Brandon Rohrer
18 BECCA creates features from MNIST
BECCA creates features from MNIST
Brandon Rohrer
19 How reinforcement learning works in Becca 7
How reinforcement learning works in Becca 7
Brandon Rohrer
20 Deep Learning Demystified
Deep Learning Demystified
Brandon Rohrer
21 How Data Science Works
How Data Science Works
Brandon Rohrer
22 How Convolutional Neural Networks work
How Convolutional Neural Networks work
Brandon Rohrer
How Bayes Theorem works
How Bayes Theorem works
Brandon Rohrer
24 How Deep Neural Networks Work
How Deep Neural Networks Work
Brandon Rohrer
25 Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)
Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)
Brandon Rohrer
26 How Support Vector Machines work / How to open a black box
How Support Vector Machines work / How to open a black box
Brandon Rohrer
27 How autocorrelation works
How autocorrelation works
Brandon Rohrer
28 Getting closer to human intelligence through robotics
Getting closer to human intelligence through robotics
Brandon Rohrer
29 A minimalist's guide to slicing and indexing pandas DataFrames
A minimalist's guide to slicing and indexing pandas DataFrames
Brandon Rohrer
30 How decision trees work
How decision trees work
Brandon Rohrer
31 Data scientist archetypes
Data scientist archetypes
Brandon Rohrer
32 How to use python's datetime package
How to use python's datetime package
Brandon Rohrer
33 How optimization for machine learning works, part 1
How optimization for machine learning works, part 1
Brandon Rohrer
34 How optimization for machine learning works, part 2
How optimization for machine learning works, part 2
Brandon Rohrer
35 How optimization for machine learning works, part 3
How optimization for machine learning works, part 3
Brandon Rohrer
36 How optimization for machine learning works, part 4
How optimization for machine learning works, part 4
Brandon Rohrer
37 How convolutional neural networks work, in depth
How convolutional neural networks work, in depth
Brandon Rohrer
38 How to pick a machine learning model 4: Splitting the data
How to pick a machine learning model 4: Splitting the data
Brandon Rohrer
39 How to pick a machine learning model 3: Choosing a loss function
How to pick a machine learning model 3: Choosing a loss function
Brandon Rohrer
40 How to pick a machine learning model 2: Separating signal from noise
How to pick a machine learning model 2: Separating signal from noise
Brandon Rohrer
41 How to pick a machine learning model 1: Choosing between models
How to pick a machine learning model 1: Choosing between models
Brandon Rohrer
42 How to pick a machine learning model 5: Navigating assumptions
How to pick a machine learning model 5: Navigating assumptions
Brandon Rohrer
43 What do neural networks learn?
What do neural networks learn?
Brandon Rohrer
44 Interview with iRobot's Director of Data Science Angela Bassa
Interview with iRobot's Director of Data Science Angela Bassa
Brandon Rohrer
45 How Backpropagation Works
How Backpropagation Works
Brandon Rohrer
46 Evolutionary Powell's method: A discrete optimizer for hyperparameter optimization
Evolutionary Powell's method: A discrete optimizer for hyperparameter optimization
Brandon Rohrer
47 1D convolution for neural networks, part 1: Sliding dot product
1D convolution for neural networks, part 1: Sliding dot product
Brandon Rohrer
48 1D convolution for neural networks, part 2: Convolution copies the kernel
1D convolution for neural networks, part 2: Convolution copies the kernel
Brandon Rohrer
49 1D convolution for neural networks, part 3: Sliding dot product equations longhand
1D convolution for neural networks, part 3: Sliding dot product equations longhand
Brandon Rohrer
50 1D convolution for neural networks, part 4: Convolution equation
1D convolution for neural networks, part 4: Convolution equation
Brandon Rohrer
51 1D convolution for neural networks, part 5: Backpropagation
1D convolution for neural networks, part 5: Backpropagation
Brandon Rohrer
52 1D convolution for neural networks, part 6: Input gradient
1D convolution for neural networks, part 6: Input gradient
Brandon Rohrer
53 1D convolution for neural networks, part 7: Weight gradient
1D convolution for neural networks, part 7: Weight gradient
Brandon Rohrer
54 1D convolution for neural networks, part 8: Padding
1D convolution for neural networks, part 8: Padding
Brandon Rohrer
55 1D convolution for neural networks, part 9: Stride
1D convolution for neural networks, part 9: Stride
Brandon Rohrer
56 The Four Grand Challenges of Robots in the Home
The Four Grand Challenges of Robots in the Home
Brandon Rohrer
57 How Convolution Works
How Convolution Works
Brandon Rohrer
58 The Softmax neural network layer
The Softmax neural network layer
Brandon Rohrer
59 Batch normalization
Batch normalization
Brandon Rohrer
60 Getting ready to learn Python, Mac edition #1: Files and directories
Getting ready to learn Python, Mac edition #1: Files and directories
Brandon Rohrer

This video teaches the basics of Bayes Theorem and its application to real-world problems, providing a comprehensive understanding of Bayesian inference and its importance in machine learning.

Key Takeaways
  1. Make a guess based on what you know about the situation
  2. Capture Common Sense knowledge about the situation and help you to make better guesses
  3. Calculate joint probability by multiplying conditional probability and probability of first event
  4. Calculate marginal probability by adding up probabilities of all possible combinations of events
  5. Use Bayes' Theorem to calculate probability of an event given another event and conditional probability of first event given second event
  6. Update the prior belief with new measurements to obtain a posterior distribution
  7. Use a uniform prior distribution and calculate the likelihood of each measurement given a certain weight
💡 Believing things that are not true can make it hard or impossible to learn from data, and the way to avoid this pitfall is to always believe things that we think are impossible at least just a little bit

Related AI Lessons

Up next
News At 10
Channels Television
Watch →