Surprising Utility of Surprise: Why ML Uses Negative Log Probabilities - Charles Frye

Weights & Biases · Advanced ·📐 ML Fundamentals ·6y ago

Key Takeaways

The video discusses the concept of surprise in machine learning, its relationship to negative log probabilities, and how it is used in various machine learning algorithms and techniques, including maximum likelihood estimation, KL divergence, and the softmax function.
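
As a compact reference for the definition the talk builds up to, here is a sketch of the surprise written out as a formula. The symbols s (surprise) and p (probability) are notation chosen here, and the talk does not fix a base for the logarithm.

```latex
s(x) = \log \frac{1}{p(x)} = -\log p(x),
\qquad p(x) = 1 \;\Rightarrow\; s(x) = 0,
\qquad p(x) \to 0 \;\Rightarrow\; s(x) \to \infty

% For independent outcomes, probabilities multiply and surprises add:
p(x, y) = p(x)\, p(y) \;\Longrightarrow\; s(x, y) = s(x) + s(y)
```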

Full Transcript

Okay, this is Charles here. Can people hear me? Great. Okay, so it looks like Lavanya's had a little bit of technical trouble, so I'm just going to go ahead and jump into my talk. If you have other questions for Chari, make sure to put those in the Slack community.

What I'm going to talk about today is something less practical and more connected to the mathematics of machine learning, which is why you should be thinking with negative log probabilities instead of probabilities. Machine learning is at the intersection of at least three separate areas of math, and the most fundamental ones are definitely these three: linear algebra, calculus, and probability. These are old and storied types of mathematics, so there are established ideas about how these areas of math should be explained and what they're for. For linear algebra, it's that it's algebra for solving equations with matrices instead of just numbers. For calculus, it's that it's for studying rates of change and areas under curves. And for probability, it's that it's tools for manipulating distributions.

But in the last couple of weeks I've been talking about how these traditional views aren't well suited to the problems that we have in machine learning, that linear algebra is not like [inaudible], and that there's a better way to think about calculus. In the two previous salons I talked about how linear algebra is maybe a little bit more like computer programming, in that it's the study of functions that can be represented by arrays more than it is the study of the arrays themselves. In the most recent salon I talked about how calculus is more about using those linear maps to approximate functions, at least the way we use it in machine learning, where we mostly care about gradients. So I'm going to round out this series, at least for now, by talking about a better way to understand probability: it's best to think of probability as the mathematics of surprise. If you want
something that's a little less poetic and a little bit easier to connect to other stuff, this is basically saying I'm going to take the information theory perspective on probability rather than the measure-theoretic or the statistical approach to probability. We're going to get there by thinking about what I call the surprise game, where modeling is a competition to see who can be the least surprised by the data.

I'm going to contrast that competition with the way we resolve debates in classical logic. If I claim X and you claim Y, the way we resolve our disagreement is that one of us provides a proof. So if I say two plus two is four and you say two plus two is five, one of us needs to provide a proof that what we're saying is true or that what the other person is saying is not true. Famously, the debate between "Socrates is mortal" and "Socrates is not mortal" was one of the first debates done in classical logic. Turns out Socrates is in fact mortal.

But logic like this just doesn't immediately apply to most things in machine learning, like "this collection of pixels is a photo of a dog." There's no way to take our ANDs or ORs, our XORs, or our disjunctive syllogisms and apply them. And if that seems a little bit surprising to you, I would consider some examples of pictures like these from the Twitter user @teenybiscuit. Some of these images are Chihuahuas, others are blueberry muffins, and it's actually pretty hard to tell which one is which. Some machine learning libraries, like publicly provided ones, struggle on this dataset, and it's even harder to imagine how we might write down some set of basic logic that would tell us whether this picture has a dog in it or not. There's a whole bunch of other examples, things that happen in the future: team A will beat team B; this stock's price will go up or this stock's price will go down; the number of deaths from coronavirus will be over 1 million. There are a lot of questions that we want to
ask, but we can't use logic to answer them. So instead of a debate involving proofs, let's consider a competition. Each of us writes down how surprised we would be for each possible outcome before it happens: this is how surprised I'd be if team A wins, this is how surprised I'd be if team B wins. It's important to say that it's no fair saying "nothing surprises me." There's a technical mathematical way to enforce that, but basically you have to say that some things are surprising to you. Maybe everything is the same amount of surprising, but you can't just say everything is completely unsurprising. Then whoever is least surprised by the outcome wins the debate, at least this time around. And maybe you do this over and over again on many examples of team A and team B playing each other. Hopefully you wouldn't do it on multiple examples of the world suffering from the coronavirus pandemic, but in lots of other cases you can do this over and over again.

From this setup, this competition of who can be the least surprised, you can actually derive maximum likelihood estimation and the Kullback-Leibler divergence and a bunch of other tools, which are where many of the loss functions that we use in deep learning and machine learning in general come from. Things like the cross entropy, or even the squared error or the absolute error: we can get them using this framework by just thinking of them in terms of this idea of surprise.

I have something pretty specific in mind when I talk about surprise, just like people have a specific thing in mind when they say probability, even though we also have this informal notion of chance. If something is just a little bit less probable, it happening is just a little bit more surprising, and if it's a lot less probable, it's a lot more surprising. If something is certain, it happening is not at all surprising, and if something is impossible, it happening is more surprising than
anything else. Numerically, this is saying that our surprise is a continuous function of the probability. That's the first thing: tiny changes in probability mean tiny changes in surprise. If something is certain, it being not surprising means that if something has probability 1 of happening, the surprise is 0. And if something is impossible, if its probability is 0, then its surprise is infinite. That's the only way that it can be more surprising than anything else.

The important difference between probabilities and surprises is that where probabilities multiply, surprises add. If two unrelated surprising things happen, we just add those two surprises together. This is the equivalent of the probability rule that if two things are independent, then we multiply their probabilities to get their joint probability. So we've kept the same basic rule, but now we're going to use a different mathematical operation to do it: addition instead of multiplication.

Together these define a mathematical notion of surprise. It's a function of the outcomes, just like the probability is a function of the outcomes, and we can define it in terms of the probability: it's the log of 1 over the probability. This is also known as the surprisal, and I think I titled this talk originally "surprisal." It's not a super well-known idea on its own, not something that people often feel like they need to give a name, so it's a little bit open-ended, and I think "surprise" rather than "surprisal" is the right name for it. It's kind of a silly-sounding name, surprisal.

This surprise is the thing that, when we do information theory, we take expected values or averages of to get the quantities that we're interested in, things like mutual information and entropy. So if you want to go further with this set of ideas, go ahead and check out stuff about information theory, and they're going to run with this idea. But
very few approaches to information theory actually start by defining this thing, the surprise, and then going on to define everything else. They usually start with something like the entropy. One exception is this guy Edwin Jaynes, who is one of the original Bayesians and maybe contributed more to Bayesian inference than Bayes did himself. He has this great text, Probability Theory: The Logic of Science, that starts by defining the surprise and the probability and goes from there. I'd strongly recommend that as a way to learn information theory that's very relevant to what we do in machine learning.

What we're going to focus on are things that are just about this surprise function and not about information theory quantities, so we're not going to take the expectations that they do in information theory. The first is a very simple one, which is that some folks in statistics have started to say that you should use this surprise instead of the p-value to communicate your results. This is something that's maybe going to make more sense to folks who've done some traditional statistics, folks with a medicine background or a science background where these quantities get used. Using surprise might actually make it easier to interpret and understand the outcomes of your hypothesis tests.

The p-value is what people usually use. They do all their statistics, they do all their experiments, and at the end they want to know: is this something that could have happened by chance, because of something uninteresting, the null hypothesis? What people want the p-value to mean is something like the chance that the null hypothesis is true. I am trying to demonstrate that the moon is made of cheese; the null hypothesis is that it is not, and so I'd like to be able to tell people the moon is definitely not not made of cheese. So this p-value would be small when it's unlikely that the null hypothesis is true. That's what people would like it to mean, but in fact it's actually the
other way around. Instead of being the probability of the null hypothesis given that we've gotten a positive result from our statistical test, it's the probability of a positive result on our statistical test if we assume the null hypothesis. It's basically backwards from what people actually want, but unfortunately the thing people actually want is much harder to get, so people calculate the p-value and then misinterpret it as the thing that they really wanted.

So here's one suggestion. I'm going to link to this arXiv paper that puts it out there, and it's also something that The American Statistician, a journal of the American Statistical Association, one of the major ones, has put out there: instead, we should report the logarithm of 1 over the p-value. What that says is, I'm trying to communicate to you how surprising my results are to somebody who is a skeptic: how surprising are my results if you believe this null model, the model in the null hypothesis? They go in opposite directions, right? A large S is the kind of thing that you would get if you have a publishable result, whereas a low p is what you want in order to have a publishable result. But apart from that superficial difference, there are a lot of more substantial differences, and this paper goes into detail about them. It discourages this misinterpretation of the p-value as a probability, and it has a lot of additional benefits as well, for combining information across studies and for being able to better distinguish between really strong evidence and moderately strong evidence.

The other thing that surprise is useful for is thinking about the densities and distributions that we work with when we use probability in our machine learning. I'm going to borrow this example that I saw on Twitter and ask, just looking at these, which of these do you
think is a Gaussian? Which of these is a normal distribution, or a bell curve? To me they all look like bell curves in some loose sense, and it's really hard for me to tell. I would have probably said, yeah, those all look kind of like a Gaussian to me. But in fact only one of these is actually a Gaussian, and it's the one in the top left. Of the others, the logistic distribution is pretty close to the Gaussian in a lot of ways, but the Cauchy distribution is one that actually has infinite variance, so there's no standard deviation for this distribution, it's so wide. The beta distribution is the exact opposite: the distribution down in the bottom right corner is actually zero outside of the range from minus 4 to 4 that's in the center of the plot. That's a huge difference. Under this distribution, nothing outside of plus and minus 4 can be generated. But if we just look at these probability densities on their own, it's really hard to tell the difference. This seems really bad. It would be great if the way that we mathematically represented our distributions made these kinds of differences plain as day, the way they would be if you actually started sampling from these distributions.

Logarithms of densities are easier to compare, so this surprise is going to make our densities easier to compare to each other. The important differences in our probability distributions are often in the really unlikely events, in the tails. Is this a once-in-a-year event or a once-in-a-millennium event? The raw values are really close, right? 1 over 365 is only about 1 over 365 away from 1 over 365,000, thinking of these as numbers. But their logarithms are actually relatively far apart. There are basically three orders of magnitude between them, so if I use log base ten, the difference of those two logs is going to be three. That makes these things that are actually very, very different, that would be
experienced very, very differently if you were to draw samples, actually look different once we write out the numbers.

In addition, log densities are actually even easier to work with. A lot of densities have a very simple form in the log, like the Gaussian. On the left we have the usual way that people write the Gaussian distribution in terms of its probabilities. We've got an exponent, we've got an e on the bottom and then stuff on the top, and I look at that and I don't really know what that function looks like, or what's going to happen if I add two of them together or multiply two of them. That's kind of confusing. But if I take the logarithm, then that e disappears, and what I have instead is just a parabola, x minus mu squared. For me, parabolas are something that I encountered really early in my mathematics education, in grade school, so that's something I've gotten a chance to use a lot and that I'm comfortable and familiar with.

The types of densities that have these really neat, clean forms when you take the logarithm, or that are easy to work with mathematically once you take a logarithm, are called exponential families or log-linear families. They're all over the place in older-school, structured statistical approaches, and they bleed into deep learning and machine learning. If you look at variational autoencoders, there are going to be lots of pieces of them that use these exponential or log-linear families. They happen to be the class of densities that we can do math easily with, and that allows us to create machine learning algorithms with these densities. And they'll be easier to understand and easier to think about in a lot of cases if you look at them through their logarithms.

On this slide I told you that when you take the logarithm of the Gaussian distribution, it becomes a parabola. So looking at these: these are the same four distributions. I changed the order, so it's not, or maybe
I didn't, so it's not definitely the one in the top left. But these are the same four distributions, and now I'm showing you the logs instead of the raw densities. This basically just takes that previous plot and puts it on a log scale, a single line in Python. What's interesting about this to me is that they look super different from each other now. We can see that one of them is zooming off to infinity, the one in the top left. We can see that the one in the bottom left and the one in the top right are not going down nearly as quickly as the one in the bottom right. The one in the bottom right is going down really quickly, and the pace at which it is going down is getting faster and faster, right? The slope is increasing as we go out from the middle, and that's a characteristic of a parabola. Once you know what to look for, it's pretty easy to identify this guy in the bottom right-hand corner as a parabola, and that's our Gaussian. The others are the different ones, and it makes it clear that what's different between this logistic and this Cauchy, both of which looked pretty similar to the Gaussian before, is the way they go to zero as they go away from the center. There's actually a lot more that you can learn just from comparing these two ways of looking at distributions, and I recommend you check out this example blog post by this guy Ryan Moulton that goes into more detail about how to use these things. I'll pop a link to these slides into the chat once I get to the end.

Lastly, the great thing about these surprises is that they connect Bayes' rule and linear algebra. These are two of my favorite things, so anything that comes in the middle of them has got to be great. One pithy way to state this is that with surprises we can treat our beliefs like vectors. The reason why this happens is that logs turn multiplication into addition.
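
Looking back at the log-density comparison described a moment ago, here is a minimal sketch of how the tails separate once you take logs. It is not the code behind the slides: the function names are made up for illustration, all three distributions are assumed to be centered at 0 with unit scale, and the beta distribution is left out for brevity. It uses only the standard library.

```python
import math

def gaussian_logpdf(x, mu=0.0, sigma=1.0):
    # Log of the normal density: a downward parabola in x.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def logistic_logpdf(x, loc=0.0, scale=1.0):
    # Log of the logistic density: close to a straight line in |x| far from the center.
    z = abs(x - loc) / scale
    return -z - 2.0 * math.log1p(math.exp(-z)) - math.log(scale)

def cauchy_logpdf(x, loc=0.0, scale=1.0):
    # Log of the Cauchy density: falls off only like the log of x squared.
    z = (x - loc) / scale
    return -math.log(math.pi * scale) - math.log1p(z * z)

def surprise(logpdf, x):
    # Surprise in nats: the negative log density at x.
    return -logpdf(x)

# Surprise at a few points moving away from the center.
for x in (0.0, 2.0, 4.0, 8.0):
    row = [surprise(f, x) for f in (gaussian_logpdf, logistic_logpdf, cauchy_logpdf)]
    print(x, [round(s, 3) for s in row])

# Once-a-year versus once-a-millennium: close as raw numbers, far apart as logs.
raw_gap = 1 / 365 - 1 / 365000
log10_gap = math.log10(1 / 365) - math.log10(1 / 365000)
print(raw_gap, log10_gap)
```

Far from the center the Gaussian's surprise grows quadratically, the logistic's roughly linearly, and the Cauchy's only logarithmically, which is the difference that is hard to see in the raw densities.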
So on the left here, the negative logarithm takes things that are positive numbers and turns them into just any old number, some of them negative, some of them positive. And e to the minus x takes a number x that's positive or negative and makes it solely positive. These two things are inverses of each other, and this is called an isomorphism: we've mapped one onto the other. In particular, what this does is take addition in the positive and negative numbers, which you can basically think of as floats or real numbers, and turn it into multiplication in the positive-only numbers. That's in the top right corner of this slide: a plus b is turned into e to the minus a times e to the minus b. Where before we would have added two numbers, now we multiply those two numbers together.

If we use this, we can define a vector space of probability distributions in a straightforward way. What we do is take this e-to-the-minus operation and apply it to all vectors of length n, and that gives us all positive-only vectors of length n. We basically take that n-dimensional space over there on the left and map it into the upper right-hand quadrant, the positive-only quadrant. Then if we divide those by their sum, we end up with only vectors of length n that add up to 1, and those describe probability distributions on n outcomes. In this case n is 3, so this is all probability distributions with three possible outcomes. I guess there aren't three-sided dice, there aren't three-sided coins, but something that could take three possible outcomes. And I also have that negative log that takes me backwards from the positive-only vectors of length n to all vectors of length n; that would be an
arrow pointing to the left underneath e to the minus. So if I want to add together two probability distributions, if I want to define a notion of addition for probability distributions, I take their negative logarithms to turn them back into generic vectors of length n, and then I add those together. That allows me to define a notion of vector addition for probability distributions, and I can do the same thing for scalar multiplication.

The other nice thing about this way of thinking about probability distributions is that it's actually something you've seen before. If I combine these two operations, what I'm doing is taking e to the minus something and dividing by the sum of those things, and that combination is what people call the softmax function. e to the something divided by the sum of e to the something: that's softmax. So what softmax is doing is taking things that are like normal vectors and turning them into probability distributions. The new thing that I'm adding in this discussion is that I'm saying, let's not just think, "oh, I've taken these numbers and just normalized them so they sum to one." That's not that interesting. What's interesting is that all the things we can do to vectors of length n, right, we can add them together and subtract them and multiply them by numbers and apply matrices to them and calculate eigenvalues of those matrices, all these things we can now do with probability distributions. We just map them backwards with minus log, then we apply our operations, and then we come back if we want probability distributions at the end.

And the addition in this vector space is Bayesian updating. With probability distributions it would be multiplying two things together, but when we use minus log to turn them into vectors that we can add, then we're now adding them
together. So on the left-hand side we have the likelihood and the prior thought of as surprises, and on the right-hand side we have the likelihood and the prior thought of as probabilities. Minus log turns our probabilities into surprises and allows us to add them instead of multiplying them together, and all the normalization stuff gets included as part of that transformation once we turn things back into probability distributions.

This is part of how I think about lots of deep learning models that have a softmax in them, like some kinds of attention models and some kinds of categorical models, classification models. What they're doing is, inside the network, working to build up a nice representation of this vector space that's on the left-hand side here, where all the stuff that I want to do with complicated inference is just simple addition. Then at the very end I use something like softmax to turn it into a probability distribution to use it for something. This example comes from Tom Leinster at the n-Category Café. He goes into more detail about it and focuses on an example from statistical mechanics, from physics.

So the takeaways are that surprises are negative log probabilities, and they give us another lens for understanding uncertainty that complements normal probabilities. Some things that are really easy to understand with surprises are a lot harder to understand or to see with probabilities, and vice versa: some things, like taking an average, are actually kind of hard with surprises but really straightforward with probabilities. So it's not necessarily that one is better or worse than the other, it's just that they are both useful perspectives. When I need to do a problem in probability, when I need to understand what's going on with some new deep learning model that has some probabilistic components to it, I try to attack it with both. But the thing
that gives me a sense that maybe the log probability way of thinking about it is going to be more important is any time there's multiplication of probabilities, which is really often. It's in Bayes' rule, it's in independence assumptions, it's in conditional probabilities, all kinds of places. Any time there's multiplication, a logarithm is going to turn that into addition, and it's usually easier to think about adding things rather than multiplying them.

I wanted to put this out there just because you might see people talking about log probabilities when they're doing a derivation, or if you look in TensorFlow Probability or PyTorch's Pyro or any other probabilistic programming library, you're going to see log probabilities everywhere. Most people will say, "oh, it's for convenience, to simplify our derivation, to make things numerically easier on a computer, to make sure that there's no overflow and underflow," and people act like it's something to be embarrassed about, or unimportant. But there's actually a reason why it makes things easier, there's a reason why it makes things cleaner, and I think it's important to have that in mind so that in the future you can make other things easier and cleaner with log probabilities.

So hopefully that was useful. I'll include some links to additional blog posts and material that goes into greater depth on these things, and with that I'll take any questions that anybody's got. There are some questions in the ask-a-question tab, Charles, if you want to hop on that. And I don't see anything in the Slack so far. Okay.

So, a question from Money: why not one over p, why log one over p? Does the log help magnify that value to be able to see it closer? Yeah, actually, for lots of the things that I said, one over p would have worked pretty well. And logs and regular probabilities aren't actually the only two ways people think about these same ideas, and if I
remember correctly there are some ways that use inverse probabilities also. But the real bonus, which I probably got to a little bit after you asked this question, is that this logarithm is what gives you the turning of multiplication into addition. That also means that it turns things that were ratios into differences, so things that used to be ratio-spaced from each other become linearly spaced. This is something we've all had to learn during the coronavirus pandemic. People tend to show the cases by taking the logarithm of the number of cases, so they show log-scale plots, and that's because with something that grows exponentially, it's less useful to think about the absolute differences between things and more useful to think about their ratios. So another way of saying a lot of the things I said in this talk is that probabilities are the kind of thing where you care about the ratios, not just the absolute differences.

And yeah, for the other things, once this is posted I will include some links to additional stuff about softmaxes and about uncertainty and inference. I will say that I taught a course on Bayesian inference at Berkeley in the fall that presents things somewhat from this perspective, but actually maybe a better one would be Statistical Rethinking by Richard McElreath. That one is primarily in R, but people have translated it into TensorFlow Probability at least, and I think also Pyro. If you Google that, you should be able to find it. There are videos online, and in addition it's a textbook with exercises, so that would be where I would go to get some of this education.

Cool, thank you, Charles. Does anyone have any more questions for Charles before we [inaudible]? Something more for next time, Charles? All right, well, that's the end of this series, so I'll have to come up with something to talk about. I might talk about my thesis. Maybe
they'll give you a break and maybe have you on in, like, a couple of [inaudible]. Thank you, Charles.
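
To make the softmax and Bayesian-updating point from the talk concrete, here is a small sketch under stated assumptions. The prior and likelihood numbers are hypothetical, the helper names are chosen here for illustration, and every probability is assumed to be strictly positive so that the logarithm is finite. It checks that adding surprises and mapping back with a softmax of the negated vector gives the same posterior as multiplying probabilities and normalizing.

```python
import math

def surprise(probs):
    # Map probabilities to surprises (negative log probabilities, in nats).
    return [-math.log(p) for p in probs]

def softmax_of_negative(surprises):
    # exp(-s) for each entry, divided by the sum, so the result adds up to 1.
    # Subtracting the minimum first avoids underflow and leaves the result unchanged.
    m = min(surprises)
    weights = [math.exp(-(s - m)) for s in surprises]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: three hypotheses, one observation.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.7]  # P(data | hypothesis) for each hypothesis

# Usual Bayes' rule: multiply, then normalize.
unnormalized = [p * l for p, l in zip(prior, likelihood)]
posterior_direct = [u / sum(unnormalized) for u in unnormalized]

# Surprise version: add the two surprise vectors, then map back.
added = [a + b for a, b in zip(surprise(prior), surprise(likelihood))]
posterior_via_surprise = softmax_of_negative(added)

print(posterior_direct)
print(posterior_via_surprise)
```

The normalization step lives entirely inside the map back to probabilities, which matches the remark in the talk that "all the normalization stuff gets included as part of that transformation."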

Original Description

Charles Frye (he/him/his) is a researcher studying neural network optimization at the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley and a deep learning instructor at Weights & Biases. Slides: http://wandb.me/2020-04-28-salon-surprise The content of this and related lectures has now been packaged into a short course, "Math for Machine Learning": http://wandb.me/m4ml-videos 👩🏼‍🚀Weights and Biases: We’re always free for academics and open source projects. Email carey@wandb.com with any questions or feature suggestions. - Blog: https://www.wandb.com/articles - Gallery: See what you can create with W&B -https://app.wandb.ai/gallery - Continue the conversation on our slack community - http://wandb.me/slack

Applied Deep Learning - Rosanne Liu on AI Research (2019)
Weights & Biases
39 Making the Mid-career Leap from Urban Design to Deep Learning/Data Science
Making the Mid-career Leap from Urban Design to Deep Learning/Data Science
Weights & Biases
40 Organizing ML projects — W&B walkthrough (2020)
Organizing ML projects — W&B walkthrough (2020)
Weights & Biases
41 Brandon Rohrer — Machine Learning in Production for Robots
Brandon Rohrer — Machine Learning in Production for Robots
Weights & Biases
42 Nicolas Koumchatzky — Machine Learning in Production for Self-Driving Cars
Nicolas Koumchatzky — Machine Learning in Production for Self-Driving Cars
Weights & Biases
43 My experiments with Reinforcement Learning with Jariullah Safi
My experiments with Reinforcement Learning with Jariullah Safi
Weights & Biases
44 Applications of Machine Learning to COVID-19 Research with Isaac Godfried
Applications of Machine Learning to COVID-19 Research with Isaac Godfried
Weights & Biases
45 Testing Machine Learning Models with Eric Schles
Testing Machine Learning Models with Eric Schles
Weights & Biases
46 How Linear Algebra is not like Algebra with Charles Frye
How Linear Algebra is not like Algebra with Charles Frye
Weights & Biases
47 Predicting Protein Structures using Deep Learning with Jonathan King
Predicting Protein Structures using Deep Learning with Jonathan King
Weights & Biases
48 Rachael Tatman — Conversational AI and Linguistics
Rachael Tatman — Conversational AI and Linguistics
Weights & Biases
49 Reformer by Han Lee
Reformer by Han Lee
Weights & Biases
50 Sequence Models with Pujaa Rajan
Sequence Models with Pujaa Rajan
Weights & Biases
51 GitHub Actions & Machine Learning Workflows with Hamel Husain
GitHub Actions & Machine Learning Workflows with Hamel Husain
Weights & Biases
52 Look Mom, No Indices! Vector Calculus with the Fréchet Derivative by Charles Frye
Look Mom, No Indices! Vector Calculus with the Fréchet Derivative by Charles Frye
Weights & Biases
53 Jack Clark — Building Trustworthy AI Systems
Jack Clark — Building Trustworthy AI Systems
Weights & Biases
Surprising Utility of Surprise: Why ML Uses Negative Log Probabilities - Charles Frye
Surprising Utility of Surprise: Why ML Uses Negative Log Probabilities - Charles Frye
Weights & Biases
55 Track your machine learning experiments locally, with W&B Local - Chris Van Pelt
Track your machine learning experiments locally, with W&B Local - Chris Van Pelt
Weights & Biases
56 Antipatterns in open source research code with Jariullah Safi
Antipatterns in open source research code with Jariullah Safi
Weights & Biases
57 Attention for time series forecasting & COVID predictions - Isaac Godfried
Attention for time series forecasting & COVID predictions - Isaac Godfried
Weights & Biases
58 Made with ML - Goku Mohandas
Made with ML - Goku Mohandas
Weights & Biases
59 Angela & Danielle — Designing ML Models for Millions of Consumer Robots
Angela & Danielle — Designing ML Models for Millions of Consumer Robots
Weights & Biases
60 Deep Learning Salon by Weights & Biases
Deep Learning Salon by Weights & Biases
Weights & Biases

The video explains why surprise, defined as negative log probability, is a useful way to think about probability in machine learning, and shows how it underlies common algorithms and techniques such as maximum likelihood estimation, KL divergence, and the cross-entropy and squared-error losses.

Key Takeaways
  1. Define surprise as the logarithm of 1 over the probability: log(1/p) = -log p
  2. Derive cross entropy and squared error from the concept of surprise
  3. Apply maximum likelihood estimation and KL divergence in supervised and unsupervised learning
  4. Use logarithmic scale to analyze exponential growth
  5. Implement probabilistic programming libraries in machine learning pipelines
💡 Surprise is the negative log probability. Because the logarithm turns products into sums, probabilities that multiply become surprises that add.
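The takeaways above can be made concrete with a small sketch. It is not code from the talk. The function names (`surprise`, `cross_entropy`, `kl_divergence`, `gaussian_surprise`) and the example distributions are hypothetical, chosen only for illustration, and natural logarithms are assumed, so surprise is measured in nats.

```python
import math

def surprise(p):
    """Surprise of an event with probability p: log(1/p) = -log(p)."""
    return -math.log(p)

def cross_entropy(p, q):
    """Average surprise of model q on outcomes drawn from p."""
    return sum(pi * surprise(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Average surprise of p on its own outcomes."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """Extra average surprise from using q in place of p."""
    return cross_entropy(p, q) - entropy(p)

def gaussian_surprise(y, mu, sigma=1.0):
    """Surprise of y under a Gaussian: scaled squared error plus a constant."""
    return 0.5 * ((y - mu) / sigma) ** 2 + 0.5 * math.log(2 * math.pi * sigma ** 2)

# Probabilities of independent events multiply; their surprises add.
p_a, p_b = 0.5, 0.25
joint_surprise = surprise(p_a * p_b)
summed_surprise = surprise(p_a) + surprise(p_b)

# Hypothetical distributions over three outcomes (illustration only).
data = [0.7, 0.2, 0.1]
model = [0.5, 0.3, 0.2]
extra_surprise = kl_divergence(data, model)

# With sigma fixed, surprise differs between predictions only through squared error.
gap = gaussian_surprise(2.0, 0.0) - gaussian_surprise(0.0, 0.0)
```

Under these assumptions, minimizing the average surprise of a model on data is the same as maximizing its likelihood. For a fixed data distribution, the cross entropy and the KL divergence differ only by the entropy of the data, which does not depend on the model.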
