Random Forests : Data Science Concepts

ritvikmath · Intermediate ·📐 ML Fundamentals ·5y ago

Key Takeaways

The video explains how Random Forests work, including their use of decision trees, bagging, and random subspaces to reduce overfitting and improve generalization, with a focus on supervised learning and machine learning fundamentals.

Full Transcript

hey everyone welcome back in this video we're going to be talking about a very widely used machine learning model called the random forest now i have a lot of opinions on how the random forest model is considered by the machine learning community but i think they'll make more sense when we're finished with the content of this video so let's start with a real world example as always let's say that you're the data scientist for the education department in your city and your current goal is to build a model to figure out if a high school student will drop out before getting their degree now there's many features that you're going to use so for example you use their grades their gpa for example their demographic information their family income but you also have some features available to you that probably aren't going to be helpful like let's say their height their weight other things of that nature so let's say you have n students so these are sourced from a couple of different high schools in your city and let's say that there's p features where p is a relatively large number of features now let's say that your idea is i'm going to use the decision tree as my model so you go ahead and build your decision tree and i won't go into explaining how a decision tree works i have some videos on that which will be linked below but in a nutshell basically what it does is at each level it's going to split based on which feature is giving the most information or is the most helpful currently and then we work our way down like that so the next level it's going to pick which feature is the most helpful now and so on and so on until we get to the leaves of the decision tree at which point we're ready to say that yes the student will drop out or no the student will not drop out now let's talk about some of the pros of decision trees which is one of the reasons that people use them so much one of the big ones is that they're scale invariant what that means is that it doesn't
matter if your features are in meters or feet or degrees fahrenheit or degrees celsius it really doesn't matter at all it's not necessary as it is with some machine learning models in order to get all your features at the same scale it's going to work fine just the way it is another big pro is that it's robust to irrelevant features so like i said before we have some probably irrelevant features here we have the height of the student for example so we don't need to worry about taking that out beforehand and it negatively affecting the model because of the way i described decision trees if that feature is truly irrelevant it's just never going to get chosen to be split on so it's not doing any harm so you don't have to worry too much about weeding those out beforehand and i think the biggest pro of decision trees is their interpretability so what i mean by that is there's some machine learning models like svms or neural networks that there are some intuitions but it's really difficult to explain them to someone who's never really studied statistics before compare that with a decision tree you can basically just show them this output from your computer and say that i made the decision about whether the student's going to drop out or not by just following this decision path and that's a very natural way that people think about things basically it's a flow chart now as good as decision trees are there's one big con and that's why we need to use random forests the big con of decision trees is that they tend to overfit and i have a video on overfitting as well which i'm going to link below but the basic idea of overfitting is that the decision tree takes the data that's used to train it and learns it too well so that typically happens when the decision tree gets way too deep basically it's going to do really really well on your training set of n students here for example but when you try to use this on students outside of the sample maybe from other high schools it's
not going to generalize it's not going to do very well because it's learned these small patterns which actually probably are just accidental in your training set so there are ways that we try to prevent this for example decision tree pruning we try to limit the depth of the decision tree but the fact is that decision trees are still kind of prone to this problem of overfitting and that's where the random forest comes in so a random forest is called an ensemble method and if that word is unfamiliar to you an ensemble is just a collection of things that are all working together to a common goal so a random forest and that's where the forest part comes from is a collection of lots of decision trees that are working together how many decision trees we can choose that that's going to be b but it's typically a big number something in the hundreds or something in the thousands and the high level idea is that we're going to build thousands of these decision trees each one may be overfitted itself but if we consider them all together the final result the final prediction that they output will not be as prone to this overfitting problem so let me explain that in a little more depth by talking about the first addition that random forests offer on top of decision trees which is called bagging so this idea of bagging is not specific to decision trees or random forests it's really something used in general in machine learning but let me explain it in the context of random forests so we're going to train b trees again something in the hundreds or thousands this is the pseudocode we're going to say for i equals 1 2 all the way to b we're going to do the following three steps so the first thing we're going to do is split our data randomly into an 80 20 split so for example 80 20 the exact split is really up to you this is just for example so we're going to take these n students and randomly take 80 percent of them to be our training set 20 will be our testing set then we build a single
decision tree on this eighty percent of data we're going to be using for training and we call that single decision tree t sub i so for example if we're in the first step this will be t one then we measure the accuracy on the other twenty percent of data that we used for testing and we call that a one then we train the next decision tree so we're gonna go to i equals two we're gonna get a different 80 20 split so again this is random and we're going to build a second decision tree called t2 now since we used a different training set we're going to get a different tree t2 now after we finish this process we're going to have b decision trees so t1 all the way to tb and let's say a new student x comes along and we're trying to predict whether the student will drop out or will not drop out so what we do is we ask each individual decision tree so for each of these b decision trees we basically ask if you think the student will drop out or will not drop out and we call those decisions t one x all the way to t b x and in order to make a final decision we simply just take a majority vote if you're doing a regression problem you might take an average it really depends on your situation so we basically ask each of these trees what their decision is take a majority vote and that's going to be the decision we go with for the student now this also offers kind of an added bonus of prediction uncertainty for example let's say that we train a thousand decision trees in our random forest and let's say that when it comes time to ask if a given student will drop out or not drop out let's say that the vote is 600 for and 400 against now that is a very different situation from if it's 900 for and 100 against because in the 900 100 case we're a lot more confident about this prediction since there's only 10 percent of the trees are saying that it's going to go the other way versus the 600 400 case we're still going to say that the student will drop out because that's the majority but we can
assign a lower confidence since the number of trees that are for and against is closer together so before moving on to idea number two which makes this actually a random forest let's reiterate idea number one bagging the reason we do bagging is because although a single decision tree might be overfitted to the training data when we use thousands of these trees together although each one individually might be overfitted to its respective training set when we use all of them we kind of wash out that overfitting so that the final prediction we get is a lot more robust has a lot less variance than if we used only a single decision tree now there's one more problem here that we need to address it's possible that some of these features are more important for this set of n students but that doesn't generalize well to the overall high school student population in our city just to give a concrete example let's say that for these n students that we sampled family income is really really important in making our prediction about whether they're going to drop out or not but let's say that in general family income is not as important as it is for those n students so why is this a problem because that means that even though we're training many many of these trees and doing a different 80 20 split each time we're probably going to come up with family income as an important feature for most or all of our decision trees and what that's going to lead to is our decision trees becoming highly correlated with each other which means that although we are training a thousand of them all these decision trees if we inspect them look kind of the same and this is a situation we would like to avoid because the whole point with doing bagging is to build decision trees that are a little bit different that are not exactly identical to each other so that we can actually reduce the variance from the case of a single decision tree and actually get estimates of this prediction uncertainty so how are we
going to fix this problem the way we fix this problem is actually very parallel to the way that we did bagging bagging was used to basically randomly sample the rows and by rows i mean these n students the way we did that was basically doing a different 80 20 split each time now we can imagine randomly sampling the columns which means randomly sampling these p features so what we're going to do is for each of these decision trees that we build at each level when we think about which feature is the next one that we're going to split on we restrict the features that that decision tree is able to use at that point so we don't allow it to use all p features we randomly pick a subset of something smaller let's say p was equal to 100 we might only give it access to 10 features each time it's trying to split and decide which feature to split on this allows the model to better generalize what that means is that it's not always going to pick family income in every single tree now that we're restricting the features that it can split on at each point different decision trees will have access to a different set of these features and so we're going to introduce this variability this very necessary variability into our decision trees so that idea is called the random subspaces method so at each split we're only going to consider a subset of features and how do we know how many features to consider well these are some rules of thumb these are parameters that you're allowed to change these are just some rules of thumb if you're doing a classification problem people will typically pick the integer that's closest to square root of p so that's why i said that if we have a hundred features we typically allow 10 randomly chosen features at each step of the decision tree if you're doing a regression problem then people typically will use the integer closest to p divided by 3.
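As a rough illustration of the rules of thumb just described (about the square root of p for classification, about p divided by 3 for regression), here is a minimal Python sketch. The function names `n_features_per_split` and `sample_features` are made up for this example and are not from the video; the defaults are only starting points that you would normally tune.

```python
import math
import random


def n_features_per_split(p, task):
    """Rule-of-thumb number of features to consider at each split.

    Hypothetical helper: sqrt(p) for classification, p / 3 for regression,
    rounded to the nearest integer and never below 1.
    """
    if task == "classification":
        return max(1, round(math.sqrt(p)))
    if task == "regression":
        return max(1, round(p / 3))
    raise ValueError(f"unknown task: {task!r}")


def sample_features(p, m, rng):
    """Pick m distinct feature indices out of p, without replacement."""
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)
m_clf = n_features_per_split(100, "classification")
m_reg = n_features_per_split(100, "regression")
print(m_clf)  # 10
print(m_reg)  # 33

# A fresh random subset like this would be drawn at every split of every tree.
candidates = sample_features(100, m_clf, rng)
print(candidates)
```

Because a new subset is drawn at each split, a dominant feature such as family income is simply unavailable at many splits, which is what pushes the trees to differ from one another.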
so if this was a regression problem and we had p equals 100 features we might use something like 33 features but again these numbers are up to you they're something that you should actually vary and see how it changes the strength of your model so now just to recap these two modifications together the bagging idea and the random subspaces idea are what makes decision trees different from random forests and i want to make this really clear because here comes my opinions about random forests in the machine learning community i think there's a set of people who get a new data set and their first instinct is to just apply a random forest to it and that's not necessarily what you want to do the reason you don't want to do that is because if you don't know how a random forest works then you have no idea what you're actually doing and you might run into trouble down the road so i want everyone to fully understand how a random forest works and how it adds benefits on top of decision trees before you go ahead and just blindly apply it to whatever machine learning problem you might have so in a nutshell bagging allows us to introduce some variation on the rows so these n students and the random subspaces method allows us to introduce some variability in the columns or these p features and by introducing variability in these two dimensions that's where the random in random forests comes from by introducing variability in both dimensions we allow the final model the final random forest to better generalize to students that it's never seen before students outside of these n students in our sample and now this video wouldn't be complete if i didn't talk about some of the cons of random forests so it still has most of the same pros we eliminate this tendency to overfit by considering many trees instead of just a single tree but there are two big cons i can think of the first is pretty obvious it's computational complexity so let's say that a single decision tree took you an
hour to train well now you have a thousand decision trees so you do the math there it's gonna take a lot longer to train but the fact is that a single decision tree probably won't take you an hour with a strong enough computer it's probably gonna be somewhat tractable but i think the biggest con that people talk about and this is opinion number two of mine on how random forests get talked about in the ml community there's another group of people who just won't touch random forests at all they think that they're not interpretable decision trees were nice and interpretable why do we have to go and ruin them they won't even take a look at them so i will agree that the other con of random forest is that it does kind of take this interpretability away a little bit with a single decision tree we can just print out the decision tree and show someone and it's pretty obvious now we have like a thousand decision trees and it's harder to show everyone a thousand decision trees at a time but that doesn't mean that random forests have no interpretability the last thing we'll talk about is feature importance so we have these p features and we want to get some numerical measure about how important is each feature relative to the other features and if we have this we can say things like the gpa of the student was the most important predictor and maybe height was the least important predictor but how do we assign numerical values to this well it's a pretty simple four-step process and you can of course vary this in certain ways but here's the general flavor of the process the first step is we compute the accuracy on the ith training set so let me explain it in the pseudocode we randomly sample 80 20 we train on 80 and then we compute the accuracy on that training set so it's probably going to be really high because that's the exact training set used to build the model so it's probably going to be pretty strong the reason we do this is to compare it against the next thing that we do the
next thing we do is we permute the jth feature so let me make things concrete for a second let's say we're trying to determine the importance of gpa versus the importance of height so let's look at gpa first we're going to permute the gpa of all the students permute basically means that we're randomly going to shuffle the gpa of all the students this would be disastrous for the model if gpa was an important feature right think about that what i'm saying is that if gpa was a very important feature a very important predictor in most of our decision trees if i were to randomly jumble all the gpas of students then my accuracy should be dropping by a lot and that's exactly what i do next i compute the accuracy again so i use the same decision trees so t1 t2 all the way to tb and i apply them on the modified training set the modified training set being the same training set where i permuted all the gpas now since gpa was an important feature in my model my accuracy is going to drop by a lot because it really needed that gpa to be there so i'm going to get the accuracy of the model on the unmodified training set and i'm going to subtract the accuracy on the permuted training set and i'm going to do this for every single training set from 1 2 3 all the way to b and take the average over all training sets so what i'm expecting is that since gpa is a really important feature this is going to be a big drop in accuracy for all or most of my decision trees so this difference is going to be pretty big now let's say i'm trying to find the feature importance for height i'm going to do the same thing get the accuracy on the ith training set i'm going to permute all of the heights but because height was not an important factor to begin with it probably won't matter that much so now when i get the accuracy on these permuted training sets it's not going to change by a lot because i wasn't using that height for anything anyway probably so that change in accuracy averaged over all i
training sets is probably going to be rather low and this is the number that i use in order to judge the feature importance the more important features are the ones where this accuracy changes by a lot meaning that that feature was important for the model the less important features are the ones where the accuracy changes by barely anything which means that the models weren't really using that feature for anything anyway therefore it's not important so this is a very easy way to judge the importance of the features even using a model as complex as a random forest so what i would say to people who think random forests are not interpretable i think there are ways to get the interpretability of random forests don't be scared of them and for people who are just ready to apply a random forest to your data without even thinking about it i would say really understand the theory so the bagging and random subspaces idea behind random forests before you do that so hopefully this video helped you understand random forests why we use them and how they work any comments please leave them below if you like this video please like and subscribe for more videos just like this and i'll see you next
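To make the procedure in the transcript concrete, here is a small, self-contained Python sketch of it: b trees, each trained on a different random 80/20 split, each split restricted to m randomly chosen features, and a majority vote whose share doubles as a rough confidence. Everything in it is an illustrative assumption rather than the video's own code: the toy data (four made-up features, with dropout decided by gpa alone), the tiny tree learner, and all function names. It follows the transcript's 80/20 split; textbook bagging instead draws a bootstrap sample of the rows with replacement.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, m, rng, depth=0, max_depth=5):
    """Grow a small tree; each split may only look at m random features."""
    if depth == max_depth or len(set(y)) == 1:
        return majority(y)  # leaf: a plain label
    best = None
    for j in rng.sample(range(len(X[0])), m):  # random subspace for this split
        for t in sorted(set(row[j] for row in X)):
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority(y)
    _, j, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  m, rng, depth + 1, max_depth)
    return (j, t, grow(left), grow(right))  # internal node: a tuple


def predict_tree(tree, x):
    """Follow the decision path until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def train_forest(X, y, n_trees, m, seed=0, train_frac=0.8):
    """Train n_trees trees, each on its own random 80/20 split of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(train_frac * len(X))
        train, test = idx[:cut], idx[cut:]
        tree = build_tree([X[i] for i in train], [y[i] for i in train], m, rng)
        acc = sum(predict_tree(tree, X[i]) == y[i] for i in test) / len(test)
        forest.append((tree, acc))  # (t_i, a_i) in the transcript's notation
    return forest


def predict_forest(forest, x):
    """Majority vote plus the share of trees that agreed with it."""
    votes = Counter(predict_tree(tree, x) for tree, _ in forest)
    label, count = votes.most_common(1)[0]
    return label, count / len(forest)


# Toy data (assumed): columns are gpa, family income, height, weight.
data_rng = random.Random(1)
X = [[data_rng.uniform(0.0, 4.0), data_rng.uniform(20.0, 150.0),
      data_rng.uniform(150.0, 200.0), data_rng.uniform(45.0, 100.0)]
     for _ in range(120)]
y = [1 if row[0] < 2.0 else 0 for row in X]  # 1 = drops out

forest = train_forest(X, y, n_trees=15, m=2)
label, share = predict_forest(forest, [0.5, 60.0, 170.0, 65.0])
print(label, share)
```

The returned share plays the role of the 600 versus 900 votes in the transcript: the closer it is to 0.5, the less confident the forest is in its prediction.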

Original Description

How do random forests work?
Decision trees video: https://www.youtube.com/watch?v=kakLu2is3ds
Decision tree pruning video: https://www.youtube.com/watch?v=t56Nid85Thg
Overfitting video: https://www.youtube.com/watch?v=-JopeGg60QY


This video teaches how Random Forests work, including their use of decision trees, bagging, and random subspaces to reduce overfitting and improve generalization, with a focus on supervised learning and machine learning fundamentals. By understanding these concepts, viewers can build more accurate and robust machine learning models. The video also covers the importance of feature importance and interpretability in Random Forests.
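One way to make the variance-reduction claim concrete is a standard textbook identity, offered here as background rather than something stated in the video. Assume each of the $B$ trees' predictions $T_i(x)$ has the same variance $\sigma^2$ and every pair of trees has the same correlation $\rho$ (both symbols are assumptions introduced for this note). Then the variance of the averaged prediction is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{i=1}^{B} T_i(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding more trees shrinks only the second term, so highly correlated trees leave most of the variance in place. That is the motivation for sampling features at each split: it aims to lower $\rho$, not just to increase $B$.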

Key Takeaways
  1. Build a decision tree model
  2. Split data randomly into training and testing sets
  3. Build a single decision tree on the training set
  4. Measure accuracy on the testing set
  5. Repeat steps for multiple decision trees
  6. Use a majority vote or average to make a final decision
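  7. Restrict the features that each decision tree can use at each split to a random subset
  8. Randomly sample the columns (features) again at every split, not once per tree
  9. Compute accuracy on the ith training set
  10. Permute the jth feature, recompute the accuracy, and average the drop over all trees to get that feature's importance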
💡 Random Forests can reduce overfitting and improve generalization by combining multiple decision trees and using bagging and random subspaces, but may be less interpretable than single decision trees.
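Steps 9 and 10, the permutation approach to feature importance, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the data is made up, `stand_in_model` is a hand-written rule standing in for a trained forest, and the drop is averaged over repeated shuffles of one data set instead of over the b training sets described in the video. In practice the drop is often measured on held-out or out-of-bag rows rather than on the training rows.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == label for x, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, j, rng, n_repeats=10):
    """Average drop in accuracy after shuffling column j across the rows."""
    base = accuracy(predict, X, y)
    total_drop = 0.0
    for _ in range(n_repeats):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the label
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        total_drop += base - accuracy(predict, X_perm, y)
    return total_drop / n_repeats


# Toy data (assumed): column 0 is gpa, column 1 is height.
rng = random.Random(0)
X = [[rng.uniform(0.0, 4.0), rng.uniform(150.0, 200.0)] for _ in range(200)]
y = [1 if gpa < 2.0 else 0 for gpa, _ in X]  # 1 = drops out


def stand_in_model(x):
    """Hypothetical stand-in for a trained forest: it only looks at gpa."""
    return 1 if x[0] < 2.0 else 0


gpa_importance = permutation_importance(stand_in_model, X, y, 0, rng)
height_importance = permutation_importance(stand_in_model, X, y, 1, rng)
print(gpa_importance, height_importance)
```

Shuffling gpa should cost the stand-in model a large share of its accuracy, while shuffling height costs nothing because the model never reads it, which mirrors the gpa versus height contrast in the transcript.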
