Random Forests : Data Science Concepts
Key Takeaways
The video explains how Random Forests work, including their use of decision trees, bagging, and random subspaces to reduce overfitting and improve generalization, with a focus on supervised learning and machine learning fundamentals.
Full Transcript
Hey everyone, welcome back. In this video we're going to be talking about a very widely used machine learning model called the random forest. I have a lot of opinions on how the random forest model is considered by the machine learning community, but I think they'll make more sense when we're finished with the content of this video.

Let's start with a real-world example, as always. Say you're the data scientist for the education department in your city, and your current goal is to build a model to figure out whether a high school student will drop out before getting their degree. There are many features you're going to use: for example their grades (their GPA), their demographic information, and their family income. You also have some features available that probably aren't going to be helpful, like their height, their weight, and other things of that nature. Let's say you have n students, sourced from a couple of different high schools in your city, and p features, where p is a relatively large number.

Now let's say your idea is to use a decision tree as your model, so you go ahead and build your decision tree. I won't go into explaining how a decision tree works; I have some videos on that, which will be linked below. In a nutshell, at each level it splits on whichever feature gives the most information, or is the most helpful, at that point. Then we work our way down like that: at the next level it picks whichever feature is the most helpful now, and so on, until we get to the leaves of the decision tree, at which point we're ready to say that yes, the student will drop out, or no, the student will not drop out.

Now let's talk about some of the pros of decision trees, which are among the reasons people use them so much. One of the big ones is that they're scale invariant. What that means is that it doesn't
matter if your features are in meters or feet, or degrees Fahrenheit or degrees Celsius. Unlike with some machine learning models, you don't need to get all your features onto the same scale; it's going to work fine just the way it is.

Another big pro is that it's robust to irrelevant features. Like I said before, we probably have some irrelevant features here, the height of the student for example. We don't need to worry about taking that out beforehand, or about it negatively affecting the model. Because of the way I described decision trees, if that feature is truly irrelevant it's just never going to get chosen to be split on, so it's not doing any harm. You don't have to worry too much about weeding those features out beforehand.

I think the biggest pro of decision trees is their interpretability. What I mean by that is that for some machine learning models, like SVMs or neural networks, there are some intuitions, but it's really difficult to explain them to someone who's never really studied statistics before. Compare that with a decision tree: you can basically just show them the output from your computer and say, "I made the decision about whether the student is going to drop out by following this decision path." That's a very natural way that people think about things. Basically, it's a flow chart.

Now, as good as decision trees are, there's one big con, and that's why we need random forests. The big con of decision trees is that they tend to overfit. I have a video on overfitting as well, which I'm going to link below, but the basic idea of overfitting is that the decision tree takes the data that's used to train it and learns it too well. That typically happens when the decision tree gets way too deep. It's going to do really, really well on your training set of n students here, for example, but when you try to use it on students outside of the sample, maybe from other high schools, it's
not going to generalize. It's not going to do very well, because it's learned small patterns that are probably just accidental in your training set. There are ways we try to prevent this, for example decision tree pruning, where we try to limit the depth of the decision tree, but the fact is that decision trees are still prone to this problem of overfitting. That's where the random forest comes in.

A random forest is what's called an ensemble method. If that word is unfamiliar to you, an ensemble is just a collection of things that are all working together toward a common goal. So a random forest, and that's where the "forest" part comes from, is a collection of lots of decision trees that are working together. How many decision trees? We can choose that; it's going to be b, but it's typically a big number, something in the hundreds or the thousands. The high-level idea is that we're going to build thousands of these decision trees. Each one may be overfitted itself, but if we consider them all together, the final prediction that we output will not be as prone to this overfitting problem.

Let me explain that in a little more depth by talking about the first addition that random forests offer on top of decision trees, which is called bagging. This idea of bagging is not specific to decision trees or random forests; it's something used in machine learning in general. But let me explain it in the context of random forests.

We're going to train b trees, again something in the hundreds or thousands. This is the pseudocode: for i equal to 1, 2, all the way to b, we're going to do the following three steps. The first thing we're going to do is split our data randomly into an 80/20 split. The exact split is really up to you; 80/20 is just an example. So we take these n students and randomly take 80 percent of them to be our training set, and the other 20 percent will be our testing set. Then we build a single
decision tree on the 80 percent of the data we're using for training, and we call that single decision tree T_i. For example, if we're in the first step, this will be T_1. Then we measure the accuracy on the other 20 percent of the data that we set aside for testing, and we call that A_1.

Then we train the next decision tree. We go to i equal to 2 and get a different 80/20 split, since again this is random, and we build a second decision tree called T_2. Since we used a different training set, we're going to get a different tree.

After we finish this process we're going to have b decision trees, T_1 all the way to T_b. Let's say a new student x comes along and we're trying to predict whether the student will drop out or not. What we do is ask each individual decision tree: for each of these b decision trees, we ask whether it thinks the student will drop out or will not drop out. We call those decisions T_1(x) all the way to T_b(x). To make a final decision we simply take a majority vote. If you're doing a regression problem you might take an average; it really depends on your situation. So we ask each of these trees what its decision is, take a majority vote, and that's the decision we go with for the student.

This also offers an added bonus: prediction uncertainty. For example, let's say that we train a thousand decision trees in our random forest, and when it comes time to ask whether a given student will drop out, the vote is 600 for and 400 against. That is a very different situation from 900 for and 100 against. In the 900/100 case we're a lot more confident about the prediction, since only 10 percent of the trees are saying it's going to go the other way. In the 600/400 case we're still going to say that the student will drop out, because that's the majority, but we can
assign a lower confidence, since the numbers of trees for and against are closer together.

Before moving on to idea number two, which is what actually makes this a random forest, let's reiterate idea number one, bagging. The reason we do bagging is that although a single decision tree might be overfitted to the training data, and although each of the thousands of trees might individually be overfitted to its respective training set, when we use all of them together we wash out that overfitting. The final prediction we get is a lot more robust and has a lot less variance than if we used only a single decision tree.

Now there's one more problem here that we need to address. It's possible that some of these features are more important for this set of n students in a way that doesn't generalize well to the overall high school student population in our city. To give a concrete example, let's say that for the n students we sampled, family income is really, really important in predicting whether they're going to drop out, but that in general family income is not as important as it is for those n students.

Why is this a problem? Because it means that even though we're training many of these trees and doing a different 80/20 split each time, we're probably going to come up with family income as an important feature for most or all of our decision trees. What that leads to is our decision trees becoming highly correlated with each other, which means that although we are training a thousand of them, all these decision trees look kind of the same if we inspect them. This is a situation we would like to avoid, because the whole point of doing bagging is to build decision trees that are a little bit different, that are not exactly identical to each other, so that we can actually reduce the variance relative to a single decision tree and actually get estimates of this prediction uncertainty. So how are we
going to fix this problem? The way we fix it is very parallel to the way we did bagging. Bagging was used to randomly sample the rows, and by rows I mean these n students; we did that with a different 80/20 split each time. Now we can imagine randomly sampling the columns, which means randomly sampling these p features.

What we're going to do is this: for each of the decision trees that we build, at each level, when we think about which feature is the next one to split on, we restrict the features that the decision tree is able to use at that point. We don't allow it to use all p features; we randomly pick a smaller subset. Let's say p was equal to 100: we might give it access to only 10 features each time it's deciding which feature to split on.

This allows the model to generalize better. What that means is that it's not always going to pick family income in every single tree. Now that we're restricting the features it can split on at each point, different decision trees will have access to different sets of features, and so we introduce this very necessary variability into our decision trees. That idea is called the random subspaces method: at each split we only consider a subset of the features.

How do we know how many features to consider? There are some rules of thumb. These are parameters that you're allowed to change; they're just rules of thumb. If you're doing a classification problem, people will typically pick the integer closest to the square root of p. That's why I said that if we have a hundred features, we typically allow 10 randomly chosen features at each step of the decision tree. If you're doing a regression problem, then people typically use the integer closest to p divided by 3.
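The two ideas described so far, sampling rows per tree and sampling features per split, can be sketched together in a few lines. This is a minimal illustration using only the Python standard library, not how any particular library implements random forests. The toy data (a GPA column and a height column), the function names, the depth limit of 5, and the choice of 25 trees are all assumptions made for the example. It follows the video's random 80 percent row sample for each tree and leaves out the step of scoring each tree on its 20 percent hold-out; many library implementations instead draw a bootstrap sample with replacement.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_sub, rng, depth=0, max_depth=5):
    """Grow one tree; each split may only use n_sub randomly chosen features."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)  # leaf node
    best = None
    for j in rng.sample(range(len(rows[0])), n_sub):  # random subset of columns
        # Every value except the largest is a threshold with two non-empty sides.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # every sampled feature was constant in this node
        return majority(labels)
    _, j, t = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in lo], [y for _, y in lo], n_sub, rng, depth + 1, max_depth),
            build_tree([r for r, _ in hi], [y for _, y in hi], n_sub, rng, depth + 1, max_depth))

def predict_tree(node, x):
    """Follow the decision path; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def train_forest(rows, labels, n_trees, seed=0):
    """Bagging (random 80% of rows per tree) plus random feature subsets per split."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    n_sub = max(1, round(math.sqrt(p)))  # rule of thumb for classification
    trees = []
    for _ in range(n_trees):
        idx = rng.sample(range(n), int(0.8 * n))
        trees.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_sub, rng))
    return trees

def predict_forest(trees, x):
    """Majority vote, plus the share of trees that agree with it."""
    votes = Counter(predict_tree(t, x) for t in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)

# Hypothetical toy data: [gpa, height_cm]; a student drops out (1) when gpa < 2.0.
data_rng = random.Random(42)
students = [[data_rng.uniform(0.0, 4.0), data_rng.uniform(150.0, 200.0)] for _ in range(200)]
dropped_out = [1 if gpa < 2.0 else 0 for gpa, _ in students]

forest = train_forest(students, dropped_out, n_trees=25)
print(predict_forest(forest, [0.2, 170.0]))  # (predicted label, vote share)
```

The vote share returned by `predict_forest` is the rough confidence measure from the 600/400 versus 900/100 discussion: the closer it is to 0.5 in a two-class problem, the less the trees agree.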
So if this was a regression problem and we had p equal to 100 features, we might use something like 33 features. But again, these numbers are up to you. They're something that you should actually vary, to see how that changes the strength of your model.

Now just to recap: these two modifications together, the bagging idea and the random subspaces idea, are what make random forests different from decision trees. I want to make this really clear, because here come my opinions about random forests in the machine learning community. I think there's a set of people who get a new data set and whose first instinct is to just apply a random forest to it, and that's not necessarily what you want to do. The reason is that if you don't know how a random forest works, then you have no idea what you're actually doing, and you might run into trouble down the road. So I want everyone to fully understand how a random forest works, and how it adds benefits on top of decision trees, before you go ahead and blindly apply it to whatever machine learning problem you might have.

In a nutshell, bagging allows us to introduce some variation in the rows, these n students, and the random subspaces method allows us to introduce some variability in the columns, these p features. Introducing variability in these two dimensions is where the "random" in random forests comes from. By introducing variability in both dimensions, we allow the final random forest to generalize better to students it's never seen before, students outside of the n students in our sample.

This video wouldn't be complete if I didn't talk about some of the cons of random forests. A random forest still has most of the same pros, and we reduce the tendency to overfit by considering many trees instead of just a single tree, but there are two big cons I can think of. The first is pretty obvious: computational complexity. Let's say that a single decision tree took you an
hour to train. Well, now you have a thousand decision trees, so you do the math there: it's going to take a lot longer to train. But the fact is that a single decision tree probably won't take you an hour. With a strong enough computer, it's probably going to be somewhat tractable.

I think the biggest con that people talk about, and this is opinion number two of mine on how random forests get talked about in the ML community, is interpretability. There's another group of people who just won't touch random forests at all. They think that they're not interpretable: decision trees were nice and interpretable, so why do we have to go and ruin them? They won't even take a look at them. I will agree that the other con of random forests is that they do take this interpretability away a little bit. With a single decision tree we can just print out the tree and show someone, and it's pretty obvious. Now we have a thousand decision trees, and it's harder to show anyone a thousand decision trees at a time. But that doesn't mean that random forests have no interpretability.

The last thing we'll talk about is feature importance. We have these p features and we want some numerical measure of how important each feature is relative to the other features. If we have this, we can say things like "the GPA of the student was the most important predictor, and maybe height was the least important predictor." But how do we assign numerical values to this? It's a pretty simple four-step process. You can of course vary it in certain ways, but here's the general flavor.

The first step is to compute the accuracy on the ith training set. Let me explain it in terms of the pseudocode: we randomly sample 80/20, we train on the 80 percent, and then we compute the accuracy on that training set. It's probably going to be really high, because that's the exact training set used to build the model. The reason we do this is to compare it against the next thing that we do. The
next thing we do is to permute the jth feature. Let me make things concrete for a second. Let's say we're trying to determine the importance of GPA versus the importance of height, and let's look at GPA first. We're going to permute the GPA of all the students; "permute" just means that we're going to randomly shuffle the GPAs among the students. This would be disastrous for the model if GPA was an important feature, right? Think about that. What I'm saying is that if GPA was a very important predictor in most of our decision trees, then if I randomly jumble all the students' GPAs, my accuracy should drop by a lot.

That's exactly what I check next: I compute the accuracy again. I use the same decision trees, T_1, T_2, all the way to T_b, and I apply them to the modified training set, which is the same training set with all the GPAs permuted. Since GPA was an important feature in my model, my accuracy is going to drop by a lot, because the model really needed that GPA to be there.

So I take the accuracy of the model on the unmodified training set and subtract the accuracy on the permuted training set. I do this for every single training set, from 1, 2, 3, all the way to b, and take the average over all the training sets. What I'm expecting is that since GPA is a really important feature, there's going to be a big drop in accuracy for all or most of my decision trees, so this difference is going to be pretty big.

Now let's say I'm trying to find the feature importance for height. I do the same thing: get the accuracy on the ith training set, then permute all of the heights. But because height was not an important factor to begin with, it probably won't matter that much. When I get the accuracy on these permuted training sets, it's not going to change by a lot, because I probably wasn't using height for anything anyway. So that change in accuracy, averaged over all b
training sets is probably going to be rather low. This is the number that I use to judge the feature importance. The more important features are the ones where the accuracy changes by a lot, meaning that the feature was important for the model. The less important features are the ones where the accuracy barely changes, which means that the models weren't really using that feature for anything anyway, and therefore it's not important. So this is a very easy way to judge the importance of the features, even for a model as complex as a random forest.

So what I would say to people who think random forests are not interpretable is this: there are ways to get interpretability out of random forests, so don't be scared of them. And to people who are ready to apply a random forest to their data without even thinking about it, I would say: really understand the theory, the bagging and random subspaces ideas behind random forests, before you do that.

Hopefully this video helped you understand random forests, why we use them, and how they work. If you have any comments, please leave them below. If you liked this video, please like and subscribe for more videos just like this, and I'll see you next
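The four-step feature importance procedure described in the transcript can be sketched as follows. This is a minimal illustration using only the Python standard library. To keep it short, it does not train a forest: the "trees" are hand-written stand-in rules that threshold GPA and ignore height, and the toy data, thresholds, and function names are all assumptions made for the example.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(trees, train_sets, j, rng):
    """Average drop in training accuracy after shuffling feature j.

    trees[i] is a prediction function and train_sets[i] is the
    (rows, labels) pair that tree i was trained on.
    """
    drops = []
    for predict, (rows, labels) in zip(trees, train_sets):
        base = accuracy(predict, rows, labels)        # step 1: accuracy as-is
        column = [r[j] for r in rows]
        rng.shuffle(column)                           # step 2: permute feature j
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(base - accuracy(predict, permuted, labels))  # steps 3 and 4
    return sum(drops) / len(drops)                    # average over all trees

# Hypothetical toy data: [gpa, height_cm]; a student drops out (1) when gpa < 2.0.
rng = random.Random(0)
students = [[rng.uniform(0.0, 4.0), rng.uniform(150.0, 200.0)] for _ in range(200)]
dropped_out = [1 if s[0] < 2.0 else 0 for s in students]

# Stand-in "trees": rules on GPA with slightly different thresholds (assumed, not trained).
thresholds = [1.9, 2.0, 2.1]
trees = [lambda x, t=t: 1 if x[0] < t else 0 for t in thresholds]

# Each stand-in tree gets its own random 80% sample of the rows.
train_sets = []
for _ in trees:
    idx = rng.sample(range(len(students)), 160)
    train_sets.append(([students[i] for i in idx], [dropped_out[i] for i in idx]))

gpa_importance = permutation_importance(trees, train_sets, 0, rng)
height_importance = permutation_importance(trees, train_sets, 1, rng)
print("gpa:", gpa_importance, "height:", height_importance)
```

Because the stand-in rules never look at height, shuffling the height column cannot change any prediction, so its importance comes out as exactly zero, while shuffling GPA pushes accuracy toward chance and produces a large drop.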
Original Description
How do random forests work?
Decision trees video: https://www.youtube.com/watch?v=kakLu2is3ds
Decision tree pruning video: https://www.youtube.com/watch?v=t56Nid85Thg
Overfitting video: https://www.youtube.com/watch?v=-JopeGg60QY