Code With Me : Decision Trees
Key Takeaways
The video demonstrates how to code a decision tree from scratch using Python, pandas, and matplotlib, covering topics such as data preprocessing, entropy calculation, and decision tree implementation.
Full Transcript
[Music] hey everyone welcome back so today we're going to be learning how to code a decision tree from scratch in python uh so the goal of this video is to show you how to go from the theory of a decision tree to actually coding your very first decision tree the goal of the video is not to give you any code that you're actually going to use in your machine learning projects and the reason is that when you think about languages like python and r they already have some pretty um efficient built-in libraries for decision trees and so you'll use that in practice but this is going to show you what's happening under the hood and so for today's example we'll use the very first example in our original decision trees theory video which is going to be linked below and that is trying to figure out if a mystery fish is a salmon or a tuna and let's say this is our data set so all the little red dots are salmon all the little blue dots are tuna and let's just say we have two variables to keep things simple for today the length of the fish and the weight of the fish we see that this is a pretty good candidate data set for a decision tree and the reason is that we see some pretty clear-cut horizontal and vertical boundaries that are going to separate our space really well for example if i asked you where's the best first split well that's actually a two-parter the first question is which variable do you want to split weight or length and the second follow-up question is which value do you want to split it at and i think the answer many of us would give is i want to split at length equals three and the reason we want to do that is because everything to the left of that where the length of the fish is less than three those are all categorized as salmon and on the right of that we can do a little bit more work so the next step would probably be let's check if weight is less than four because everything that is length greater than three but weight less than 4 is going to be categorized as a tuna and then above that we can do more work so this is generally how a decision tree works and so when we go start writing our code we're going to be looking for these same splits that we found intuitively all right so let's hop over to the code and start writing our first decision tree okay so i have the code here all i've done is imported some libraries that we're going to use i have pulled in the data frame called fish.csv where all of our data is living i'll make that available to you as well and this is what the data frame looks like so we have two variables length and weight and we have the type of the fish which is a string that's either tuna or salmon and below i've plotted the same exact graph we just saw before now before we get into any of the modeling i think something i want to do by the way i haven't thought about this problem too much i'm pretty much doing it in real time here um the first thing i want to do it's not going to be helpful to have this string be tuna or salmon let's change it to zeros and ones so we can work with it a lot easier so we'll do df dot type which is the type of the fish is going to be df.type.apply lambda t and then we'll do 1 if t is equal to salmon and then otherwise it's going to be zero so the supply function basically takes a whole column of your data frame this column currently consisting of either salmon or tuna and it applies the following transformation if that word is salmon it puts one otherwise it puts a zero so now if i run this and then i ask for the data frame again we see the type is now full of zeros and ones which are going to be easier to work with so now i think let's write some skeleton code for the decision tree and then i'm going to suspect we'll need some helper functions along the way but we'll worry about those a little later let me leave some room for those helper functions so let's say let's first define the features so features is equal to length and weight and the way we're going to go about it we're trying to find the first best split and again that's a two-part question the first one is which variable do you want to split on and the second one is which value of that variable do you want to split on so that sounds like two for loops to me so the first for loop will say for f in features so let's try every possible feature and then we're going to try every possible value of that feature starting at the lowest so let's do cur is equal to df f.min so let's get the lowest value of length first let's do step is equal to 0.1 so step is the increment along which i travel that variable the smaller this is the better because you're going to find the exact split but that also means that it's going to be a lot less efficient because it has to check a lot more splits so let's start with point one and see if it does the job for us and then we'll say while we'll say while cur is less than dff.max so basically we're going to stop when we reach the maximum value of this variable there's no point continuing after that and inside here is where we want to get the entropy from splitting this data frame at that split at that variable and at that current value and so that'll be a helper function but we'll say cur entropy is equal to get entropy from split and i think we'll need to pass in the data frame of course we'll need to pass in the feature we're using and we'll need to pass in the current value of that feature and so this function will give us back the resulting entropy if we split the data frame at that variable at that current value of that variable and now we'll need to check if the current entropy we just got is less than the best entropy so far because we're trying to find the best split and our metric is entropy so i think let's define a dictionary above that's going to collect the running best information so we'll say best params is equal to this dictionary should contain the feature that's the best so far it'll be none to start with it should contain the value of that feature that's the best to split on and we should also collect the entropy itself so let's say that's np.infinity so that whatever comes after that will be initialized as the best and so what we need to do is basically check if cur entropy if the current entropy we got from this particular situation is less than the best entropy so far so let's put that here that means this is even better than the best split we have so let's change everything in the best params dictionary to reflect that so say best params feature is going to be equal to f the current feature best params val is going to be equal to the current value and best params entropy will be equal to current entropy so now we have successfully changed all those if this was better than the current and now we'll of course need to increment curve so we don't get an infinite while loop so we're going to increment curve by step and then the while loop should terminate at some point and then the outer loop should terminate at some point and then at the end let's print out best params so let's give that a go well it's going to fail right now because we haven't initialized this function get entropy from split so let's define that next so def get entropy from split and we said it takes in a data frame it takes in a feature and it takes any value so this is again asking the question of if i split this data frame using that feature at that value what is the resulting entropy and so it's going to split into two sub data frames and we basically need to get the entropy from the left hand side and the right hand side and then take some kind of weighted average so let's first get the left hand side so we'll do left df is equal to df such that df feature is less than val and actually let's just get the type so left types dot types so what i'm doing here is saying that we're going to get the left hand side of the split and we're going to get all the types of fish over there and they're again zeros and ones so we're getting all the zeros and ones that live in the left hand side of this potential split we can also get right types in the exact same way so this would just be greater than or equal to and now we need to get the entropy of these two lists i think let's write one more helper function just to keep everything organized so we're going to say left entropy is going to be equal to get entropy a function we'll write in just a moment and we'll put in left types here and right entropy we'll be putting in here so again i'm getting the entropy from the left and right hand collections of types that live in those splits okay so now we basically need to take a weighted average of these guys and the weights are how many samples live in the left versus how many samples live in the right the proportion of those are going to be our weights so left prop is going to be the length of left types in other words how many samples got put on the left using this split and divide that by the length of the data frame so that's the proportion of samples that live on the left and right prop we could calculate it the same way but we can also do a little shortcut that says right prop is going to be you know what let's just use what i was going to do is 1 minus left prop because that would be correct but just to keep everything symmetric let's just do it this way so the length of the right types divided by lendia and then we'll simply return left prop times left entropy plus right crop times right entropy okay good so this we've got a function here and the last thing we need to do is define this get entropy function and so this is kind of our base function that's driving the whole algorithm this is just going to take in a list of zeros and ones and give us what is the entropy of that list okay so it's going to take in some vowels which is a list of zeros and ones and what it does first we'll need to calculate the proportion of zeros or actually proportion of ones that are in the list so we'll do p is equal to np dot mean vowels the reason this works is because if the list is known to contain only zeros and ones then asking for the mean of that list is the exact same question as asking for the proportion of ones in that list and go back to your formula on entropy if you've forgotten it but to get the entropy we would basically need to do return negative p times np dot log 2 p minus minus 1 minus p times np dot log to 1 minus p and even if you're not familiar with this formula i can show you that it does work and in fact we should check it to make sure i didn't make any mistakes here and one mistake i already see is that if vals is all zeros or all ones then p is gonna be zero or one and you can't take the log of zero so we'll need to include a default case that says if p equals equals zero or p equals equals one then return zero because a list that contains all the same number has exactly zero entropy so it's perfect information basically so let's give that a try and to test it out i'll do get entropy on let's say zero zero zero zero zero zero so this should give me zero of course if i had one one that should increase the entropy by a little bit in indeed the entropy increases if i add one more one that should increase the entropy again good that's working and if i add one more one this is the worst case scenario because we have zero information it's it's a perfect case of maximum entropy so yes that gives me one and i should see them go back down as i introduce more ones back to 0.91 back to 0.65 and this should give me zero okay good so that's working which is a good thing to check if that wasn't working everything would fail basically so that's working we got our second helper function here and then this should run now oops we have a data frame object has no attribute types so i meant to write the word type somewhere which is probably up here yeah this should be with no s and right okay i don't know how that got there that looks better okay so now it's running cool so we got our first answer so when we ran this we found that the best first place to split would be on the variable length and the value you should split at is three point it's really three but because of our step size it was a little bit above but this is telling us that the best first place to split if i scroll back up here is length is equal to three so i'll um impose this line in post-production but we see that matches up with our prior understanding and now just to close this video out and we see the entropy here is .68 and now just to close this video out how would we do the next split so i'm going to do it in kind of a hacky way just because this is a tutorial and not writing any production level code but what we need to do now is basically get the data frame that is on the right hand side of that split notice everything on the left hand side we're good to go it's correctly classified everything on the right hand side is where we need to continue doing work so let's just put in a little thing here that says cur df is equal to df.copy so we're not modifying the original data frame and then we'll say cur df is equal to cur df such that occur df dot length is greater than three okay so we have that now i need to put curdif everywhere that i put oops everywhere that i put df are those the only places oh no there's one more here okay let's give that a whirl now it says that after you do that initial length equals three split the next place you should split is weight is equal to four let's confirm that so on this right hand part of the data if we use weight which is the y variable equals four that's the next split and let's just do one more for fun because i think we're starting to get the idea it's just going to mean my condition here gets a little bit uglier again i want to emphasize that if you do choose to take this code i would strongly encourage you to modify it so that this is not a manual process you can automate everything i'm doing right here so if curt is greater than three and the weight is let's see what do i want to do here is greater than four yeah and that becomes my new data frame oops and now let's see now it says the next place you to split after that is length equals seven so if i go up here that would be length is equal to right here right here that makes sense because after i do that everything on the right hand side of that becomes red and everything on the left-hand side we can continue working on so i think i'll stop here um i think you're starting to get the idea if you want to code a decision tree from scratch or at least understand how it works under the hood it's checking all your different variables in this case we only had two but in general you have many more variables and it's going to do it in a more efficient way than this for loop technique we wrote but this for loop technique does identify how we're checking every variable checking various different splits of those variables finding the ones such that after we do that split the entropy has become minimized okay so um hopefully you learned just a little bit about how decision trees work under the hood if you did please subscribe for more videos just like this and i would love feedback on this particular code with me format because it's pretty new and yeah cool see you next time
Original Description
Coding a decision tree from scratch!
Decision Tree Video : https://www.youtube.com/watch?v=kakLu2is3ds
Link to Code: https://github.com/ritvikmath/YouTubeVideoCode/blob/main/Decision%20Tree%20Coding.ipynb
Link to Fish Data: https://github.com/ritvikmath/YouTubeVideoCode/blob/main/fish.csv
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from ritvikmath · ritvikmath · 0 of 60
← Previous
Next →
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Math Team Update
ritvikmath
Single Variable Calculus Volume of a Sphere - Proof 1
ritvikmath
Single Variable Calculus Volume of a Sphere - Proof 2
ritvikmath
Multivariable Calculus Volume of a Sphere Proof - Triple Integrals
ritvikmath
Multivariable Calculus Volume of a Sphere Proof - Double Integrals
ritvikmath
The Euclidian Algorithm
ritvikmath
Proving the Chain Rule
ritvikmath
Proving the Fundamental Theorem of Calculus Part 1
ritvikmath
Proving the Fundamental Theorem of Calculus Part 2
ritvikmath
Math Puzzle - Poison Perplexity
ritvikmath
Math Puzzle - Poison Perplexity - Solution
ritvikmath
Expected Value and Variance of Continuous Random Variables (Calculus)
ritvikmath
Expected Value and Variance of Discrete Random Variables (No Calculus)
ritvikmath
Array Method
ritvikmath
Complex Power Series and their Derivatives
ritvikmath
Distributions - Intro
ritvikmath
The Poisson Distribution
ritvikmath
The Bernoulli Distribution
ritvikmath
The Binomial Distribution
ritvikmath
The Continuous Uniform Distribution
ritvikmath
The Geometric Distribution
ritvikmath
The Triangular Distribution
ritvikmath
The Exponential Distribution
ritvikmath
The Borel Distribution + Notes on Poisson Distribution
ritvikmath
The Gamma Distribution
ritvikmath
The Normal Distribution
ritvikmath
The Laplace Distribution
ritvikmath
The Chi - Squared Distribution
ritvikmath
Overfitting
ritvikmath
Vector Norms
ritvikmath
Truths Behind the Titanic : K-Nearest Neighbor
ritvikmath
The Mathematics of Breakups
ritvikmath
Sillyfish
ritvikmath
Finding Optimal Paths - Dynamic Programming
ritvikmath
HowToDataScience : Scraping Twitter Data
ritvikmath
Decision Trees
ritvikmath
Perceptron
ritvikmath
Naive Bayes
ritvikmath
K-Nearest Neighbor
ritvikmath
Evaluating Machine Learning Models
ritvikmath
Decision Tree Pruning
ritvikmath
K-Means Clustering
ritvikmath
Gaussian Mixture Model
ritvikmath
Data Science - Fuzzy Record Matching
ritvikmath
Time Series Talk : Autocorrelation and Partial Autocorrelation
ritvikmath
Time Series Talk : Autoregressive Model
ritvikmath
Time Series Talk : Moving Average Model
ritvikmath
Time Series Talk : ARMA Model
ritvikmath
Time Series Talk : ARCH Model
ritvikmath
Time Series Talk : White Noise
ritvikmath
Time Series Talk : Stationarity
ritvikmath
Time Series Talk : ARIMA Model
ritvikmath
Time Series Talk : Lag Operator
ritvikmath
Time Series Talk : What is Seasonality ?
ritvikmath
Time Series Talk : Seasonal ARIMA Model
ritvikmath
So ... What Actually is a Matrix ? : Data Science Basics
ritvikmath
Derivative of a Matrix : Data Science Basics
ritvikmath
Basics of PCA (Principal Component Analysis) : Data Science Concepts
ritvikmath
Eigenvalues & Eigenvectors : Data Science Basics
ritvikmath
The Covariance Matrix : Data Science Basics
ritvikmath
More on: ML Maths Basics
View skill →Related Reads
📰
📰
📰
📰
Overfitting and Underfitting: When a Model Memorizes Too Much or Learns Too Little
Medium · AI
Overfitting and Underfitting: When a Model Memorizes Too Much or Learns Too Little
Medium · Machine Learning
Overfitting and Underfitting: When a Model Memorizes Too Much or Learns Too Little
Medium · Data Science
Overfitting and Underfitting: When a Model Memorizes Too Much or Learns Too Little
Medium · Python
🎓
Tutor Explanation
DeepCamp AI