Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial
Key Takeaways
This video tutorial implements a Decision Tree algorithm from scratch in Python using NumPy and scikit-learn, covering the concept, math, and implementation of the algorithm.
Full Transcript
hi everybody welcome to the second part of the decision tree tutorial if you haven't watched the first part then please do so because there I will explain the theory so here we continue with the implementation and we can start right away so we say import numpy s and p and then before we implement the decision tree class we will first create our entropy method to calculate the entropy and we implement this as global functions so we say define entropy and this will get a vector Y of all our class labels and let's have a look at the formula so we have to calculate the number of occurrences and we can do this with a function and we call this hist or histogram and we can use numpy pin count of Y so this will calculate the number of occurrences of all class labels and then we divide them by the number of total samples so and we say piece equals hist divided by length of Y and then we apply the actual formula so we say minus the sum of P of X times the lock of P of X so we can do this in one line and say return minus numpy sum and here we use list comprehensions so we can say P times numpy lock 2 of P for all peace in PPS and we also have to use a condition if P is greater than 0 because the lock is not defined for negative numbers so this is the entropy and now we also implement helped a class and call this note so here we will store the information for our note so this will get an init which gets self and then let's have a look at this so if we are in the middle then we want to store the best split feature and the best split threshold and we also want to store the left and the right child trees because we need them later and if we add and now if we are at a leaf node then we also want to store the actual value here so the most common class label so we say feature equals none threshold equals none left equals none right equals none and then we use a little trick here so we use an asterisk and a comma and then we say value equals none so now if we want to use this value parameter we have to use it as a keyword only parameter so later when we create our leaf node which only gets the value and we also have to write value equals something so then it's clear that this is a leaf node and here we simply store them so we say self feature equals feature self threshold equals threshold self left equals left self right equals right and self value equals value and now we also create a little helper function to determine if we are at a leaf node so define is leaf know which gets elf and here we simply say if if we are if we have a value then we are at a leaf node and otherwise not so we return self that value is not none so if we have a well you and we return true and this is our helper class for the notes and now we can start with the actual decision tree class so this also gets in in it which gets self and then it will get some stopping criteria so we call this min samples split and by default let's say this is two so the minimum samples required to further split our tree then the max depth and by default this is 100 and then also it gets a parameter that we call the number of features or n feets and this is none so we don't need this but we can specify it so as I said we do a greedy search over all the features but we can also only loop over a subset of number of features and then we randomly select this subset so this is one of the random factors and it's also one of the reasons why it's called random forest if we extend our decision trees to a random forest so this is one random factor and now we simply store them so we say self dot min sample split equals min samples sorry min sample split self dot max depth equals max depth self dot and features equals and features and we also create a route and but in the beginning this is none so we later need to know our route so that we know where we should start traversing our tree and now we implement the fit method which gets the training data and training labels so here we want to grow our tree and then it gets the predict method with test labels so here we want to traverse our tree so let's start with growing our tree so we say self dot root equals and now we call and create a helper function self dot grow tree which gets x and y and we also apply a safety check so we say self dot and feet feets equals x dot shape of one so this is a numpy and dra and the second dimension is the number of features if not self dot and feets so if this is not specified if this is none then we simply take the maximum number of features and otherwise we take the minimum of self and feet and X dot shape one so this just makes sure that it can never be greater than the actual number of features and now we implement the grow tree method which gets self and then it gets x and y and also also a depth this is zero in the beginning so we need to keep track of the depth and now let's do this so first let's get the number of samples and the number of features this is X dot shape and then we also want to get the number of different labels so this is the length of numpy unique of Y so all the different labels and now first what we do here is first we apply our stopping criteria Tyria so we say if and now let's again have a look what we said so we want to check for the maximum depth then the minimum samples required and if we have no more class distribution so we say if and we say depth is greater or equal themself max depth or if the number of different labels equals one so if we have only one class at this node or if we have number of samples is smaller than the minimum samples required so if this is true then we are at a leaf node so we say it's leaf value equals self dot most common label of this Y and now we create and return our leaf node so we say not return note and then we have to say value equals leaf value and now it gets clear why I use this asterisk so here I have to use the value as keyword and now it's clear that this is a leaf node and now we also need this helper function to say to get the most common label so let's write this down here so define most common label which gets self and then it gets a vector of the class labels and for this we use a Python module the counter modules we say from collections import counter so I talked about this in previous videos already please check that out if you're not familiar with the counter module so here we can create a counter object counter equals counter of Y so this will calculate all the the number of occurrences for all the Y's similar to the numpy pin count and then we have a most common function so we say most common equals counter dot most common of one so we only need the very most common label and this will return a list of tuples so we want to have the first element of the list so this is the very most common tuple and in the tuple there is the value stored and also the number of occurrences so we only need the value so we again say index 0 so please double check that for yourself and then we return this this will determine the most common label so now we have that here and now if we didn't meet the stopping criteria then we continue so first we select the feature indices so feature indices equals numpy random choice and this will get the number of features so it will select random numbers from between 0 and the number of features and the error array should be of size self dot and feets that we specified and we also say replace equals false because we don't have don't want to have the same indices multiple times and now with we do our greedy search so we say best Thresh and best well let's do it the other way around best feature and best Thresh equals self dot best criteria and this will get x and y and all the feature indices so this is another helper method so let's define it here define itself best criteria which gets self and x and y and the feature indices so here we do our greedy search so we say best gain equals minus 1 so we want to go over all the features and all the feature values and calculate the information gain so now let's say split index and split threshold equals none so both of them are none and then we have our first loop so for feature index in feature indices and now we only want to select the column vector of this X so we say X X column equals x off and we want to have all samples but only at this index and then we want to go over all the possible thresholds so we say thresholds equals numpy unique of X column so we don't want to check the same value twice so like in this example here we would check for 30 15 5 10 20 and 25 and now we go over all the possible thresholds so we say for Thresh hold in thresholds and now we calculate information gains we say gain equals self dot in for mation gain which gets why and the column vector and also the current thresholds and then we say if gain is greater than our best gain then our new best gain is the current gain and our best split index is the current feature index and our best split threshold is the current threshold and at the end we want to return them so we return the split index first and then the split threshold so this is the greedy search and now we need another helper function to calculate the information gained so let's say define inform a gain itself and then it gets why and an X column and the threshold so let's call a split Thresh and now let's say let's have a look at the formula again so mmm sorry so we want to calculate the entropy of the parent and then a weighted average of the children so we calculate the parent entropy then we generate a split then we calculate the weighted average of the child entropies and then we return the information gained so the parent entropy we can simply say parent entropy equals entropy because we already have this function so parent entropy of what entropy of Y then we generate our split so we say left indices and right indices equals self dot split and we split based on this X column and a split threshold so we need another helper function so I hope you can still follow me I hope I try to explain everything as clear as possible but the code is a little bit more complex than usual but yeah let's continue so let's say define split self and then it gets X column and our split threshold and here we can say left indices so here we apply our question and we can use a function that is called numpy arc where and here we asked if the value of x column is smaller or equal than our split threshold so this will return an array where with all wet all the these conditions are true for all the values in our X column and we want to flatten this array because we only want a 1d vector to please check it for yourself and we do the same for the right indices so we say numpy arc where and here we check if X column is greater then our split threshold and then we flatten this and then we return the left indices and the right indices so now we have the split function and now if we have to with this function we generate the splits here and first we check if length of the left indices is 0 or the length of the right indices is 0 then we can immediately return 0 as information gain and otherwise we continue so we calculate the weighted average so we need the number of total occurrences of the number of our samples so this is length of y and then the number of our left samples and the number of our right samples equals length left indices and length right indices then we calculate the entropy so we say e l and e r equals entropy of Y of all the left indices and entropy of Y of all the right indices and now our child entropy equals the weighted average so we say n L divided by n times the left entropy plus and now n R divided by n times the right entropy so this is the child entropy and now let's calculate the information gain this is just the parent entropy minus the child entropy and now we return it so returned the information gained so now we have this so now this function is complete and now we can have to continue with the growing so now after we have selected the best criteria we want to split our tree with this best feature and best threshold so we say left indices and right indices equals self dot split so again here we can use our split function so we say X off and now we have to be careful so now we want to have all samples but only the best feature index so this is a column and here we put in the best threshold and now with our left and right indices we can continue growing so we go to the left and say left equals and we call this function here in itself recursively so we say self dot grow tree and here we say X and then we only want the left indices but we want all the features so and that's why we want only want the left indices and then we also say depth plus one so now our depth is increased by one and now we do the same thing for the right side so we say write equals grow tree with the right indices and now we return a new note in the middle so this will get the best feature the best threshold and the left and the right child but no value here so this is the growing method and now the only thing left is to implement the predict method and I promise this will be fairly easy so we only need one helper function so we say return numpy array and here again we use list comprehensions so we traverse our tree so self dot Traverse tree where we put in one sample for all the samples in capital X so this is our predict method and now let's implement the Traverse tree method as last thing so this will get self and one sample and it also gets a note where we start so we have to put in self dot root in the beginning because we start at the top and here we also do this recursively so first we can check for the stopping criteria so we check if we have reached a leaf node so we can say if node is leaf node so that's why we implemented this helper function then we return note that value because if we are at a leaf we also have a value and otherwise we apply our question so we go to the left or the right so we say if X of this note the feature index that we start is smaller or equal then the threshold that we start at this node then we return self dot Traverse tree so we go to the left side off and we put in X and note left so that's why we start the left and the right notes at the notes and otherwise we return self dot Traverse tree of X and no dot right so yeah this is the whole Traverse implementation and now we are finally done this is the whole implementation that we need for a decision tree and let's check this so I've little riddle I've written a little helper script a test script where I implement this decision tree class and I use the data set the breast cancer data set from the scikit-learn module and I will split our data in training labels and and training samples and test levels and test samples then I will create our decision tree and fit the data then predicted test data and calculate the accuracy so let's check let's run this script and check if everything's working needs a little time to mmm-hmm sorry has Traverse tree missing one positional argument node self dot Traverse tree X oh self dot root must be here not here so let's clear this and give this another try and fingers trusts note object has no attribute feature index so this is how did we call it just feature sorry if note that feature threshold threshold threshold so one more time I hope I have no more typos so yeah now we have the accuracy and everything's working and yeah I hope you enjoyed this tutorial if you liked it please subscribe to the channel and in the next tutorial we will continue by extending this decision tree to the random forest model so yeah see you then bye
Original Description
Get my Free NumPy Handbook:
https://www.python-engineer.com/numpybook
In this Machine Learning from Scratch Tutorial, we are going to implement a Decision Tree algorithm using only built-in Python modules and numpy. We will also learn about the concept and the math behind this popular ML algorithm.
Part 1 will cover the theory, and Part 2 contains the implementation.
~~~~~~~~~~~~~~ GREAT PLUGINS FOR YOUR CODE EDITOR ~~~~~~~~~~~~~~
✅ Write cleaner code with Sourcery: https://sourcery.ai/?utm_source=youtube&utm_campaign=pythonengineer *
📓 Notebooks available on Patreon:
https://www.patreon.com/patrickloeber
⭐ Join Our Discord : https://discord.gg/FHMg9tKFSN
If you enjoyed this video, please subscribe to the channel!
The code can be found here:
https://github.com/patrickloeber/MLfromscratch
You can find me here:
Website: https://www.python-engineer.com
Twitter: https://twitter.com/patloeber
GitHub: https://github.com/patrickloeber
#Python #MachineLearning
----------------------------------------------------------------------------------------------------------
* This is a sponsored link. By clicking on it you will not have any additional costs, instead you will support me and my project. Thank you so much for the support! 🙏
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Patrick Loeber · Patrick Loeber · 30 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
▶
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Lists in Python - Advanced Python 01 - Programming Tutorial
Patrick Loeber
Tuples in Python - Advanced Python 02 - Programming Tutorial
Patrick Loeber
Dictionaries in Python - Advanced Python 03 - Programming Tutorial
Patrick Loeber
Sets in Python - Advanced Python 04 - Programming Tutorial
Patrick Loeber
Strings in Python - Advanced Python 05 - Programming Tutorial
Patrick Loeber
Collections in Python - Advanced Python 06 - Programming Tutorial
Patrick Loeber
Itertools in Python - Advanced Python 07 - Programming Tutorial
Patrick Loeber
Lambda in Python - Advanced Python 08 - Programming Tutorial - Map Filter Reduce
Patrick Loeber
Exceptions in Python - Advanced Python 09 - Programming Tutorial
Patrick Loeber
Logging in Python - Advanced Python 10 - Programming Tutorial
Patrick Loeber
JSON in Python - Advanced Python 11 - Programming Tutorial
Patrick Loeber
Random Numbers in Python - Advanced Python 12 - Programming Tutorial
Patrick Loeber
Decorators in Python - Advanced Python 13 - Programming Tutorial
Patrick Loeber
Generators in Python - Advanced Python 14 - Programming Tutorial
Patrick Loeber
Threading vs Multiprocessing in Python - Advanced Python 15 - Programming Tutorial
Patrick Loeber
Threading in Python - Advanced Python 16 - Programming Tutorial
Patrick Loeber
Multiprocessing in Python - Advanced Python 17 - Programming Tutorial
Patrick Loeber
Function arguments in detail - Advanced Python 18 - Programming Tutorial
Patrick Loeber
The asterisk (*) operator in Python - Advanced Python 19 - Programming Tutorial
Patrick Loeber
Shallow vs Deep Copying in Python - Advanced Python 20 - Programming Tutorial
Patrick Loeber
Context Managers in Python - Advanced Python 21 - Programming Tutorial
Patrick Loeber
KNN (K Nearest Neighbors) in Python - Machine Learning From Scratch 01 - Python Tutorial
Patrick Loeber
Linear Regression in Python - Machine Learning From Scratch 02 - Python Tutorial
Patrick Loeber
Logistic Regression in Python - Machine Learning From Scratch 03 - Python Tutorial
Patrick Loeber
Linear and Logistic Regression in 60 lines of Python - Machine Learning From Scratch 04
Patrick Loeber
Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial
Patrick Loeber
Perceptron in Python - Machine Learning From Scratch 06 - Python Tutorial
Patrick Loeber
SVM (Support Vector Machine) in Python - Machine Learning From Scratch 07 - Python Tutorial
Patrick Loeber
Decision Tree in Python Part 1/2 - Machine Learning From Scratch 08 - Python Tutorial
Patrick Loeber
Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial
Patrick Loeber
Random Forest in Python - Machine Learning From Scratch 10 - Python Tutorial
Patrick Loeber
PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial
Patrick Loeber
K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial
Patrick Loeber
Anaconda Tutorial - Installation and Basic Commands
Patrick Loeber
PyTorch Tutorial 01 - Installation
Patrick Loeber
PyTorch Tutorial 02 - Tensor Basics
Patrick Loeber
PyTorch Tutorial 03 - Gradient Calculation With Autograd
Patrick Loeber
PyTorch Tutorial 04 - Backpropagation - Theory With Example
Patrick Loeber
PyTorch Tutorial 05 - Gradient Descent with Autograd and Backpropagation
Patrick Loeber
PyTorch Tutorial 06 - Training Pipeline: Model, Loss, and Optimizer
Patrick Loeber
PyTorch Tutorial 07 - Linear Regression
Patrick Loeber
PyTorch Tutorial 08 - Logistic Regression
Patrick Loeber
PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training
Patrick Loeber
PyTorch Tutorial 10 - Dataset Transforms
Patrick Loeber
Download Images With Python Automatically - Python Web Scraping Tutorial
Patrick Loeber
PyTorch Tutorial 11 - Softmax and Cross Entropy
Patrick Loeber
Select Movies with Python - Web Scraping Tutorial
Patrick Loeber
PyTorch Tutorial 12 - Activation Functions
Patrick Loeber
List Comprehension in Python - A Python Feature You MUST KNOW - Python Tutorial
Patrick Loeber
PyTorch Tutorial 13 - Feed-Forward Neural Network
Patrick Loeber
How To Add A Progress Bar In Python With Just One Line - Python Tutorial
Patrick Loeber
PyTorch Tutorial 14 - Convolutional Neural Network (CNN)
Patrick Loeber
The Walrus Operator - New in Python 3.8 - Python Tutorial
Patrick Loeber
PyTorch Tutorial 15 - Transfer Learning
Patrick Loeber
YouTube Data API Tutorial with Python - Analyze Channel Statistics - Part 1
Patrick Loeber
YouTube Data API Tutorial with Python - Find Channel Videos - Part 2
Patrick Loeber
YouTube Data API Tutorial with Python - Get Video Statistics - Part 3
Patrick Loeber
YouTube Data API Tutorial with Python - Analyze the Data - Part 4
Patrick Loeber
AdaBoost in Python - Machine Learning From Scratch 13 - Python Tutorial
Patrick Loeber
Ultimate FREE Study Guide for Machine Learning and Deep Learning
Patrick Loeber
More on: ML Maths Basics
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Bloom Filters, Explained Properly
Dev.to · Daksh Gargas
Prefix Sums: The Preprocessing Trick That Makes Range Queries Instant
Medium · Programming
I Thought I Was Ready for the Interview — Then One Simple Math Question Destroyed Me
Medium · Programming
Week 2(Day 10): LeetCode Two Pointers(slow & fast): Remove Duplicates from Sorted Array (Brute…
Medium · Python
🎓
Tutor Explanation
DeepCamp AI