K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial
Key Takeaways
The video demonstrates the implementation of the K-Means clustering algorithm using Python and NumPy, covering the concept, math, and iterative optimization technique behind this popular ML algorithm. The tutorial provides a step-by-step guide on how to initialize cluster centers, update cluster labels and centroids, and determine convergence.
Full Transcript
hi everybody and welcome to a new machine learning from scratch tutorial today we are going to implement the k-means algorithm using only built-in Python modules and numpy the goal of the k-means is to cluster a data set into K different clusters and here we have a unlabeled data set so we are dealing with an unsupervised learning technique and each sample should be assigned to the cluster with the nearest mean so let's have a look at some images to see what this means so here we have our unlabeled data and now in this case we want to find three different clusters so it should look like this and then we assign the labels to the closest cluster so to the center of the closest cluster so yeah this is what we want to do and how are we going to achieve this so this is an iterative optimization technique so first we initialize our cluster centers so we randomly pick some samples and say these are our first centers and then we do these two steps until we are converged so first we update our cluster labels which means we assign the points to the nearest cluster Center and the cluster Center is also called centroid and next we update our centroids so now we set the center the new center to the mean of each cluster and we iteratively do this until there's no more change so let's again have a look at the images to see how this is working so first we have our unlabeled data set and now we randomly pick three centroids so here I've drawn them so I hope you can see them so these are our initial centroids and now we assign the labels to the of the data to the to the label of the closest centroid so this is our first initialization now we start optimizing so now we update our centroids we calculate the new mean for each cluster so I think this centroid will be moved to something like here and this will maybe maybe be here and the Green centroid will be moved to maybe here so let's see what's happening yeah so this is the new centroids and now we update our label our label so now we check which is the closest centroid for each label so maybe these will become blue and more of these will become orange so yeah this is the next step and now again we update our centroids so I think this one will further move two to the right and this one will move up here and this one will maybe stay the same so yeah this is the next step and now again we update the labels so I think these will become orange now and these will become blue and now we are almost done so maybe here are some slight shifts so again update the centroids then again the labels then the centroids then the labels and now there's no more change so now we are converged and yeah so this is the whole approach and the only math that we need for this is the Euclidean distance so I already showed you this in the tutorial about the K nearest neighbor algorithm the Euclidean distance between two vectors is defined as the square root of the sum over the squared distances so this is all we need so now we can start so let's import numpy s and P of course and then I will say I will set a random seat to let's say 40 - you don't need this but I want to rep my data later and since we are using a random initialization I want to to have the same results or I want to to reproduce my results so I will set a random seat here and now let's define our euclidean distance function first so Euclidean distance of two vectors X 1 and X 2 so this will be a global function and here we have to implement the formula that I just showed you and we can do this in one line so we say return and numpy first we have the square root off and then we have the sum numpy sum and then we have the sum over all the distances so we can say X 1 minus X 2 to the power of 2 so the squared distances so this is the function and now we can implement our k-means class so this will have an init which has self and then it will get a K so this will be the number of clusters and by default let's say it's 5 then it will get a max ITER's so this is the maximum number of iterations we want to do for our optimization and by default let's say it's 100 and then it will also get a boolean plot steps equals false so you don't need this but I'm going to implement an additional function here to plot the different steps like we've just seen so yeah so first we set them or we store them so we say self dot K equals K self dot max ITER's equals max ITER's and self dot plot steps equals plot steps and now we create our empty clusters and centroids so self dot clusters equals oops equals and now here we want to be careful because this is important so this is a list of lists so a list of sample indices for each cluster and in the beginning each cluster has an empty list so we say and we use list comprehension and then say we say have an empty list for underscore in range self dot k so for each cluster we initialize an empty list and then we say our self dot centroids equals and this will also be an empty list and here we are going to store the feature vectors or the mean feature vector for each cluster so mean feature vector for each cluster so here we are having actual samples and here we just stored the indices so this is important and now we can continue so now usually we would implement a fit and a predict method but since we are having a unsupervised learning technique here we and we don't have any label data we just have to implement the predict method so and we don't need this the fit method so let's define our predict method with self and X and here first let's store our data self x equals x and then the dimensions so self number of samples and self dot number of features equals x dot shape because we need this later and as always this is a numpy and d array and so yeah so now let's do the steps we just talked about so first we have to initialize our our centroids and then we do a optimization so here we can say for underscore in range self dot max ITER's and now here in our loop first we update our clusters so let's say update clusters and then we update the centroids and then we check for convergence check if converged and if so then we break and at the end we want to classify the samples as the index of their clusters so here we say return returned cluster labels so this is what we have to do so now let's start so let's say we want to initialize our centroids so we want to randomly pick some samples so let's say random sample indices equals and now we use list comprehension so or no sorry here I can use numpy dot random dot choice and this will get self dot number of samples and self dot k and we also have to say replace equals false because we don't want to pick the same indices twice so this will be an array of size self dot k and for each entry it will pick a random choice between zero and the number of samples and now we assign m the according sample that belongs to this index to our centroids so we say self table centroids equals and now you use list comprehension so we say self dot X of the current index for index in random sample indices so this is the initialization of our centroids and now we can do the optimization so first we say we update our clusters so we say self dot clusters equals self dot create clusters and this will get self table centroids so this is a helper function create clusters that we are going to implement now so define create clusters with self and the centroids so here we assign the samples to the closest centroids to create our clusters so first we have an empty list of Lists for our clusters and now we iterate over our data so we say for index and sample in enumerate self dot X so this enumerate function will give us the current index and the current sample now we want to get the closest centroid and we want to have the index of this and we say centroid index equals self dot closest centroid and this we have get the current sample and then it will get the centroid so this will be another helper function that will we'll create in a second but now let's continue here so when we have the centroid index we append or we take the current cluster so the clusters of this centroid index and then we append the current index so we put the current sample index in the closest cluster and then we return our clusters so this is how we create our clusters and now we need the define closest centroid function which gets self and a sample and the centroids and here we calculate the distances of the current sample to each centroid and then want to get the the centroid or the index of the centroid which has the the closest distance so let's calculate all the distance since with list comprehension so here we use the Euclidean distance function that we already have so Euclidean distance of a sample and of each centroid point so point four point in centroids and then we want me so we have all the distances here and now we want to see which is the minimum or the index with the minimum distance so we can use numpy art min here so we say closest index equals numpy art min of this distances and then we simply return it so return the closest index so now we have this and now we created our clusters and now we can continue with our optimization so here we up our centroids as the next steps but before we want to store the centroid so let's say centroids volts equals self dot centroids so that we can check for conversions later and then we say self dot centroids equals self dot get centroids of this and this will get self clusters so this is another function this will assign the mean value of the clusters to the centroid so for each cluster we now calculate the mean so let's define this define get nose define get centroids which get self and the clusters so here we initialize our centroids with zeroes in the beginning so let's say centroids equals numpy zeros and this will be of size self k and self dot and features and here this should be a two pulse and we have to be careful here so for each cluster we will store the feature electron so that's why it has to have this dimensions and now we iterate over the clusters so we say for cluster index and cluster in enumerate clusters and then we calculate the cluster means a cluster mean equals numpy of our self dot X of this cluster indices and this should be along the first axis so again let's have a look at what this means so our as I said our clusters is a list of lists so if we have just a current cluster then this is a list and this is a list of the indices that are in this cluster so if we call self dot X with this indices then it will only return the samples that are in the current clusters in the current cluster and then we calculate the mean so this is what what's going to happen here and now once we have our mean we assign it to the current cluster so to the current centroid so we say centroids of the current cluster index equals cluster mean and then we are done and can return the centroids so yeah so now we have our new centroids and now we check if we are converged so we say if self dot is converged then we will break so we can stop here and this will get the centroids old and the new new ones so self dot centroids and this is another helper function this will simply calculate the distances between each old and new centroids for all the centroids and check if this is 0 so define is converged self and centroids old and new centroids so again we calculate the distances with list comprehension henshin this turn says equals and here we calculate calculate the Euclidean distance of centroids old of I so the current centroid vector and the new one so centroids I for I in range self dot K so for each cluster it will look at the old and the new centroids vector and calculate the Euclidean distance and store it in this list and then we can return sum of distances equals equals zero so this is a built-in function that will iterate over this entries and calculate or sum it up and so if this is zero then there is no more change in our centroids so we say it is converged so yeah so now we have this and now we want to return the cluster labels so let's say return self dot get cluster labels and this will get self dot clusters so here for each sample we will get the label of the cluster it was assigned to so let's create this right here so define get cluster labels with self and the clusters and so first we say our labels equals a number I empty array of size [Music] self-taught number of samples so for each sample we want to return which is the the cluster it was assigned to but be careful here because these labels are not the actual labels of our data because we don't know them so this is just the index of the cluster it was assigned to so yeah so now we iterate over the clusters so for cluster index and also cluster in enumerate clusters and then we iterate over the current cluster so for sample index in cluster so again this is a list of Lists and each cluster has a list has all the the sample indices of the samples that are in this cluster so for each sample index in this current cluster we say labels of the sample index equals the current cluster index and then we return our labels and now we are done so one more thing that I want to implement but you don't need this is to have a plot function so define plot and self and here we are going to use matplotlib so I can say import much plot loop i plot as PLT and i'm not going to explain the details here but if you want to see more tutorials about my plot lip then please leave some comments so let's implement the plot method so here I simply want to plot and take the data and to which a cluster it belongs and then also the centroids so let's create our figure so thick and x equals P LT dots up plots and let's give this a fixed size off size let's say 12 by 8 and then we iterate over our clusters so for I and index in cell in in numerate self dot clusters and now we get the current points of point equals self dot X off the index but we have to transpose it here and now we scatter the point so a X dot scatter and here I unpack at the point and now so this will plot all the points and and for each cluster it will use a different color and now we do the we plot all the centroids so for point in self dot centroids a X dot scatter and again we unfold our point and then we say marker equals x so it has a marker sign and color equals black and line with equals to so and then we have to say PLT dot show of course and so this is the plot function and now during our optimization if we set plot steps to true then we want to plot after we updated our clusters so we say if self dot plot steps then self dot plot and again we also want to plot after we updated our centroids so we put it right here and now we are done so now let's run this and I hope that everything is working so let's clear this and run our script and oh I missed a comma here 937 update centroids of centroids all equals the current ones so again let's try this and now it's working so here we have I created a dataset with four different clusters and we can see that it correctly could you identify them so let's set plot steps to true and don't plot it at the end and now let's run it again to see the different steps so here this is after our initialization so we randomly picked some samples and set these are our first centroids so maybe the initialization is not very good here but it could find out the clusters correctly later so let's see what's happening so now we are so then we are starting to optimize so now in the next step we are calculating the new centroids so I think the the centroid of the blue cluster is moving to maybe over here and the orange centroid is moving to down here so let's look at the next step so yeah and now we are updating the labels so I think these here are no longer orange but red and green so yeah and also some of them are now red and now we update our centroids again so I think this we'll further move over here this will move down here and these will move a little bit so yeah and now we again update our labels so I think more of them will become red and more of them here will become green and so and now again we update our centroids so I think this will further move over here the red centroid will move over here this is already good and the green one or moves over here I think so yep that's that and now again we update the labels now again the centroids and the labels and now I think we are converged so yeah this is how the k-means works and I hope you understood everything and if you liked it please subscribe to the channel and see you next time bye
Original Description
Get my Free NumPy Handbook:
https://www.python-engineer.com/numpybook
In this Machine Learning from Scratch Tutorial, we are going to implement a K-Means algorithm using only built-in Python modules and numpy. We will also learn about the concept and the math behind this popular ML algorithm.
~~~~~~~~~~~~~~ GREAT PLUGINS FOR YOUR CODE EDITOR ~~~~~~~~~~~~~~
✅ Write cleaner code with Sourcery: https://sourcery.ai/?utm_source=youtube&utm_campaign=pythonengineer *
📓 Notebooks available on Patreon:
https://www.patreon.com/patrickloeber
⭐ Join Our Discord : https://discord.gg/FHMg9tKFSN
If you enjoyed this video, please subscribe to the channel!
The code can be found here:
https://github.com/patrickloeber/MLfromscratch
You can find me here:
Website: https://www.python-engineer.com
Twitter: https://twitter.com/patloeber
GitHub: https://github.com/patrickloeber
#Python #MachineLearning
----------------------------------------------------------------------------------------------------------
* This is a sponsored link. By clicking on it you will not have any additional costs, instead you will support me and my project. Thank you so much for the support! 🙏
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Patrick Loeber · Patrick Loeber · 33 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
▶
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Lists in Python - Advanced Python 01 - Programming Tutorial
Patrick Loeber
Tuples in Python - Advanced Python 02 - Programming Tutorial
Patrick Loeber
Dictionaries in Python - Advanced Python 03 - Programming Tutorial
Patrick Loeber
Sets in Python - Advanced Python 04 - Programming Tutorial
Patrick Loeber
Strings in Python - Advanced Python 05 - Programming Tutorial
Patrick Loeber
Collections in Python - Advanced Python 06 - Programming Tutorial
Patrick Loeber
Itertools in Python - Advanced Python 07 - Programming Tutorial
Patrick Loeber
Lambda in Python - Advanced Python 08 - Programming Tutorial - Map Filter Reduce
Patrick Loeber
Exceptions in Python - Advanced Python 09 - Programming Tutorial
Patrick Loeber
Logging in Python - Advanced Python 10 - Programming Tutorial
Patrick Loeber
JSON in Python - Advanced Python 11 - Programming Tutorial
Patrick Loeber
Random Numbers in Python - Advanced Python 12 - Programming Tutorial
Patrick Loeber
Decorators in Python - Advanced Python 13 - Programming Tutorial
Patrick Loeber
Generators in Python - Advanced Python 14 - Programming Tutorial
Patrick Loeber
Threading vs Multiprocessing in Python - Advanced Python 15 - Programming Tutorial
Patrick Loeber
Threading in Python - Advanced Python 16 - Programming Tutorial
Patrick Loeber
Multiprocessing in Python - Advanced Python 17 - Programming Tutorial
Patrick Loeber
Function arguments in detail - Advanced Python 18 - Programming Tutorial
Patrick Loeber
The asterisk (*) operator in Python - Advanced Python 19 - Programming Tutorial
Patrick Loeber
Shallow vs Deep Copying in Python - Advanced Python 20 - Programming Tutorial
Patrick Loeber
Context Managers in Python - Advanced Python 21 - Programming Tutorial
Patrick Loeber
KNN (K Nearest Neighbors) in Python - Machine Learning From Scratch 01 - Python Tutorial
Patrick Loeber
Linear Regression in Python - Machine Learning From Scratch 02 - Python Tutorial
Patrick Loeber
Logistic Regression in Python - Machine Learning From Scratch 03 - Python Tutorial
Patrick Loeber
Linear and Logistic Regression in 60 lines of Python - Machine Learning From Scratch 04
Patrick Loeber
Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial
Patrick Loeber
Perceptron in Python - Machine Learning From Scratch 06 - Python Tutorial
Patrick Loeber
SVM (Support Vector Machine) in Python - Machine Learning From Scratch 07 - Python Tutorial
Patrick Loeber
Decision Tree in Python Part 1/2 - Machine Learning From Scratch 08 - Python Tutorial
Patrick Loeber
Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial
Patrick Loeber
Random Forest in Python - Machine Learning From Scratch 10 - Python Tutorial
Patrick Loeber
PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial
Patrick Loeber
K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial
Patrick Loeber
Anaconda Tutorial - Installation and Basic Commands
Patrick Loeber
PyTorch Tutorial 01 - Installation
Patrick Loeber
PyTorch Tutorial 02 - Tensor Basics
Patrick Loeber
PyTorch Tutorial 03 - Gradient Calculation With Autograd
Patrick Loeber
PyTorch Tutorial 04 - Backpropagation - Theory With Example
Patrick Loeber
PyTorch Tutorial 05 - Gradient Descent with Autograd and Backpropagation
Patrick Loeber
PyTorch Tutorial 06 - Training Pipeline: Model, Loss, and Optimizer
Patrick Loeber
PyTorch Tutorial 07 - Linear Regression
Patrick Loeber
PyTorch Tutorial 08 - Logistic Regression
Patrick Loeber
PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training
Patrick Loeber
PyTorch Tutorial 10 - Dataset Transforms
Patrick Loeber
Download Images With Python Automatically - Python Web Scraping Tutorial
Patrick Loeber
PyTorch Tutorial 11 - Softmax and Cross Entropy
Patrick Loeber
Select Movies with Python - Web Scraping Tutorial
Patrick Loeber
PyTorch Tutorial 12 - Activation Functions
Patrick Loeber
List Comprehension in Python - A Python Feature You MUST KNOW - Python Tutorial
Patrick Loeber
PyTorch Tutorial 13 - Feed-Forward Neural Network
Patrick Loeber
How To Add A Progress Bar In Python With Just One Line - Python Tutorial
Patrick Loeber
PyTorch Tutorial 14 - Convolutional Neural Network (CNN)
Patrick Loeber
The Walrus Operator - New in Python 3.8 - Python Tutorial
Patrick Loeber
PyTorch Tutorial 15 - Transfer Learning
Patrick Loeber
YouTube Data API Tutorial with Python - Analyze Channel Statistics - Part 1
Patrick Loeber
YouTube Data API Tutorial with Python - Find Channel Videos - Part 2
Patrick Loeber
YouTube Data API Tutorial with Python - Get Video Statistics - Part 3
Patrick Loeber
YouTube Data API Tutorial with Python - Analyze the Data - Part 4
Patrick Loeber
AdaBoost in Python - Machine Learning From Scratch 13 - Python Tutorial
Patrick Loeber
Ultimate FREE Study Guide for Machine Learning and Deep Learning
Patrick Loeber
More on: Unsupervised Learning
View skill →
🎓
Tutor Explanation
DeepCamp AI