K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial

Patrick Loeber · Beginner ·📐 ML Fundamentals ·6y ago

Skills: Unsupervised Learning90%ML Maths Basics80%

Key Takeaways

The video demonstrates the implementation of the K-Means clustering algorithm using Python and NumPy, covering the concept, math, and iterative optimization technique behind this popular ML algorithm. The tutorial provides a step-by-step guide on how to initialize cluster centers, update cluster labels and centroids, and determine convergence.

Full Transcript

hi everybody and welcome to a new machine learning from scratch tutorial today we are going to implement the k-means algorithm using only built-in Python modules and numpy the goal of the k-means is to cluster a data set into K different clusters and here we have a unlabeled data set so we are dealing with an unsupervised learning technique and each sample should be assigned to the cluster with the nearest mean so let's have a look at some images to see what this means so here we have our unlabeled data and now in this case we want to find three different clusters so it should look like this and then we assign the labels to the closest cluster so to the center of the closest cluster so yeah this is what we want to do and how are we going to achieve this so this is an iterative optimization technique so first we initialize our cluster centers so we randomly pick some samples and say these are our first centers and then we do these two steps until we are converged so first we update our cluster labels which means we assign the points to the nearest cluster Center and the cluster Center is also called centroid and next we update our centroids so now we set the center the new center to the mean of each cluster and we iteratively do this until there's no more change so let's again have a look at the images to see how this is working so first we have our unlabeled data set and now we randomly pick three centroids so here I've drawn them so I hope you can see them so these are our initial centroids and now we assign the labels to the of the data to the to the label of the closest centroid so this is our first initialization now we start optimizing so now we update our centroids we calculate the new mean for each cluster so I think this centroid will be moved to something like here and this will maybe maybe be here and the Green centroid will be moved to maybe here so let's see what's happening yeah so this is the new centroids and now we update our label our label so now we check which is the closest centroid for each label so maybe these will become blue and more of these will become orange so yeah this is the next step and now again we update our centroids so I think this one will further move two to the right and this one will move up here and this one will maybe stay the same so yeah this is the next step and now again we update the labels so I think these will become orange now and these will become blue and now we are almost done so maybe here are some slight shifts so again update the centroids then again the labels then the centroids then the labels and now there's no more change so now we are converged and yeah so this is the whole approach and the only math that we need for this is the Euclidean distance so I already showed you this in the tutorial about the K nearest neighbor algorithm the Euclidean distance between two vectors is defined as the square root of the sum over the squared distances so this is all we need so now we can start so let's import numpy s and P of course and then I will say I will set a random seat to let's say 40 - you don't need this but I want to rep my data later and since we are using a random initialization I want to to have the same results or I want to to reproduce my results so I will set a random seat here and now let's define our euclidean distance function first so Euclidean distance of two vectors X 1 and X 2 so this will be a global function and here we have to implement the formula that I just showed you and we can do this in one line so we say return and numpy first we have the square root off and then we have the sum numpy sum and then we have the sum over all the distances so we can say X 1 minus X 2 to the power of 2 so the squared distances so this is the function and now we can implement our k-means class so this will have an init which has self and then it will get a K so this will be the number of clusters and by default let's say it's 5 then it will get a max ITER's so this is the maximum number of iterations we want to do for our optimization and by default let's say it's 100 and then it will also get a boolean plot steps equals false so you don't need this but I'm going to implement an additional function here to plot the different steps like we've just seen so yeah so first we set them or we store them so we say self dot K equals K self dot max ITER's equals max ITER's and self dot plot steps equals plot steps and now we create our empty clusters and centroids so self dot clusters equals oops equals and now here we want to be careful because this is important so this is a list of lists so a list of sample indices for each cluster and in the beginning each cluster has an empty list so we say and we use list comprehension and then say we say have an empty list for underscore in range self dot k so for each cluster we initialize an empty list and then we say our self dot centroids equals and this will also be an empty list and here we are going to store the feature vectors or the mean feature vector for each cluster so mean feature vector for each cluster so here we are having actual samples and here we just stored the indices so this is important and now we can continue so now usually we would implement a fit and a predict method but since we are having a unsupervised learning technique here we and we don't have any label data we just have to implement the predict method so and we don't need this the fit method so let's define our predict method with self and X and here first let's store our data self x equals x and then the dimensions so self number of samples and self dot number of features equals x dot shape because we need this later and as always this is a numpy and d array and so yeah so now let's do the steps we just talked about so first we have to initialize our our centroids and then we do a optimization so here we can say for underscore in range self dot max ITER's and now here in our loop first we update our clusters so let's say update clusters and then we update the centroids and then we check for convergence check if converged and if so then we break and at the end we want to classify the samples as the index of their clusters so here we say return returned cluster labels so this is what we have to do so now let's start so let's say we want to initialize our centroids so we want to randomly pick some samples so let's say random sample indices equals and now we use list comprehension so or no sorry here I can use numpy dot random dot choice and this will get self dot number of samples and self dot k and we also have to say replace equals false because we don't want to pick the same indices twice so this will be an array of size self dot k and for each entry it will pick a random choice between zero and the number of samples and now we assign m the according sample that belongs to this index to our centroids so we say self table centroids equals and now you use list comprehension so we say self dot X of the current index for index in random sample indices so this is the initialization of our centroids and now we can do the optimization so first we say we update our clusters so we say self dot clusters equals self dot create clusters and this will get self table centroids so this is a helper function create clusters that we are going to implement now so define create clusters with self and the centroids so here we assign the samples to the closest centroids to create our clusters so first we have an empty list of Lists for our clusters and now we iterate over our data so we say for index and sample in enumerate self dot X so this enumerate function will give us the current index and the current sample now we want to get the closest centroid and we want to have the index of this and we say centroid index equals self dot closest centroid and this we have get the current sample and then it will get the centroid so this will be another helper function that will we'll create in a second but now let's continue here so when we have the centroid index we append or we take the current cluster so the clusters of this centroid index and then we append the current index so we put the current sample index in the closest cluster and then we return our clusters so this is how we create our clusters and now we need the define closest centroid function which gets self and a sample and the centroids and here we calculate the distances of the current sample to each centroid and then want to get the the centroid or the index of the centroid which has the the closest distance so let's calculate all the distance since with list comprehension so here we use the Euclidean distance function that we already have so Euclidean distance of a sample and of each centroid point so point four point in centroids and then we want me so we have all the distances here and now we want to see which is the minimum or the index with the minimum distance so we can use numpy art min here so we say closest index equals numpy art min of this distances and then we simply return it so return the closest index so now we have this and now we created our clusters and now we can continue with our optimization so here we up our centroids as the next steps but before we want to store the centroid so let's say centroids volts equals self dot centroids so that we can check for conversions later and then we say self dot centroids equals self dot get centroids of this and this will get self clusters so this is another function this will assign the mean value of the clusters to the centroid so for each cluster we now calculate the mean so let's define this define get nose define get centroids which get self and the clusters so here we initialize our centroids with zeroes in the beginning so let's say centroids equals numpy zeros and this will be of size self k and self dot and features and here this should be a two pulse and we have to be careful here so for each cluster we will store the feature electron so that's why it has to have this dimensions and now we iterate over the clusters so we say for cluster index and cluster in enumerate clusters and then we calculate the cluster means a cluster mean equals numpy of our self dot X of this cluster indices and this should be along the first axis so again let's have a look at what this means so our as I said our clusters is a list of lists so if we have just a current cluster then this is a list and this is a list of the indices that are in this cluster so if we call self dot X with this indices then it will only return the samples that are in the current clusters in the current cluster and then we calculate the mean so this is what what's going to happen here and now once we have our mean we assign it to the current cluster so to the current centroid so we say centroids of the current cluster index equals cluster mean and then we are done and can return the centroids so yeah so now we have our new centroids and now we check if we are converged so we say if self dot is converged then we will break so we can stop here and this will get the centroids old and the new new ones so self dot centroids and this is another helper function this will simply calculate the distances between each old and new centroids for all the centroids and check if this is 0 so define is converged self and centroids old and new centroids so again we calculate the distances with list comprehension henshin this turn says equals and here we calculate calculate the Euclidean distance of centroids old of I so the current centroid vector and the new one so centroids I for I in range self dot K so for each cluster it will look at the old and the new centroids vector and calculate the Euclidean distance and store it in this list and then we can return sum of distances equals equals zero so this is a built-in function that will iterate over this entries and calculate or sum it up and so if this is zero then there is no more change in our centroids so we say it is converged so yeah so now we have this and now we want to return the cluster labels so let's say return self dot get cluster labels and this will get self dot clusters so here for each sample we will get the label of the cluster it was assigned to so let's create this right here so define get cluster labels with self and the clusters and so first we say our labels equals a number I empty array of size [Music] self-taught number of samples so for each sample we want to return which is the the cluster it was assigned to but be careful here because these labels are not the actual labels of our data because we don't know them so this is just the index of the cluster it was assigned to so yeah so now we iterate over the clusters so for cluster index and also cluster in enumerate clusters and then we iterate over the current cluster so for sample index in cluster so again this is a list of Lists and each cluster has a list has all the the sample indices of the samples that are in this cluster so for each sample index in this current cluster we say labels of the sample index equals the current cluster index and then we return our labels and now we are done so one more thing that I want to implement but you don't need this is to have a plot function so define plot and self and here we are going to use matplotlib so I can say import much plot loop i plot as PLT and i'm not going to explain the details here but if you want to see more tutorials about my plot lip then please leave some comments so let's implement the plot method so here I simply want to plot and take the data and to which a cluster it belongs and then also the centroids so let's create our figure so thick and x equals P LT dots up plots and let's give this a fixed size off size let's say 12 by 8 and then we iterate over our clusters so for I and index in cell in in numerate self dot clusters and now we get the current points of point equals self dot X off the index but we have to transpose it here and now we scatter the point so a X dot scatter and here I unpack at the point and now so this will plot all the points and and for each cluster it will use a different color and now we do the we plot all the centroids so for point in self dot centroids a X dot scatter and again we unfold our point and then we say marker equals x so it has a marker sign and color equals black and line with equals to so and then we have to say PLT dot show of course and so this is the plot function and now during our optimization if we set plot steps to true then we want to plot after we updated our clusters so we say if self dot plot steps then self dot plot and again we also want to plot after we updated our centroids so we put it right here and now we are done so now let's run this and I hope that everything is working so let's clear this and run our script and oh I missed a comma here 937 update centroids of centroids all equals the current ones so again let's try this and now it's working so here we have I created a dataset with four different clusters and we can see that it correctly could you identify them so let's set plot steps to true and don't plot it at the end and now let's run it again to see the different steps so here this is after our initialization so we randomly picked some samples and set these are our first centroids so maybe the initialization is not very good here but it could find out the clusters correctly later so let's see what's happening so now we are so then we are starting to optimize so now in the next step we are calculating the new centroids so I think the the centroid of the blue cluster is moving to maybe over here and the orange centroid is moving to down here so let's look at the next step so yeah and now we are updating the labels so I think these here are no longer orange but red and green so yeah and also some of them are now red and now we update our centroids again so I think this we'll further move over here this will move down here and these will move a little bit so yeah and now we again update our labels so I think more of them will become red and more of them here will become green and so and now again we update our centroids so I think this will further move over here the red centroid will move over here this is already good and the green one or moves over here I think so yep that's that and now again we update the labels now again the centroids and the labels and now I think we are converged so yeah this is how the k-means works and I hope you understood everything and if you liked it please subscribe to the channel and see you next time bye

Original Description

Get my Free NumPy Handbook: https://www.python-engineer.com/numpybook In this Machine Learning from Scratch Tutorial, we are going to implement a K-Means algorithm using only built-in Python modules and numpy. We will also learn about the concept and the math behind this popular ML algorithm. ~~~~~~~~~~~~~~ GREAT PLUGINS FOR YOUR CODE EDITOR ~~~~~~~~~~~~~~ ✅ Write cleaner code with Sourcery: https://sourcery.ai/?utm_source=youtube&utm_campaign=pythonengineer * 📓 Notebooks available on Patreon: https://www.patreon.com/patrickloeber ⭐ Join Our Discord : https://discord.gg/FHMg9tKFSN If you enjoyed this video, please subscribe to the channel! The code can be found here: https://github.com/patrickloeber/MLfromscratch You can find me here: Website: https://www.python-engineer.com Twitter: https://twitter.com/patloeber GitHub: https://github.com/patrickloeber #Python #MachineLearning ---------------------------------------------------------------------------------------------------------- * This is a sponsored link. By clicking on it you will not have any additional costs, instead you will support me and my project. Thank you so much for the support! 🙏

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Patrick Loeber · Patrick Loeber · 33 of 60

← Previous Next →

Lists in Python - Advanced Python 01 - Programming Tutorial

Lists in Python - Advanced Python 01 - Programming Tutorial

Tuples in Python - Advanced Python 02 - Programming Tutorial

Tuples in Python - Advanced Python 02 - Programming Tutorial

Dictionaries in Python - Advanced Python 03 - Programming Tutorial

Dictionaries in Python - Advanced Python 03 - Programming Tutorial

Sets in Python - Advanced Python 04 - Programming Tutorial

Sets in Python - Advanced Python 04 - Programming Tutorial

Strings in Python - Advanced Python 05 - Programming Tutorial

Strings in Python - Advanced Python 05 - Programming Tutorial

Collections in Python - Advanced Python 06 - Programming Tutorial

Collections in Python - Advanced Python 06 - Programming Tutorial

Itertools in Python - Advanced Python 07 - Programming Tutorial

Itertools in Python - Advanced Python 07 - Programming Tutorial

Lambda in Python - Advanced Python 08 - Programming Tutorial - Map Filter Reduce

Lambda in Python - Advanced Python 08 - Programming Tutorial - Map Filter Reduce

Exceptions in Python - Advanced Python 09 - Programming Tutorial

Exceptions in Python - Advanced Python 09 - Programming Tutorial

Logging in Python - Advanced Python 10 - Programming Tutorial

Logging in Python - Advanced Python 10 - Programming Tutorial

JSON in Python - Advanced Python 11 - Programming Tutorial

JSON in Python - Advanced Python 11 - Programming Tutorial

Random Numbers in Python - Advanced Python 12 - Programming Tutorial

Random Numbers in Python - Advanced Python 12 - Programming Tutorial

Decorators in Python - Advanced Python 13 - Programming Tutorial

Decorators in Python - Advanced Python 13 - Programming Tutorial

Generators in Python - Advanced Python 14 - Programming Tutorial

Generators in Python - Advanced Python 14 - Programming Tutorial

Threading vs Multiprocessing in Python - Advanced Python 15 - Programming Tutorial

Threading vs Multiprocessing in Python - Advanced Python 15 - Programming Tutorial

Threading in Python - Advanced Python 16 - Programming Tutorial

Threading in Python - Advanced Python 16 - Programming Tutorial

Multiprocessing in Python - Advanced Python 17 - Programming Tutorial

Multiprocessing in Python - Advanced Python 17 - Programming Tutorial

Function arguments in detail - Advanced Python 18 - Programming Tutorial

Function arguments in detail - Advanced Python 18 - Programming Tutorial

The asterisk (*) operator in Python - Advanced Python 19 - Programming Tutorial

The asterisk (*) operator in Python - Advanced Python 19 - Programming Tutorial

Shallow vs Deep Copying in Python - Advanced Python 20 - Programming Tutorial

Shallow vs Deep Copying in Python - Advanced Python 20 - Programming Tutorial

Context Managers in Python - Advanced Python 21 - Programming Tutorial

Context Managers in Python - Advanced Python 21 - Programming Tutorial

KNN (K Nearest Neighbors) in Python - Machine Learning From Scratch 01 - Python Tutorial

KNN (K Nearest Neighbors) in Python - Machine Learning From Scratch 01 - Python Tutorial

Linear Regression in Python - Machine Learning From Scratch 02 - Python Tutorial

Linear Regression in Python - Machine Learning From Scratch 02 - Python Tutorial

Logistic Regression in Python - Machine Learning From Scratch 03 - Python Tutorial

Logistic Regression in Python - Machine Learning From Scratch 03 - Python Tutorial

Linear and Logistic Regression in 60 lines of Python - Machine Learning From Scratch 04

Linear and Logistic Regression in 60 lines of Python - Machine Learning From Scratch 04

Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial

Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial

Perceptron in Python - Machine Learning From Scratch 06 - Python Tutorial

Perceptron in Python - Machine Learning From Scratch 06 - Python Tutorial

SVM (Support Vector Machine) in Python - Machine Learning From Scratch 07 - Python Tutorial

SVM (Support Vector Machine) in Python - Machine Learning From Scratch 07 - Python Tutorial

Decision Tree in Python Part 1/2 - Machine Learning From Scratch 08 - Python Tutorial

Decision Tree in Python Part 1/2 - Machine Learning From Scratch 08 - Python Tutorial

Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial

Decision Tree in Python Part 2/2 - Machine Learning From Scratch 09 - Python Tutorial

Random Forest in Python - Machine Learning From Scratch 10 - Python Tutorial

Random Forest in Python - Machine Learning From Scratch 10 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial

K-Means Clustering in Python - Machine Learning From Scratch 12 - Python Tutorial

Anaconda Tutorial - Installation and Basic Commands

Anaconda Tutorial - Installation and Basic Commands

PyTorch Tutorial 01 - Installation

PyTorch Tutorial 01 - Installation

PyTorch Tutorial 02 - Tensor Basics

PyTorch Tutorial 02 - Tensor Basics

PyTorch Tutorial 03 - Gradient Calculation With Autograd

PyTorch Tutorial 03 - Gradient Calculation With Autograd

PyTorch Tutorial 04 - Backpropagation - Theory With Example

PyTorch Tutorial 04 - Backpropagation - Theory With Example

PyTorch Tutorial 05 - Gradient Descent with Autograd and Backpropagation

PyTorch Tutorial 05 - Gradient Descent with Autograd and Backpropagation

PyTorch Tutorial 06 - Training Pipeline: Model, Loss, and Optimizer

PyTorch Tutorial 06 - Training Pipeline: Model, Loss, and Optimizer

PyTorch Tutorial 07 - Linear Regression

PyTorch Tutorial 07 - Linear Regression

PyTorch Tutorial 08 - Logistic Regression

PyTorch Tutorial 08 - Logistic Regression

PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training

PyTorch Tutorial 09 - Dataset and DataLoader - Batch Training

PyTorch Tutorial 10 - Dataset Transforms

PyTorch Tutorial 10 - Dataset Transforms

Download Images With Python Automatically - Python Web Scraping Tutorial

Download Images With Python Automatically - Python Web Scraping Tutorial

PyTorch Tutorial 11 - Softmax and Cross Entropy

PyTorch Tutorial 11 - Softmax and Cross Entropy

Select Movies with Python - Web Scraping Tutorial

Select Movies with Python - Web Scraping Tutorial

PyTorch Tutorial 12 - Activation Functions

PyTorch Tutorial 12 - Activation Functions

List Comprehension in Python - A Python Feature You MUST KNOW - Python Tutorial

List Comprehension in Python - A Python Feature You MUST KNOW - Python Tutorial

PyTorch Tutorial 13 - Feed-Forward Neural Network

PyTorch Tutorial 13 - Feed-Forward Neural Network

How To Add A Progress Bar In Python With Just One Line - Python Tutorial

How To Add A Progress Bar In Python With Just One Line - Python Tutorial

PyTorch Tutorial 14 - Convolutional Neural Network (CNN)

PyTorch Tutorial 14 - Convolutional Neural Network (CNN)

The Walrus Operator - New in Python 3.8 - Python Tutorial

The Walrus Operator - New in Python 3.8 - Python Tutorial

PyTorch Tutorial 15 - Transfer Learning

PyTorch Tutorial 15 - Transfer Learning

YouTube Data API Tutorial with Python - Analyze Channel Statistics - Part 1

YouTube Data API Tutorial with Python - Analyze Channel Statistics - Part 1

YouTube Data API Tutorial with Python - Find Channel Videos - Part 2

YouTube Data API Tutorial with Python - Find Channel Videos - Part 2

YouTube Data API Tutorial with Python - Get Video Statistics - Part 3

YouTube Data API Tutorial with Python - Get Video Statistics - Part 3

YouTube Data API Tutorial with Python - Analyze the Data - Part 4

YouTube Data API Tutorial with Python - Analyze the Data - Part 4

AdaBoost in Python - Machine Learning From Scratch 13 - Python Tutorial

AdaBoost in Python - Machine Learning From Scratch 13 - Python Tutorial

Ultimate FREE Study Guide for Machine Learning and Deep Learning

Ultimate FREE Study Guide for Machine Learning and Deep Learning

This video tutorial teaches the implementation of the K-Means clustering algorithm using Python and NumPy, covering the concept, math, and iterative optimization technique behind this popular ML algorithm. By following the steps outlined in the tutorial, viewers can learn how to apply K-Means clustering to datasets and determine convergence of the clustering algorithm.

Key Takeaways

Initialize cluster centers by randomly picking samples
Update cluster labels by assigning points to the nearest cluster center
Update centroids by setting the new center to the mean of each cluster
Repeat steps until converged
Calculate Euclidean distance between old and new centroids for all centroids
Check if the sum of distances is zero to determine convergence
Return the cluster labels by iterating over the clusters and assigning each sample to its corresponding cluster index

💡 The K-Means clustering algorithm is an iterative optimization technique that updates cluster labels and centroids until converged, and can be implemented using Python and NumPy.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Unsupervised Learning

View skill →

How to implement K-Means from scratch with Python

How to implement K-Means from scratch with Python

K-Means Clustering - The Math of Intelligence (Week 3)

K-Means Clustering - The Math of Intelligence (Week 3)

Mean Shift with Titanic Dataset - Practical Machine Learning Tutorial with Python p.40

Mean Shift with Titanic Dataset - Practical Machine Learning Tutorial with Python p.40

Self-/Unsupervised GNN Training

Self-/Unsupervised GNN Training

Statistical Learning: 12.R.3 Hierarchical Clustering

Statistical Learning: 12.R.3 Hierarchical Clustering

Stanford Online

Clustering with DBSCAN, Clearly Explained!!!

Clustering with DBSCAN, Clearly Explained!!!

StatQuest with Josh Starmer

Related AI Lessons

Stop Overfitting With Basically One Line of Code

Learn to prevent overfitting with a simple code tweak and understand the difference between Ridge and Lasso regression

Stop Overfitting With Basically One Line of Code

Learn to prevent overfitting in machine learning models with a simple code tweak and understand the difference between Ridge and Lasso regression

Medium · Machine Learning

Why Your Python Loops Are Creating the Wrong Functions

Learn why Python loops create functions with the same value and how to fix it using default argument capture and factory functions

Answer Calculator: Step-by-Step Math Help

Learn to use an Answer Calculator for step-by-step math help, making it a valuable tool for late-night studying or work

Learn Deep Learning by Hand (Beginner's Guide - Part 1)