StatQuest: K-nearest neighbors, Clearly Explained

StatQuest with Josh Starmer · Beginner · 📄 Research Papers Explained · 9y ago

Key Takeaways

The video explains the K-nearest neighbors algorithm, a simple and effective method for classifying data, using examples with PCA and hierarchical clustering.

Full Transcript

[Music] StatQuest, StatQuest, StatQuest! Hello, and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we're going to be talking about the K-nearest neighbors algorithm, which is a super simple way to classify data. In a nutshell, if you already had a lot of data that defined these cell types, we could use it to decide which type of cell this guy is. Let's see it in action.

Step one: start with a data set with known categories. In this case, we have different cell types from an intestinal tumor. We then cluster that data. In this case, we used PCA.

Step two: add a new cell with unknown category to the plot. We don't know this cell's category because it was taken from another tumor where the cells were not properly sorted, and so what we want to do is classify this new cell. We want to figure out what cell it's most similar to, and then we're going to call it that type of cell.

Step three: we classify the new cell by looking at the nearest annotated cells, i.e., the nearest neighbors. If the K in K-nearest neighbors is equal to 1, then we will only use the nearest neighbor to define the category. In this case, the category is green, because the nearest neighbor is already known to be the green cell type. If K equals 11, we would use the 11 nearest neighbors. In this case, the category is still green, because the 11 cells that are closest to the unknown cell are already green.

Now the new cell is somewhere more interesting. It's about halfway between the green and the red cells. If K equals 11 and the new cell is between two or more categories, we simply pick the category that gets the most votes. In this case, seven nearest neighbors are red, three nearest neighbors are orange, and one nearest neighbor is green. Since red got the most votes, the final assignment is red.

This same principle applies to heatmaps. This heatmap was drawn with the same data and clustered using hierarchical clustering. If our new cell ended up in the middle of the light blue cluster, and if K equals 1, we just look at the nearest cell, and that cell is light blue, so we classify the unknown cell as a light blue cell. If K equals 5, we'd look at the five nearest cells, which are also light blue, so we'd still classify the unknown cell as light blue. If the new cell ended up closer to the edge of the light blue cells and K equals 11, then we take a vote: seven nearest neighbors are light blue and four are light green, so we'd still go with light blue. If the new cell is right between two categories, well, if K is odd, then we can avoid a lot of ties. If we still get a tied vote, we can flip a coin or decide not to assign the cell to a category.

Before we go, let's talk about a little machine learning/data mining terminology. The data used for the initial clustering, the data where we know the categories in advance, is called training data. Bam!

A few thoughts on picking a value for K. There is no physical or biological way to determine the best value for K, so you may have to try out a few values before settling on one. Do this by pretending part of the training data is unknown, and then what you do is you categorize that unknown data using the K-nearest neighbors algorithm, and you assess how good the new categories match what you know already. Low values for K, like K equals 1 or K equals 2, can be noisy and subject to the effects of outliers. Large values for K smooth over things, but you don't want K to be so large that a category with only a few samples in it will always be outvoted by other categories.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest, go ahead and subscribe to my channel and you'll see more like it. And if you have any ideas of things you'd like me to do a StatQuest on, feel free to put those ideas in the comments. Okay, guess that's it. Tune in next time for another exciting StatQuest.
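The three steps in the transcript can be sketched in a few lines of Python. This is a minimal illustration, not code from the video: the function name `knn_classify`, the use of Euclidean distance, and the 2-D coordinates standing in for cells on a PCA plot are all assumptions made for the example. On a tied vote it returns `None`, which corresponds to the "decide not to assign the cell to a category" option.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(training, new_point, k):
    """Return the majority label among the k nearest training points.

    training: list of (coordinates, label) pairs with known categories.
    Returns None when the top vote is tied.
    """
    # Sort the labelled points by distance to the new point and keep the k closest.
    nearest = sorted(training, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest).most_common()
    # If the two leading categories have the same count, nobody wins.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]


# Made-up coordinates (hypothetical data, not taken from the video).
cells = [
    ((1.0, 1.0), "green"), ((1.2, 0.8), "green"), ((0.8, 1.1), "green"),
    ((4.0, 4.0), "red"), ((4.2, 3.9), "red"), ((3.8, 4.1), "red"), ((4.1, 4.3), "red"),
    ((1.0, 4.0), "orange"), ((1.1, 4.2), "orange"),
]

print(knn_classify(cells, (1.1, 1.0), k=1))  # new cell inside the green cluster
print(knn_classify(cells, (3.0, 3.0), k=5))  # new cell between clusters: take a vote
```

With this toy data the first call looks only at the single closest cell, which is green, and the second call finds four red neighbors and one orange neighbor, so red wins the vote.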

Original Description

Machine learning and Data Mining sure sound like complicated things, but that isn't always the case. Here we talk about the surprisingly simple and surprisingly effective K-nearest neighbors algorithm.

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
https://statquest.org/statquest-store/
...or just donating to StatQuest!
https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer

0:00 Awesome song and introduction
0:21 K-NN overview
0:44 K-NN applied to scatterplot data
2:44 K-NN applied to a heatmap
4:12 Thoughts on how to pick 'K'

#statquest #KNN #ML


The K-nearest neighbors algorithm is a simple method for classifying data by finding the nearest neighbors to a new data point. The value of K can be adjusted to balance noise and smoothing.
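The trade-off between noise and smoothing can be seen on a small made-up example. The sketch below is an illustration under stated assumptions: the helper `predict`, the coordinates, and the labels are all hypothetical. The data contain one red cell sitting inside the green cluster (an outlier) and a small orange category with only two cells.

```python
from collections import Counter
from math import dist


def predict(training, new_point, k):
    """Label with the most votes among the k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training data.
training = [
    ((0.0, 0.0), "green"), ((0.5, 0.0), "green"), ((0.0, 0.5), "green"),
    ((0.5, 0.5), "green"), ((1.0, 0.0), "green"), ((1.0, 1.0), "green"),
    ((0.3, 0.3), "red"),                          # outlier inside the green cluster
    ((8.0, 0.0), "red"), ((8.5, 0.0), "red"), ((8.0, 0.5), "red"),
    ((3.0, 3.0), "orange"), ((3.2, 3.0), "orange"),  # category with few samples
]

near_outlier = (0.3, 0.25)  # new cell next to the red outlier
near_orange = (3.1, 3.0)    # new cell right between the two orange cells

for k in (1, 3, 5):
    print(k, predict(training, near_outlier, k), predict(training, near_orange, k))
```

On this toy data, k = 1 follows the outlier and calls the first cell red, while k = 3 and k = 5 call it green. For the second cell, k = 1 and k = 3 give orange, but at k = 5 the two orange cells are outvoted by three green neighbors, which is the "small category always outvoted" effect of a K that is too large.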

Key Takeaways
  1. Start with a dataset with known categories
  2. Cluster the data using PCA or hierarchical clustering
  3. Add a new data point with unknown category
  4. Classify the new data point by looking at the K nearest neighbors
  5. Adjust the value of K to optimize performance
💡 The choice of K is crucial in the K-nearest neighbors algorithm, and there is no one-size-fits-all solution
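One way to "adjust the value of K" is to pretend part of the training data is unknown and check how often the algorithm recovers the known category. The sketch below uses leave-one-out evaluation, which is one possible choice and not necessarily what the video had in mind; the function names, candidate K values, and coordinates are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def predict(training, new_point, k):
    """Label with the most votes among the k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hide each point in turn, classify it from the rest, and score the matches."""
    correct = 0
    for i, (point, true_label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # training data without the hidden point
        if predict(rest, point, k) == true_label:
            correct += 1
    return correct / len(data)


# Hypothetical labelled data: two clusters plus one red outlier among the greens.
data = [
    ((0.0, 0.0), "green"), ((1.0, 0.0), "green"), ((0.0, 1.0), "green"),
    ((1.0, 1.0), "green"), ((0.5, 0.5), "green"),
    ((5.0, 5.0), "red"), ((6.0, 5.0), "red"), ((5.0, 6.0), "red"),
    ((6.0, 6.0), "red"), ((5.5, 5.5), "red"),
    ((0.5, 0.4), "red"),  # outlier
]

# Odd candidate values avoid ties when there are only two categories.
scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5, 9)}
best_k = max(scores, key=scores.get)  # first K with the highest accuracy

for k, accuracy in scores.items():
    print(f"k={k}: {accuracy:.2f}")
print("best k:", best_k)
```

On this toy data the smallest K is hurt by the outlier and the largest K lets the bigger category outvote the smaller one, so a middle value scores best. With real data the scores depend on the data set, so several values are usually worth trying.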

Chapters (5)

0:00 Awesome song and introduction
0:21 K-NN overview
0:44 K-NN applied to scatterplot data
2:44 K-NN applied to a heatmap
4:12 Thoughts on how to pick 'K'