StatQuest: Hierarchical Clustering

StatQuest with Josh Starmer · Beginner · 📐 ML Fundamentals · 9y ago

Key Takeaways

The video demonstrates hierarchical clustering, a technique that orders rows and columns based on similarity and is often used with heatmaps and machine learning. It also explains how similarity is determined using distance metrics such as Euclidean distance and Manhattan distance.

Full Transcript

[Music] Going on a quest, on a StatQuest. StatQuest!

Hello, and welcome to StatQuest. Today we're going to be talking about hierarchical clustering. Hierarchical clustering is often associated with heatmaps. If you're not already familiar with what heatmaps are, just know that the columns typically represent different samples and the rows typically represent measurements from different genes. Red typically signifies high expression of a gene, and blue or purple means lower expression.

Hierarchical clustering orders the rows and/or the columns based on similarity. This makes it easy to see correlations in the data. For example, these samples express the same genes, and these genes behave the same. On the left we have a heatmap without hierarchical clustering, and on the right we have a heatmap with hierarchical clustering, so you can see that the clustering makes a big difference in how the data is presented. Heatmaps often come with dendrograms, so we'll talk about those too. Let's get started.

We'll start with a simple example. Here we've got a simple heatmap that has three samples and four genes. For this example, we are just going to cluster, or reorder, the rows (the genes).

Conceptually, the first step is to figure out which gene is most similar to gene number one. Genes one and two are different. We can tell because the colors are very different: gene one is highly expressed in sample number one, so it has a red color. Gene two, however, is not highly expressed in sample number one, so it has a blue color. In sample number three, gene one is lowly expressed, so it's blue, and gene two is highly expressed, so it's red. Genes one and three are similar: in sample one, both gene one and gene three are red (they're highly expressed), and in sample three they're both blue, meaning they're lowly expressed. Genes one and four are also similar; however, gene number one is most similar to gene number three.

The second step is to figure out which gene is most similar to gene number two.
So we do all the comparisons, and we see that gene number two is most similar to gene number four. Then we do the same thing for gene number three, and then gene number four.

In step number three, we look at the different combinations and figure out which two genes are the most similar. Once we've done that, we merge them into a cluster. In this case, genes one and three are more similar than any other combination of genes, so genes one and three are now cluster number one.

Step four: go back to step one, but now treat the new cluster like it's a single gene. So in step one, we figure out which gene is most similar to cluster number one. Cluster number one is most similar to gene number four. Then we figure out which gene is most similar to gene number two. In this case, gene number two is most similar to gene number four, but notice that we also compared gene number two to cluster number one. Then we do the same thing for gene number four. Of the different combinations, we figure out which two are the most similar and merge them into a cluster. In this case, genes two and four are the most similar combination, so we've merged them into a cluster. Now we go back to step one. However, since all we have left are two clusters, we merge them. Bam! We're all done.

Hierarchical clustering is usually accompanied by a dendrogram. It indicates both the similarity and the order in which the clusters were formed. Cluster number one was formed first and is the most similar; it has the shortest branch. Cluster number two was formed second and is the second most similar; it has the second shortest branch. Cluster number three, which contains all of the genes, was formed last; it has the longest branch.

Now let's go over a few nitpicky details. Remember the first step, "figure out which gene is most similar to gene number one"? Well, we have to define what "most similar" means. The method for determining similarity is arbitrarily chosen; however, the Euclidean distance between genes is used a lot. Let's look at an example. We'll use a very
simple heatmap that just has two samples and two genes. Now we're displaying the values that underlie the colors in the heatmap. The Euclidean distance between genes one and two is just the square root of the difference in sample number one squared plus the difference in sample number two squared. Here we'll just plug in the values. For sample number one, we have 1.6 minus -0.5. Now let's plug in the values to calculate the difference in sample number two: we have 0.5 minus -1.9. Doing the subtraction gives us the square root of 2.1² + 2.4².

We can think of the values within the parentheses as sides of a triangle. On the x-axis we have the distance between gene one and gene two in sample number one, and on the y-axis we have the distance between genes one and two in sample number two. The hypotenuse is the distance between genes one and two. The Pythagorean theorem says that the hypotenuse equals the square root of x² + y². In this case that means the square root of 2.1² + 2.4², and that gives us 3.2, the distance between gene number one and gene number two. When we have more samples, we just extend the equation. It's no big deal.

The Euclidean distance is just one method. There are lots more, including the Manhattan distance. The Manhattan distance is just the sum of the absolute values of the differences. So instead of squaring the differences and then taking the square root, all we do is take the absolute value of the differences. We can think of the Manhattan distance in geometric terms by imagining that each difference is a line segment. If we take all those line segments, put them together head to tail, and then add up the total length of all those line segments, that's the Manhattan distance.

Yes, it makes a difference. Here's a heatmap drawn using the Euclidean distance, and here's the same information drawn as a heatmap, but now we're using the Manhattan distance. The heatmaps are very similar, but there are also a few differences. The choice of distance metric is arbitrary.
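
The arithmetic in this example can be checked with a short Python sketch. The function names are illustrative, and the value -0.5 for gene two in sample one is an assumption inferred from the stated difference of 2.1; the other values are the ones read out in the transcript.

```python
import math

# Values from the two-gene, two-sample example.
# Assumption: gene 2 in sample 1 is -0.5, which is what the
# stated difference of 2.1 (1.6 minus -0.5) implies.
gene1 = [1.6, 0.5]    # gene 1 in sample 1 and sample 2
gene2 = [-0.5, -1.9]  # gene 2 in sample 1 and sample 2


def euclidean(a, b):
    """Square root of the summed squared per-sample differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute per-sample differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(round(euclidean(gene1, gene2), 1))  # 3.2
print(round(manhattan(gene1, gene2), 1))  # 4.5
```

With more samples, both functions work unchanged because they loop over however many per-sample differences there are.
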
There is no biological or physical reason to choose one and not the other. Pick the one that gives you more insight into your data.

Now, do you remember how we merged genes one and three into cluster number one and compared it to other genes? Well, there are different ways to compare clusters, too. One simple idea is to use the average of the measurements from each sample, but there are lots more, and these have an effect on clustering as well. So let's talk about the different ways to compare clusters.

For the sake of visualizing how the different methods work, imagine our data was spread out on an XY plane. Now imagine that we have already formed these two clusters and we just want to figure out which cluster this last point belongs to. We can compare that point to the average of each cluster; this is called the centroid. We can compare it to the closest point in each cluster; this is called single linkage. Or we can compare it to the furthest point in each cluster; this is called complete linkage. And there are other methods as well.

Here's a heatmap that compares the furthest points in the clusters. By the way, if you use R, this is the default setting for the hclust function. This heatmap compares the average points in the clusters, and this last heatmap compares the closest points in the clusters. These heatmaps are all very similar, but there are also differences in the way the data is presented.

In summary, clusters are formed based on some notion of similarity. You have to decide what that is; however, most programs have reasonable defaults. Once you have a subcluster, you have to decide how it should be compared to other rows, columns, or subclusters, etc., and most programs have good default settings for this as well. And the height of the branches in the dendrogram shows you what is most similar.

Hooray! We've made it to the end of another exciting StatQuest. If you liked this presentation, please subscribe to my channel and you'll get more like it. Also, if you'd like me to do something specific, feel free to mention it in the comments below.
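
The three ways of comparing a point to a cluster described in the transcript (centroid, single linkage, complete linkage) can be sketched in Python as follows. The points and all function names are hypothetical, chosen only to show that the methods can disagree about which cluster is closest.

```python
import math


def dist(p, q):
    """Euclidean distance between two points on the XY plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def centroid_distance(point, cluster):
    """Distance from the point to the average of the cluster."""
    cx = sum(p[0] for p in cluster) / len(cluster)
    cy = sum(p[1] for p in cluster) / len(cluster)
    return dist(point, (cx, cy))


def single_linkage_distance(point, cluster):
    """Distance from the point to the closest point in the cluster."""
    return min(dist(point, p) for p in cluster)


def complete_linkage_distance(point, cluster):
    """Distance from the point to the furthest point in the cluster."""
    return max(dist(point, p) for p in cluster)


# Hypothetical data: two existing clusters and one unassigned point.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 1.0), (10.0, 1.0)]
new_point = (2.5, 1.0)


def closest(method):
    """Name of the cluster that the given method considers nearer."""
    a = method(new_point, cluster_a)
    b = method(new_point, cluster_b)
    return "A" if a < b else "B"


print("centroid:", closest(centroid_distance))          # centroid: A
print("single:  ", closest(single_linkage_distance))    # single:   B
print("complete:", closest(complete_linkage_distance))  # complete: A
```

Here single linkage assigns the point to cluster B because one member of B is very close, while the centroid and complete-linkage comparisons prefer the more compact cluster A.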

Original Description

Hierarchical clustering is often used with heatmaps and with machine learning type stuff. It's no big deal, though, and based on just a few simple concepts. If you want to draw a heatmap using R, I've put some sample code on my website: https://statquest.org/statquest-hierarchical-clustering/ For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/ If you'd like to support StatQuest, please consider... Patreon: https://www.patreon.com/statquest ...or... YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join ...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store... https://statquest.org/statquest-store/ ...or just donating to StatQuest! https://www.paypal.me/statquest Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer #statquest #ML #clustering

This video teaches hierarchical clustering, a technique used to visualize correlation in data, and explains how to determine similarity using distance metrics. It provides a step-by-step guide on how to apply hierarchical clustering and interpret the results.

Key Takeaways
  1. Find the most similar gene to a given gene
  2. Merge genes into clusters
  3. Go back to the first step with the new cluster
  4. Repeat the process until all genes are in a single cluster
  5. Choose a distance metric such as Euclidean distance or Manhattan distance
  6. Decide on a method for comparing clusters such as single linkage or complete linkage
💡 The choice of distance metric is arbitrary and depends on the insight it provides into the data.
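
The steps above can be sketched as a small, unoptimized Python loop. This is a minimal illustration rather than how any particular library implements it: the expression values are hypothetical (picked so the merge order matches the example in the video), and the use of Euclidean distance with complete linkage is an assumption.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two genes' expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, data):
    """Complete linkage (assumed): distance between the furthest members."""
    return max(euclidean(data[g], data[h]) for g in c1 for h in c2)


def hierarchical_clustering(data):
    """Repeatedly merge the two most similar clusters until one remains.

    Returns a list of (merged_cluster, height) in the order of merging.
    """
    clusters = [(name,) for name in data]  # every gene starts on its own
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: compare all pairs and pick the most similar one.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], data),
        )
        height = cluster_distance(a, b, data)
        # Steps 3-4: merge the pair and treat it as a single unit.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a + b, height))
    return merges


# Hypothetical expression values for four genes in three samples.
data = {
    "gene1": [2.0, 0.0, -2.0],
    "gene2": [-2.0, 0.0, 2.0],
    "gene3": [1.8, 0.0, -2.0],
    "gene4": [-1.0, 0.0, 1.0],
}

merges = hierarchical_clustering(data)
for members, height in merges:
    print(members, round(height, 2))
```

With these values, genes 1 and 3 merge first, genes 2 and 4 merge second, and the two clusters merge last. The recorded heights increase with each merge, which is what the branch lengths in a dendrogram represent.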
