Statistical Learning: 12.6 Breast Cancer Example
Statistical Learning, featuring Deep Learning, Survival Analysis and Multiple Testing
Trevor Hastie, Professor of Statistics and Biomedical Data Sciences at Stanford University - https://statistics.stanford.edu/people/trevor-j-hastie
Robert Tibshirani, Professor of Statistics and Biomedical Data Sciences at Stanford University - https://statistics.stanford.edu/people/robert-tibshirani
Jonathan Taylor, Professor of Statistics at Stanford University - https://statistics.stanford.edu/people/jonathan-taylor
You can take Statistical Learning as an online course on edX, and you can choose a verified path to earn a certificate of completion. The course is available in R (https://www.edx.org/course/statistica) or in Python (https://www.edx.org/learn/data-analysis-statistics/stanford-university-statistical-learning-with-python).
For more information about courses on Statistics, you can browse our Stanford Online Catalog: https://stanford.io/3QHRi72
What You'll Learn
The video discusses statistical learning using the example of breast cancer, covering topics such as hierarchical clustering, gene expression, and principal components, with a focus on unsupervised learning techniques like clustering and dimensionality reduction.
Full Transcript
Welcome back. In this, the last segment of this section, we're going to see an example of hierarchical clustering applied to a study of breast cancer. This is an example which Trevor and I were both involved in, actually 10 years ago now, with a postdoc at Stanford in oncology, Therese Sørlie. Therese had measured gene expression from gene chips for about 88 women who were being treated for breast cancer, with gene expression measurements for about 8,000 genes. What that means is that for each of the 88 patients there's a quantitative measurement for each of 8,000 genes, which measures how much that gene was expressing, how active it was, for that woman. This is a very common kind of study now, where people look at gene expression to try to understand the basis of diseases like breast cancer and figure out whether there are subtypes of the disease which should be treated in a different way.
So this is quite a large amount of data: 88 patients, 8,000 features. The group used average linkage with a correlation metric. Again, this is a case where the genes are measured in the same units, but the actual level of gene expression wasn't very reliable, because it varies with the way it's measured. What was thought to be more important was the shape, the relative expression of different genes for the same patient. That's why we used a correlation metric, and we did hierarchical clustering of the samples, the 88 patients.
Now, when Therese first used the full set of genes, the clustering she got out wasn't satisfactory. What does that mean? Well, again, it's very subjective, but it wasn't very informative to Therese and her collaborators. So instead they used a subset of the genes called the intrinsic genes. This is a way of choosing a more informative subset of genes, and I won't go into the detail except to say it in words. In this particular study these women were given chemotherapy, and there was a sample taken before and after for each woman, so gene expression measurements were available before and after. What Therese did was define what she called intrinsic genes: for each gene, we looked to see which genes had the smallest variation within a woman, as opposed to between the 88 women, and the ones with the smallest variation were defined to be the intrinsic genes. These are the 500 or so genes with the lowest variation. The idea, and again this is a biological concept, was that genes which didn't vary much in a woman before and after chemotherapy, compared to the between-woman variation, were thought to be intrinsic to her cell biology. So they were thought to be the ones that could best drive the clustering and separate the women in terms of their biology and maybe their response to treatment. They varied a lot between women but little within women across the two repeated measures.
Doing that, we got the following clustering. So what do we see here? First of all, here are the 500 or so intrinsic genes, and this is called a heat map, a common display for this kind of data. Each row of the heat map is a gene, 500-some genes. Each column is a woman, one of the 88 women. Each pixel is displayed as either green, which is negative, or red, which is positive. The gene expression is normalized, so it runs from something like -5 to +5. Green means the gene expression for that gene for that woman is lower than average, and red means it's higher than average. What's been done here is we applied hierarchical clustering to the columns, that's the women, in the way I just described. In addition, hierarchical clustering was done on the rows, the genes. So this is done in both directions, and that's why this picture has patches of red and green: we've sorted the observations by the order of the leaves in the tree, both for genes and for samples. If we just displayed the data in the order we obtained it, this picture would not look so nice. It would be a checkerboard pattern; it would look very random. But here it looks much more structured, because the clustering has been applied and the ordering of the leaves has been used to reorder the rows and columns.
This kind of display was actually first used at Stanford, in the genomics labs, around the time the gene chip was invented, which was also done partly here at Stanford. I think it has become very attractive because it's a nice way just to see all the data. If you're given a data set of 88 observations, women, and 8,000 genes, that's a lot of data just to even look at. So the first challenge is: how do I make a display so I can look at all the data and see the gross patterns? This is actually a very effective display. It was one of the pioneering efforts in the labs of Patrick Brown and David Botstein, where gene expression really started. It's kind of fun to think that a pioneering piece of science is actually a display, but that's often the case. Some very simple things, which might seem trivial, can have a lot of impact. In this case, just the ability to arrange and display the data informatively was very useful, and it's still used a lot today.
So here's the full heat map, and the clustering tree is at the top. It's been expanded out here, and it's been divided into one, two, three, four, five, six, seven, eight clusters. The gray is basically an unknown group, but the other clusters have been labeled with names like normal, basal, ERBB2, luminal A, and luminal B. These names were chosen by Therese and collaborators based on the genes that were expressing in the groups.
Now, if we look at this picture, we've taken the same clusters and just taken subsets of the rows, these five groups. These are genes which are expressing highly in one or more of these key groups. For example, this block of C genes is expressing highly in the red group and the blue group, this block of D genes is expressing highly in these clusters, etc. Then the oncologists will look at this and try to understand: these groups are different with respect to these particular genes, so what do these genes do in the cells, and what does that tell us about these subgroups?
Let's go on to the last display. If you look at these subgroups, you can look at the survival of these women. These are called Kaplan-Meier survival curves. These women were treated for cancer and followed up; hopefully some recovered, and some didn't, and the survival curves of the groups are given here. For example, the red and the purple (which groups are those? basal and ERBB2) are not doing nearly as well; their probability of survival is much worse. Whereas this group, the luminal A group, is doing much better. Because the survival is quite different, the scientists really wanted to find out how these groups are different, and with respect to what genes, and that gives us a clue as to how the disease might be different in the different groups. So that's an example of clustering in a real scientific problem that's of importance.
Just to wrap up this section: unsupervised learning has been the topic, and we've talked about principal components and clustering. They're important in general for understanding the variation and grouping structure of a set of unlabeled data. They can be useful by themselves, as we saw in that last example, or as a preprocessor to choose a linear combination of features for supervised learning. We also saw that the problem is intrinsically harder than supervised learning, because there's no label, no gold standard, so you can't use prediction error to figure out how well you're doing. We've shown you just two techniques in these presentations, principal components and clustering, and those are part of a big tool bag of other techniques. Some of them are listed here, like self-organizing maps, independent component analysis, spectral clustering, and many more. Many of these are covered in our book, The Elements of Statistical Learning, in Chapter 14, and even beyond that there are many others as well.
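The pipeline described in the transcript (select intrinsic genes from before/after measurements, then cluster the women using a correlation metric with average linkage and read off the leaf order for a heat map) can be sketched in plain Python. This is a minimal illustration on synthetic data, not the code or data from the study. The function names, the within/between variance ratio used as the selection score, the toy sizes (8 women, 20 genes), and all simulated values are assumptions made for the example.

```python
import math
import random
from statistics import mean, pvariance


def intrinsic_genes(before, after, k):
    """Return indices of the k genes with the smallest ratio of
    within-woman to between-woman variation (an assumed score).
    before[g][w], after[g][w]: expression of gene g for woman w."""
    scores = []
    for g, (b_row, a_row) in enumerate(zip(before, after)):
        # Within: variance of each before/after pair, averaged over women.
        within = mean(((b - a) / 2) ** 2 for b, a in zip(b_row, a_row))
        # Between: variance across women of the per-woman pair average.
        between = pvariance([(b + a) / 2 for b, a in zip(b_row, a_row)])
        scores.append((within / (between + 1e-12), g))
    return [g for _, g in sorted(scores)[:k]]


def correlation_distance(x, y):
    """1 - Pearson correlation: compares profile shape, not level."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def average_linkage_clusters(profiles, n_clusters=1):
    """Naive agglomerative clustering with average linkage.
    Each cluster is a list of leaves; merging concatenates the lists,
    so with n_clusters=1 the single list is a dendrogram leaf order."""
    d = [[correlation_distance(p, q) for q in profiles] for p in profiles]
    clusters = [[i] for i in range(len(profiles))]

    def linkage(i, j):
        # Mean of all pairwise distances between the two clusters.
        return mean(d[a][b] for a in clusters[i] for b in clusters[j])

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(*ij))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Synthetic data (hypothetical): genes 0-5 separate two groups of women
# and are stable before/after; genes 6-19 are noisy and uninformative.
random.seed(0)
n_women, n_genes, n_intrinsic = 8, 20, 6
group = [0] * 4 + [1] * 4
before, after = [], []
for g in range(n_genes):
    b_row, a_row = [], []
    for w in range(n_women):
        level = 0.5 * (w % 4)  # per-woman offset: unreliable overall level
        if g < n_intrinsic:
            sign = 1 if (g < 3) == (group[w] == 0) else -1
            base, sd = level + 2 * sign, 0.05
        else:
            base, sd = level, 1.0
        b_row.append(random.gauss(base, sd))
        a_row.append(random.gauss(base, sd))
    before.append(b_row)
    after.append(a_row)

selected = intrinsic_genes(before, after, n_intrinsic)
# One profile per woman: her averaged expression over the selected genes.
profiles = [[(before[g][w] + after[g][w]) / 2 for g in selected]
            for w in range(n_women)]
clusters = average_linkage_clusters(profiles, n_clusters=2)
leaf_order = average_linkage_clusters(profiles)[0]  # column order for a heat map

print("intrinsic genes:", sorted(selected))
print("clusters:", clusters)
print("leaf order:", leaf_order)
```

Reordering the heat-map columns by `leaf_order` (and the rows by the same procedure applied to genes) is what produces the red and green patches described above. For real data sets a library routine would be the usual choice, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="correlation"`; the loop here is written out only to make the steps visible.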