Statistical Learning: 12.6 Breast Cancer Example

Stanford Online · Beginner ·📐 ML Fundamentals ·3y ago


Statistical Learning, featuring Deep Learning, Survival Analysis and Multiple Testing

Trevor Hastie, Professor of Statistics and Biomedical Data Sciences at Stanford University - https://statistics.stanford.edu/people/trevor-j-hastie
Robert Tibshirani, Professor of Statistics and Biomedical Data Sciences at Stanford University - https://statistics.stanford.edu/people/robert-tibshirani
Jonathan Taylor, Professor of Statistics at Stanford University - https://statistics.stanford.edu/people/jonathan-taylor

You can take Statistical Learning as an online course on edX, and you can choose a verified path and get a certificate for its completion. You can take the course in R (https://www.edx.org/course/statistica) or in Python (https://www.edx.org/learn/data-analysis-statistics/stanford-university-statistical-learning-with-python).

For more information about courses on Statistics, you can browse our Stanford Online Catalog: https://stanford.io/3QHRi72

What You'll Learn

The video discusses statistical learning using the example of breast cancer, covering topics such as hierarchical clustering, gene expression, and principal components, with a focus on unsupervised learning techniques like clustering and dimensionality reduction.

Full Transcript

Welcome back. In this, the last segment of this section, we're going to see an example of hierarchical clustering applied to a study of breast cancer. This is an example which Trevor and I were both involved in, about 10 years ago now, with a postdoc at Stanford in oncology, Therese Sørlie. Therese had measured gene expression from gene chips for about 88 women who were being treated for breast cancer, with gene expression measurements for about 8,000 genes. What that means is that for each of the 88 patients there's a quantitative measurement for each of the 8,000 genes, which measures how much that gene was expressing, how active it was, for that woman. This is a very common kind of study now, where people look at gene expression to try to understand the basis of diseases like breast cancer and to figure out whether there are subtypes of the disease which should be treated in a different way.

So this is quite a large amount of data: 88 patients, 8,000 features. She, or rather the group, used average linkage with a correlation metric. This is a case where the genes are measured in the same units, but the actual level of gene expression wasn't very reliable because it varies with the way it's measured. What was thought to be more important was the shape, the relative expression of different genes for the same patient. That's why we used the correlation metric, and we did hierarchical clustering of the samples, the 88 patients.

Now, when Therese first used the full set of genes, the clustering she got out wasn't satisfactory. What does that mean? Again, it's very subjective, but it wasn't very informative to Therese and her collaborators. So instead they used a subset of the genes called the intrinsic genes. This is a way of choosing a more informative subset of genes. I won't go into the detail, except to say in words: in this particular study these women were given chemotherapy, a sample was taken before and after for each woman, and gene expression measurements were available before and after. For each gene, we looked at its variation within a woman, across her two repeated measures, as opposed to between the 88 women. The genes with the smallest within-woman variation were defined to be the intrinsic genes; these are the 500 or so genes with the lowest variation. The idea, and again this is a biological concept, was that genes which didn't vary much in a woman before and after chemotherapy, compared to the between-woman variation, were thought to be intrinsic to her cell biology. So they were thought to be the ones that could best drive the clustering and separate the women in terms of their biology and maybe their response to treatment: they varied a lot between women but little within women across the two repeated measures.

Doing that, we got the following clustering. What do we see here? Here are the 500 or so intrinsic genes. This is called a heat map, and it's a common display for this kind of data. Each row of the heat map is a gene, one of the 500 or so genes; each column is a woman, one of the 88 women. Each pixel is displayed as either green or red. The gene expression is normalized so it runs from something like -5 to +5; green is negative and red is positive. So green means the gene expression for that gene for that woman is lower than average, and red means it's higher than average. What's been done here is we applied hierarchical clustering to the columns, that's the women, in the way I just described. In addition, hierarchical clustering was done on the rows, the genes. So it's done in both directions, and that's why this picture has patches of red and green: we've sorted the observations by the order of the leaves in the tree, both for genes and for samples. If we just displayed the data in the order we obtained it, this picture would not look so nice. It would be a checkerboard pattern; it would look very random. Here it looks much more structured, because the clustering has been applied and the ordering of the leaves has been used to reorder the rows and columns.

This kind of display was actually first used at Stanford, in the genomics labs, around the time the gene chip was invented, which was also done partly here at Stanford. I think it has become very attractive just because it's a nice way to see all the data. If you're given a data set of 88 observations, women, and 8,000 genes, that's a lot of data even just to look at. So the first challenge is: how do I make a display so I can look at all the data and see the gross patterns? This is actually a very effective display. It was one of the pioneering efforts in the labs of Patrick Brown and David Botstein, where gene expression really started. It's kind of fun to think that a pioneering piece of science is actually a display, but that's often the case: some very simple things, which might seem trivial, can have a lot of impact. In this case, just the ability to arrange and display the data informatively was very useful, and it's still used a lot today.

So here's the full heat map, and the clustering tree is at the top. It's been expanded out here, and it's been divided into one, two, three, four, five, six, seven, eight clusters. The gray is basically an unknown group, but the other clusters have been labeled with names like normal, basal, ERBB2, luminal A, and luminal B. These names were chosen by Therese and collaborators based on the genes that were expressing in the groups.

Now, in this picture we've taken the same clusters and we've just taken subsets of the rows, these five groups. These are genes which are expressing highly in one or more of these key groups. For example, this block, C, of genes is expressing highly in the red group and the blue group; this block, D, of genes is expressing highly in these clusters; and so on. The oncologists will look at this and try to understand: these groups are different with respect to these particular genes, so what do these genes do in the cells, and what does that tell us about these subgroups?

Let's move on to the last display. If you look at these subgroups, you can look at the survival of these women. These are called Kaplan-Meier survival curves. These women were treated for cancer and followed up; some, hopefully, recovered and some didn't, and the survival curves of the groups are given here. For example, the red and the purple groups, which I believe are basal and ERBB2, are not doing nearly as well; their probability of survival is much worse. Whereas this group, the luminal A, is doing much better. Because the survival is quite different, the scientists really wanted to find out how these groups are different, and with respect to what genes, and that gives us a clue as to how the disease might be different in the different groups. So that's an example of clustering in a real scientific problem of importance.

Just to wrap up this section: unsupervised learning has been the topic. We've talked about principal components and clustering. They're important in general for understanding the variation and grouping structure of a set of unlabeled data. They can be useful by themselves, as we saw in that last example, or as a preprocessor to choose a linear combination of features for supervised learning. We also saw that the problem is intrinsically harder than supervised learning because there's no label, there's no gold standard, so you can't use prediction error to figure out how well you're doing. We've shown you just two techniques in these presentations, principal components and clustering, and those are part of a big bag of tools with lots of other techniques. Some of them are listed here, like self-organizing maps, independent component analysis, spectral clustering, and many more. Many of these are covered in our book, Elements of Statistical Learning, in chapter 14, and even beyond that there are many others as well.
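
The steps described in the transcript (choosing intrinsic genes from paired before/after samples, clustering with average linkage and a correlation metric, and reordering rows and columns by dendrogram leaf order for the heat map) can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are synthetic, the sizes are made up, and the helper names `intrinsic_gene_indices` and `cluster_order` are hypothetical. The within/between variance ratio used here is one plausible reading of "smallest variation within a woman as opposed to between women", not necessarily the exact criterion used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic stand-in for the study (all sizes are made up):
# n_women patients, n_genes genes, two samples per woman (before/after).
n_women, n_genes, n_keep = 20, 200, 50

# The first half of the genes carry a stable per-woman signal;
# the second half are mostly measurement noise.
signal = rng.normal(size=(n_women, n_genes))
signal[:, n_genes // 2:] *= 0.1
before = signal + 0.3 * rng.normal(size=(n_women, n_genes))
after = signal + 0.3 * rng.normal(size=(n_women, n_genes))


def intrinsic_gene_indices(before, after, n_keep):
    """Indices of the n_keep genes with the smallest within/between variance ratio."""
    # Variance of two repeated measures is (x1 - x2)^2 / 2; average over women.
    within = np.mean((before - after) ** 2, axis=0) / 2.0
    # Variance across women of each woman's mean expression.
    between = np.var((before + after) / 2.0, axis=0, ddof=1)
    return np.argsort(within / between)[:n_keep]


def cluster_order(data):
    """Average-linkage clustering of the rows of `data` using correlation distance."""
    tree = linkage(pdist(data, metric="correlation"), method="average")
    return tree, leaves_list(tree)


keep = intrinsic_gene_indices(before, after, n_keep)
X = before[:, keep]  # rows = women, columns = selected genes

tree_women, women_order = cluster_order(X)    # cluster the samples
tree_genes, gene_order = cluster_order(X.T)   # cluster the genes

# Heat-map layout from the lecture: rows are genes, columns are women,
# both reordered by the leaf order of their own dendrogram.
heatmap = X.T[np.ix_(gene_order, women_order)]

# Cut the sample tree into at most 4 groups (4 is an arbitrary choice here).
labels = fcluster(tree_women, t=4, criterion="maxclust")
```

Passing `heatmap` to a plotting routine such as matplotlib's `imshow` with a green-to-red colormap would give a display in the spirit of the one discussed; plotting the unsorted `X.T` instead shows the unstructured pattern the lecture contrasts it with.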

This video teaches statistical learning techniques using the example of breast cancer, focusing on unsupervised learning methods such as hierarchical clustering and principal components. It highlights the challenges of unsupervised learning and the importance of understanding variation and grouping structure in unlabeled data.

Key Takeaways

Collect and preprocess gene expression data from breast cancer patients
Select intrinsic genes (low variation within a patient, high variation between patients), then apply hierarchical clustering to group the patients
Use principal components to reduce dimensionality and understand variation
Compare survival rates among different breast cancer groups
Evaluate the effectiveness of unsupervised learning techniques
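
The survival comparison between clusters is usually drawn with Kaplan-Meier curves, one per group. The following is a minimal from-scratch sketch of that estimator in Python, assuming right-censored follow-up data. The function name `kaplan_meier` and the follow-up times are invented for illustration and are not taken from the study.

```python
import numpy as np


def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if the event (death) was observed, 0 if the patient was censored
    Returns the distinct event times and the estimated survival just after each.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still being followed at time t
        deaths = np.sum((times == t) & (events == 1))   # events exactly at time t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Made-up follow-up data (months) for two hypothetical clusters.
times_a, surv_a = kaplan_meier([5, 8, 12, 20, 30, 40], [1, 1, 1, 0, 1, 0])
times_b, surv_b = kaplan_meier([15, 25, 35, 45, 50, 60], [0, 1, 0, 0, 1, 0])
```

Censored patients contribute to the at-risk count up to their last follow-up time but never trigger a drop in the curve, which is why group B above, with mostly censored observations, only steps down twice.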

💡 Unsupervised learning is harder than supervised learning due to the lack of labels, but techniques like clustering and dimensionality reduction can help uncover hidden patterns and structures in the data.
