StatQuest: DESeq2, part 1, Library Normalization

StatQuest with Josh Starmer · Beginner ·📄 Research Papers Explained ·9y ago

Key Takeaways

The video explains the library normalization process in DESeq2, a program used to identify differentially expressed genes, using log base e calculations and scaling factors to account for read depth and library composition. It covers the steps involved in normalizing libraries, including averaging rows, filtering out genes with infinity as their average, and calculating median ratios.

Full Transcript

StatQuest, StatQuest, ooh, I love you, StatQuest! Hello and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we're going to talk about DESeq2, which is a program people use to identify differential gene expression. DESeq2 is a big and complicated program, so we're going to break it down into parts. This is part one: library normalization.

Remember RPKM, FPKM and TPM, those nice methods for adjusting for differences in overall read counts among libraries? DESeq2 doesn't use those methods, and neither does edgeR, by the way. Why not? There are two main problems in library normalization, so let's talk about them.

Problem number one: adjusting for differences in library sizes. For the sake of keeping the example simple, let's assume there are only six genes in the genome. Here we have sample number one, which has a total of 635 reads mapped to it, and here we see how the reads are distributed among the six genes. Sample number two has 1,270 reads mapping to it, and we see how those reads are distributed among the genes in the genome. Sample number one has half as many reads as sample number two. The read counts for each gene in sample number two are twice the read counts in sample number one. This difference is not due to biology but to sequencing depth. RPKM, FPKM, TPM and CPM all deal with this. No big deal.

However, there is another problem. Problem number two: adjusting for differences in library composition. RNA-seq and other high-throughput sequencing is often used to compare one tissue type to another, for example liver versus spleen. It could be that there are a lot of liver-specific genes transcribed in liver but not in the spleen. This is an example of a difference in library composition. You can also imagine seeing differences in library composition in the same tissue type if you knock out a transcription factor or something that regulates gene transcription.

Let's look at a specific example. In this example both libraries are the same size: both have 635 reads. Now assume expression of all genes is the same, with one exception. Assume that only sample number one transcribes the gene called A2M. To make matters worse, sample number one transcribes A2M at a very high level. Here we see that 563 reads of a total of 635 reads mapped to this gene in sample number one. This means that the 563 reads used up by A2M in sample number one will be distributed to other genes in sample number two. Here we see the read counts for all of the genes in sample number one and sample number two. The read counts for everything but A2M are crazy high in sample number two. However, the only differentially expressed gene is A2M. Because sample number two does not transcribe A2M, all of the other genes get the read counts that would have gone to it, and this makes those read counts larger.

The folks that wrote DESeq2 and edgeR were aware that their tools would be used with all kinds of data sets, so they wanted their normalization to handle, one, differences in library sizes and, two, differences in library composition.

We'll start with a small data set to illustrate how DESeq2 scales the different samples. The goal is to calculate a scaling factor for each sample. The scaling factor has to take read depth and library composition into account.

So the first thing that DESeq2 does is it takes the log of all the values. DESeq2 uses the log base e, so these numbers are what we would need to raise e to in order to get the original value. So if the original read count was 10, we'd have to raise e to 2.3 to get that value. DESeq2 could have used log base 2 or log base 10, but log base e is the default in R, which is the programming language that was used to create DESeq2. Anyways, I think because it's the default, that's why they chose it. Notice that the log of 0 equals negative infinity. This is just because R defines log of zero to be negative infinity. If you'd like to learn more about logs, check out the StatQuest on logs.

The next thing that DESeq2 does is it averages each row. Anytime you add a number to infinity or negative infinity you end up with infinity or negative infinity, which is why the average for Gene 1 is negative infinity.

One cool thing about the average of log values is that the average is not easily swayed by outliers. To see this, let's calculate the average read count for Gene 3. We see that the read counts for Gene 3 in sample number three are really high. That makes it an outlier. If we just average the raw read counts for Gene 3 we get 96. Now convert the average log value for Gene 3 into a normal number. Remember that logs are exponents, and in this case they are exponents of e, so we have to raise e to the power of 4.3 to get a normal number. e raised to the power of 4.3 equals 73.7. The average calculated with the logs is smaller and thus not swayed as much by the outlier. And for all you StatQuesters out there that can remember the names of things, averages calculated with logs are called geometric averages.

Hooray, we've made it all the way through step two. Only five more steps to go.

Step three is an easy step: filter out genes with infinity as their average. So in this case we're going to filter out Gene number 1. In general, this step filters out genes with zero read counts in one or more samples. If you are comparing liver and spleen, this will remove all of the genes only transcribed in liver or spleen. In theory, this helps focus the scaling factors on the housekeeping genes: genes transcribed at similar levels regardless of tissue type.

Step four: subtract the average log value from the log of the counts. Here we have the log of the counts for each gene in each sample, and here we have the average of the log values, and all we have to do is subtract that average from each sample. So in this case, for Gene number 2 in sample number one, we subtract 1.7 from 0.7. That gives us -1. For Gene number 3 in sample number one, we subtract 4.3 from 3.5. That gives us -0.8. And we just do the same thing for all the other genes.

Remember, when we subtract the log of one value from the log of another value, that's the same thing as the log of dividing those two values, or the ratio of those two values. So we're really checking out the ratio of the reads in each sample to the average across all samples. This will allow us to identify genes within each sample that are expressed at levels significantly higher than the average, or close to the average, or significantly less than the average.

Okay, now let's move on to step five: calculate the median of the ratios for each sample. So here we have the log of the ratios of the reads for each gene divided by the average for each gene, and all we have to do is calculate the median value for each sample. Note: using the median is another way to keep extreme genes from swaying the value too much in one direction. Genes with huge differences in expression have no more influence on the median than genes with minor differences. Since genes with huge differences will most likely be rare, the effect is to give more influence to moderate differences and housekeeping genes.

Okay, now we're ready for step six: convert the medians to normal numbers to get the final scaling factors for each sample. Again, these are log values, so they are exponents, in this case exponents for e. To calculate the scaling factor for each sample, we raise e to the median value for each sample. Awesome, we have scaling factors for the three samples. Now all we do is divide the original read counts by them.

That leads us to step seven: divide the original read counts by the scaling factors. Here's our table that lists the original read counts. Note: Gene 1 is part of the original read counts. We didn't use Gene 1 when we calculated the scaling factors because it has a zero for read counts in sample number one. However, we still need to scale the read counts for the other samples. And here we have a table of scaled read counts. I've rounded to the nearest read just to make this table easy to look at. We can see that the read counts for sample number one were scaled up and that the read counts for sample number three were scaled down.

Here's a summary of DESeq2's library size scaling factor. Logs eliminate all genes that are only transcribed in one sample type (liver versus spleen). They also help smooth over outlier read counts via the geometric mean. The median further downplays genes that soak up a lot of reads, putting more emphasis on moderately expressed genes. The idea behind using logs and the median is to hopefully focus the scaling factor on just the housekeeping genes: the genes that are transcribed at the same levels in all of the samples you're looking at.

Hooray, we've made it to the end. We now know how DESeq2 normalizes the read counts in each library. Tune in next time when we talk about how edgeR normalizes the read counts. Until then, quest on!
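The A2M scenario from the transcript can be checked numerically. The sketch below is only an illustration: the per-gene counts and the helper name `median_of_ratios_factors` are assumptions made up here to mirror the description (two libraries of 635 reads, 563 of them on A2M in sample 1 only), and the function follows the steps as narrated rather than DESeq2's actual source code.

```python
import numpy as np

# Hypothetical counts (rows = genes, columns = sample 1, sample 2).
# Only the totals (635 each) and the A2M count (563) come from the transcript.
genes = ["A2M", "gene2", "gene3", "gene4", "gene5", "gene6"]
counts = np.array([
    [563,   0],
    [ 30, 265],
    [ 24, 212],
    [ 10,  88],
    [  6,  53],
    [  2,  17],
], dtype=float)


def median_of_ratios_factors(counts):
    """Per-sample scaling factors from the log / average / median steps."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)            # log(0) becomes -inf
    row_means = log_counts.mean(axis=1)        # -inf for genes with a zero
    keep = np.isfinite(row_means)              # drops A2M here
    log_ratios = log_counts[keep] - row_means[keep, np.newaxis]
    return np.exp(np.median(log_ratios, axis=0))


# Scaling by total reads changes nothing, because both libraries are the
# same size, so every gene except A2M still looks strongly up in sample 2.
library_sizes = counts.sum(axis=0)
total_count_scaled = counts / library_sizes * library_sizes.mean()

# Scaling by the median-of-ratios factors brings those genes back in line.
factors = median_of_ratios_factors(counts)
median_ratio_scaled = counts / factors

for name, by_total, by_median in zip(genes, total_count_scaled, median_ratio_scaled):
    print(f"{name:6s} total-count: {by_total.round(1)}  median-of-ratios: {by_median.round(1)}")
```

After scaling by the median-of-ratios factors, the five shared genes have nearly equal values in both samples, and A2M remains the only gene that stands out.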

Original Description

DESeq2 is a complicated program used to identify differentially expressed genes. Here I clearly explain the first thing it does: normalize the libraries.

Correction: at 9:28 I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from the log of each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
https://statquest.org/statquest-store/
...or just donating to StatQuest!
https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer

#statquest #rnaseq #deseq2
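The correction in the description can be written out in symbols. The notation is introduced here and is not from the video: $k_{ij}$ is the read count for gene $i$ in sample $j$, $m$ is the number of samples, $r_{ij}$ is the log ratio from step four, and $s_j$ is the scaling factor for sample $j$. The median is taken over the genes that survive the filtering step.

```latex
\begin{align*}
r_{ij} &= \ln k_{ij} - \frac{1}{m}\sum_{j'=1}^{m} \ln k_{ij'}
        = \ln\!\left( \frac{k_{ij}}{\bigl(\prod_{j'=1}^{m} k_{ij'}\bigr)^{1/m}} \right) \\
s_j    &= \exp\!\Bigl( \operatorname*{median}_{i}\; r_{ij} \Bigr)
\end{align*}
```

The denominator inside the log is the geometric average of gene $i$ across samples, which is why subtracting the average of the logs is the same as taking the log of the ratio to the geometric average.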

This video teaches the fundamentals of library normalization in DESeq2, a crucial step in identifying differentially expressed genes. By understanding how DESeq2 normalizes libraries, researchers can better analyze their RNA-seq data and draw meaningful conclusions. The video covers the key concepts and steps involved in library normalization, including log base e calculations and scaling factors.

Key Takeaways

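Take the log (base e) of every read count; a count of zero becomes negative infinity
Average the log values in each row (gene); a row containing a zero averages to negative infinity
Filter out genes with infinity as their average
Subtract the average log value from the log of the counts to get the log ratio of each sample's reads to the gene's average
Calculate the median of the ratios for each sample so that extreme genes do not sway the value
Convert the medians to normal numbers (raise e to each median) to get the final scaling factors
Divide the original read counts by the scaling factors to normalize the read counts
Taking logs has the side effect of eliminating genes that are only transcribed in one sample type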
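The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not DESeq2's own code. The count table is an assumption: the values were picked here so that they agree with the figures quoted in the transcript (Gene 1 has zero reads in sample 1, Gene 3 has a raw average of 96 and an average log of about 4.3).

```python
import numpy as np

# Illustrative read counts (rows = genes, columns = samples 1 to 3).
# Assumed values, chosen to match the numbers quoted in the transcript.
counts = np.array([
    [ 0, 10,   4],   # Gene 1: zero reads in sample 1
    [ 2,  6,  12],   # Gene 2
    [33, 55, 200],   # Gene 3: sample 3 is an outlier
], dtype=float)

# Step 1: take the natural log of every count (log(0) is -inf, as in R).
with np.errstate(divide="ignore"):
    log_counts = np.log(counts)

# Step 2: average each row (gene); any -inf in a row makes the average -inf.
row_means = log_counts.mean(axis=1)

# Step 3: filter out genes whose average is not finite (Gene 1 here).
keep = np.isfinite(row_means)

# Step 4: subtract the average log value from the log of the counts,
# which gives the log of the ratio of each count to the geometric average.
log_ratios = log_counts[keep] - row_means[keep, np.newaxis]

# Step 5: take the median of the log ratios for each sample (column).
median_log_ratios = np.median(log_ratios, axis=0)

# Step 6: convert the medians to normal numbers to get the scaling factors.
scaling_factors = np.exp(median_log_ratios)

# Step 7: divide ALL original counts, including Gene 1, by the factors.
scaled_counts = counts / scaling_factors

print("scaling factors:", scaling_factors.round(2))
print("scaled counts:")
print(scaled_counts.round())
```

With these assumed counts, sample 1 gets a factor below 1 (its reads are scaled up) and sample 3 gets a factor above 1 (its reads are scaled down), which is the direction described in the transcript.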

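💡 Averaging logs (geometric averages) and taking medians focuses the scaling factors on genes expressed at similar levels in every sample, which is what lets DESeq2 adjust for both read depth and library composition. The choice of log base e is a convenience: any log base gives the same scaling factors.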
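The transcript notes that DESeq2 could have used log base 2 or log base 10 instead of base e. A quick check, using assumed read counts for a single gene (chosen here so the raw average is 96, as in the transcript), suggests why the base does not matter and how the geometric average compares with the ordinary one:

```python
import numpy as np

# Assumed read counts for one gene across three samples; the last is an outlier.
gene3 = np.array([33.0, 55.0, 200.0])

# Ordinary (arithmetic) average of the raw counts.
arithmetic_mean = gene3.mean()

# Geometric average: average the logs, then undo the log in the same base.
geo_mean_ln = np.exp(np.log(gene3).mean())
geo_mean_log2 = 2.0 ** np.log2(gene3).mean()
geo_mean_log10 = 10.0 ** np.log10(gene3).mean()

print("arithmetic mean:       ", arithmetic_mean)
print("geometric mean (ln):   ", round(geo_mean_ln, 2))
print("geometric mean (log2): ", round(geo_mean_log2, 2))
print("geometric mean (log10):", round(geo_mean_log10, 2))
```

All three bases give the same geometric average, and it sits below the arithmetic average because the outlier has less pull. With these assumed values the unrounded geometric average is about 71.3; the 73.7 in the transcript is e raised to the rounded value 4.3.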
