RPKM, FPKM and TPM, Clearly Explained!!!

StatQuest with Josh Starmer · Beginner ·📄 Research Papers Explained ·10y ago


Key Takeaways

This video explains the differences between RPKM, FPKM, and TPM, three metrics used to normalize read counts for sequencing depth and gene length in high-throughput RNA-seq experiments.

Full Transcript

StatQuest, StatQuest, StatQuest! Hello and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we're going to be talking about RPKM versus FPKM versus TPM.

In previous StatQuests we've talked about topics that are broadly applicable to a variety of fields. These subjects, however, really only relate to high-throughput sequencing of RNA, so if that's what you're interested in, quest on.

There's a new RNA-seq metric on the block. We used to report RPKM, or reads per kilobase million, or FPKM, fragments per kilobase million. These normalize read counts for the sequencing depth; that's the "million" part. It's because sequencing runs with more depth will have more reads mapping to each gene than sequencing runs with less depth, and we don't want that to bias our analysis. The other thing that these metrics normalize for is the length of genes; that's the "kilobase" part. Longer genes will naturally have more reads mapping to them than shorter genes, and it's important to remove that bias from the analysis as well. However, nowadays they want us to use TPM, which stands for transcripts per million. So today we're going to talk about these three things, how they're related, how they're different, and how I think that using TPM is actually a really good idea.

Let's start with an example. To understand the differences between TPM and RPKM and FPKM, we'll work through the math using an imaginary RNA-seq data set with three replicates (rep 1, 2 and 3) for a genome with only four genes: A, B, C and D. On the left we see the names of our genes and the lengths of each gene. To the right we see the read counts for each replicate. We see that replicate 3 has way more reads than the other replicates, regardless of the gene. This means it had higher sequencing depth than the other replicates. We're going to normalize for that. We also see that Gene B is twice as long as Gene A, and that might explain why it always gets twice as many reads, regardless of the replicate. We're going to normalize for this too.

First, we're going to normalize the data using the familiar RPKM metric. With RPKM, the first step is to normalize for read depth. Here we've calculated the total number of reads in each replicate. For the purpose of this four-gene example, we're going to scale the total read counts by 10 instead of 1 million. This will make the numbers easier to read, but in future slides consider tens of reads and millions of reads to be interchangeable, just for this example. Originally, 1 million was picked just to make the numbers look nice so they wouldn't require too many decimal places. Thus, these are our "per million" scaling factors for each replicate, and by scaling I mean we're just going to divide the read counts for each gene by the appropriate scaling factor for that replicate. Using those per million scaling factors, we can calculate the reads per million for each replicate.

The second step for normalizing by RPKM is to normalize for gene length. Here we've got the gene lengths on the left side. All we have to do now is scale per kilobase. After dividing by the length of the genes, the reads are scaled for depth (M) and for gene length (K), so now we have RPKM.

To summarize RPKM: we have our "before" data, which isn't normalized for depth or gene length. Then we normalized for differences in sequencing depth and differences in gene size. And so here's our "after" data: our RPKM values for each replicate and each gene.

Now, RPKM and FPKM are two very closely related terms. However, there's a lot of confusion related to them, so let's clear that up right now. First, RPKM, as you know, stands for reads per kilobase million. FPKM stands for fragments per kilobase million. The only difference is that RPKM is for single-end RNA-seq and FPKM is for paired-end RNA-seq.

To illustrate why it's necessary to make this distinction, consider a fragment that is to be sequenced. With single-end sequencing, there is only one read sequenced per fragment. That read is either on one end of the fragment or the other end of the fragment. In contrast, with paired-end sequencing both ends can map, giving you two reads per fragment. Or sometimes only one end of the pair has a quality read and maps; in this case you only get one read mapping to a fragment. All FPKM does is keep track of the fragments, so that one fragment with two reads mapping to it is not counted twice.

Now that we know what RPKM and FPKM are, let's talk about TPM, or transcripts per million. TPM is like RPKM and FPKM, except the order of the operations is switched. Let's look at an example.

For TPM, the first step is to normalize for gene length. So after dividing each read count by the gene length, we have RPK, or reads per kilobase. The second step for TPM is to normalize for sequencing depth. We do this by adding up the read counts that have already been normalized for gene length to get the total for each replicate. We then divide this total by some number. Usually that's a million, but for this four-gene example we're just going to divide by 10. This gives us our scaling factors. Now we divide the read counts that have already been normalized for gene length by our new scaling factors, and that gives us TPM. That's all there is to it. We did the same things we did for RPKM and FPKM, except in this case we just did them in a different order. However, this will have profound effects on the results.

Here's a comparison of the same original data set scaled for RPKM and TPM. Above we have RPKM and below we have TPM. Both TPM and RPKM correct for biases in gene length and sequencing depth, but the sums of total normalized reads in each column are very different. In RPKM we get a different value for each sample. With TPM we get the same value for each column. Why is this important? I'll show you.

Now consider three pies, each the same size. In this case each pie is size 10. The TPM values in each replicate represent slices in these pies, and we can tell from each slice what proportion of the total reads in that replicate went to each gene. For example, for Gene A in replicate 1 we see that the size of that slice is 3.33. This slice is larger than the slice in replicate 3, which is size 3.326. So the slice in replicate 1 is just a little larger than the slice in replicate 3. What this tells us is that, of all the reads that mapped in replicate 1, a larger proportion of them mapped to Gene A, whereas of all the reads that mapped in replicate 3, a slightly smaller proportion of them mapped to Gene A. This is because some of those reads also mapped to Gene D in replicate 3, and none of the reads mapped to Gene D in replicate 1.

With RPKM it's harder to compare the proportion of total reads that map to each gene, because each replicate has a different total overall. That is to say, each pie has a different size, and that makes it difficult to compare the slices between each one. The main point: with TPM, everyone gets the same sized pie.

All right, in all seriousness, folks are using TPM because the numbers can clearly tell you what proportion of reads map to what gene in each sample, and since RNA-seq is all about comparing relative proportions of reads, this metric seems more appropriate.

The end. I hope this helps you understand the difference between these different metrics. Tune in next time for another exciting StatQuest.
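
The two calculations walked through in the transcript can be sketched in a few lines of Python. This is only an illustrative sketch: the table shown in the video is not reproduced in the transcript, so the gene lengths and read counts below are assumed values chosen to agree with what is said aloud (Gene B twice as long as Gene A, no reads for Gene D in replicate 1, and Gene A slices of 3.33 and about 3.326). The names `rpkm`, `tpm`, `column_sums` and `PER` are likewise made up for this example.

```python
# Assumed toy data: gene lengths in kilobases and read counts for rep 1, 2, 3.
GENE_LENGTH_KB = {"A": 2, "B": 4, "C": 1, "D": 10}
READ_COUNTS = {
    "A": [10, 12, 30],
    "B": [20, 25, 60],
    "C": [5, 8, 15],
    "D": [0, 0, 1],
}
PER = 10  # stands in for the usual 1,000,000, as in the four-gene example


def column_sums(table):
    """Sum each replicate (column) across all genes."""
    n_reps = len(next(iter(table.values())))
    return [sum(row[i] for row in table.values()) for i in range(n_reps)]


def rpkm(counts, lengths_kb, per=PER):
    """Normalize for sequencing depth first, then for gene length."""
    scale = [total / per for total in column_sums(counts)]
    return {
        gene: [c / s / lengths_kb[gene] for c, s in zip(row, scale)]
        for gene, row in counts.items()
    }


def tpm(counts, lengths_kb, per=PER):
    """Normalize for gene length first, then for sequencing depth."""
    rpk = {g: [c / lengths_kb[g] for c in row] for g, row in counts.items()}
    scale = [total / per for total in column_sums(rpk)]
    return {g: [v / s for v, s in zip(row, scale)] for g, row in rpk.items()}


rpkm_table = rpkm(READ_COUNTS, GENE_LENGTH_KB)
tpm_table = tpm(READ_COUNTS, GENE_LENGTH_KB)

if __name__ == "__main__":
    print("RPKM column sums:", [round(s, 3) for s in column_sums(rpkm_table)])
    print("TPM column sums: ", [round(s, 3) for s in column_sums(tpm_table)])
    print("TPM for Gene A:  ", [round(v, 3) for v in tpm_table["A"]])
```

With these assumed counts, every TPM column sums to 10 (the "same sized pie"), while the RPKM column sums differ from replicate to replicate.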

Original Description

A StatQuest http://statquest.org/ about RPKM, FPKM and TPM. These terms are for high-throughput RNA-seq experiments. For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/ If you'd like to support StatQuest, please consider... Patreon: https://www.patreon.com/statquest ...or... YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join ...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store... https://statquest.org/statquest-store/ ...or just donating to StatQuest! https://www.paypal.me/statquest #statquest #rnaseq #rpkm

This video explains the differences between RPKM, FPKM, and TPM metrics used in RNA-seq data analysis, and how to apply them to normalize read counts.

Key Takeaways

Calculate RPKM values by normalizing read counts for sequencing depth and gene length
Calculate FPKM values by keeping track of fragments in paired-end RNA-seq
Calculate TPM values by normalizing for gene length and then sequencing depth
Compare the results of RPKM and TPM to understand the differences
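
The FPKM takeaway comes down to counting fragments instead of reads. A minimal sketch of that bookkeeping follows, assuming each mapped read is available as a hypothetical `(fragment_id, mate_number, gene)` tuple; real aligner output looks different, and the names here are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical mapped reads: (fragment_id, mate_number, gene).
MAPPED_READS = [
    ("frag1", 1, "A"),
    ("frag1", 2, "A"),  # both ends of frag1 mapped
    ("frag2", 1, "A"),  # only one end of frag2 mapped
    ("frag3", 1, "B"),
    ("frag3", 2, "B"),  # both ends of frag3 mapped
]


def count_reads(mapped):
    """Count every mapped read, so a fragment with two mapped ends counts twice."""
    counts = defaultdict(int)
    for _frag_id, _mate, gene in mapped:
        counts[gene] += 1
    return dict(counts)


def count_fragments(mapped):
    """Count each fragment once per gene, however many of its ends mapped."""
    fragments = defaultdict(set)
    for frag_id, _mate, gene in mapped:
        fragments[gene].add(frag_id)
    return {gene: len(ids) for gene, ids in fragments.items()}


read_counts = count_reads(MAPPED_READS)
fragment_counts = count_fragments(MAPPED_READS)
```

The fragment counts, rather than the raw read counts, would then go through the same depth and length normalization as RPKM.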

💡 TPM seems a more appropriate metric than RPKM because every sample's TPM values sum to the same total, which makes it easier to compare the relative proportions of reads between samples.
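
One way to see why the totals behave differently is to write both metrics as formulas. The notation is an assumption introduced here, not taken from the video: $r_i$ is the read count for gene $i$ in one sample, $\ell_i$ is that gene's length in kilobases, and $N$ is the total number of mapped reads in the sample.

```latex
\begin{align*}
\mathrm{RPKM}_i &= \frac{r_i}{\left(N / 10^{6}\right)\,\ell_i},
  \qquad N = \sum_j r_j \\[4pt]
\mathrm{TPM}_i &= \frac{r_i / \ell_i}{\sum_j r_j / \ell_j} \times 10^{6} \\[4pt]
\sum_i \mathrm{TPM}_i &= 10^{6},
  \qquad
\sum_i \mathrm{RPKM}_i = \frac{10^{6}}{N} \sum_i \frac{r_i}{\ell_i}
\end{align*}
```

The TPM total is fixed by construction, whereas the RPKM total depends on how the reads in a given sample are spread across genes of different lengths, so it generally changes from sample to sample.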
