The efficiency misnomer | Size does not matter | What does the number of parameters mean in a model?

AI Coffee Break with Letitia · Advanced ·📄 Research Papers Explained ·4y ago

Skills: Reading ML Papers 90% · Research Methods 80% · Paper Reproduction 70%

Key Takeaways

The video discusses what the number of parameters does and does not tell us about a deep learning model. Following the paper by Dehghani et al., it explains the efficiency misnomer and argues that parameter counts should be reported together with other cost indicators such as FLOPs and speed/throughput.

Full Transcript

Hello there! Let's suppose you are Microsoft and you are releasing a huge and awesome model. How do you make it obvious that it is the best, the biggest, the real thing? Correct: you compare your number of parameters against the literature and make a beautiful plot where you are at the top of the line. So congrats, now everybody is hailing you as the best of the best. And you might as well be; Ms. Coffee Bean is not here to claim otherwise. But what does the number of parameters tell us about a model anyway?

We came across a paper that shows us a bigger picture. It argues that the number of parameters alone is not enough to compare against other models, and that reporting only one number about a model can be highly misleading when it comes to describing its efficiency. In other words, a high number of parameters does not necessarily mean that the model has high capacity, nor that it is slow. So get ready to get your mind enlarged as we welcome you to this AI Coffee Break. Thanks to Aleph Alpha for sponsoring this video; stick around until the end of the video to learn more about them.

Model comparisons: what is this all about? Well, we need to measure scientific progress in deep learning. One way of doing so is to measure the model's quality on specific tasks or benchmarks, that is, how sensible the output is that the model delivers. But we do not talk about this today, nor does this paper, because today is all about model efficiency, which relates to how much training a model costs, how much one has to pay to run inference for many users, and how long a user has to wait until the model delivers the answer.

To describe model efficiency, the literature uses things like, we cite, the number of trainable parameters, the number of floating point operations, and speed/throughput. But the point of this paper is that if one chooses the right cost indicators, one can paint a more favorable picture of the model's efficiency than is true in reality, and that model comparisons relying on only some of these metrics can result in unfair and incomplete comparisons. This is what the authors call the efficiency misnomer. In other words, to avoid the current situation where people just choose the metric they shine at and ignore the others, the authors propose that, well, all metrics should be reported when it comes to comparing architectures in general.

But what are these metrics, and why does one measure alone not suffice to describe the computational cost of a model? Why is Ms. Coffee Bean's secret model of 530 billion and one parameters not automatically the best of the best, and also not necessarily the slowest, since it reportedly (but don't fact-check this) has the most parameters in the world?

Well, let's look at the efficiency measure of floating point operations, or FLOPs for short, which is frequently used to measure the computational cost in the literature. The more operations you have to do, the more computations need to be executed by a piece of hardware. But as the authors point out, we cite: a model with low FLOPs may not actually be fast, given that FLOPs does not take into account information such as the degree of parallelism or hardware-related details like the cost of a memory access. So supposing that model A has a lower number of FLOPs than model B, it might well be that model B is nonetheless more efficient, because its floating point operations are not executed in sequence, like in a recurrent neural network, but in parallel, all at the same time, like in a Transformer. Or even more: maybe model B is better suited for the hardware it runs on and, for example, memory access is faster for model B than for model A. So even though model A has lower FLOPs than B, model B could be much more computationally efficient.

Then let's look at the number of trainable parameters in a model, which is also a widely used measure of efficiency. But the authors here draw attention to the fact that a model with very few trainable parameters can still be very slow, especially when the same parameters are reused throughout the computation, as is the case with convolutions in CNNs and with the feed-forward layers in Transformers, where the same feed-forward network, so the same parameters, is used for each token, as we discussed in a previous video. So while the number of parameters of the fully connected layer does not increase with the sequence length for Transformers, the compute and memory definitely do, because all these outputs of the feed-forward layer have to be computed and stored in memory. The authors highlight that the number of parameters of a model is a good proxy for knowing whether the model fits in memory, but it cannot be a cost indicator, or at least not alone.

Also, the number of parameters is often used to imply the capacity of the model, in the sense of its expressive power for capturing complex functions and dependencies between input and output. This is not necessarily true, because parameter sharing, which reduces the effective number of parameters, does not necessarily decrease the model's expressivity. Just think about it: if we removed the parameter sharing in Transformers and each token had its own feed-forward neural network with its own parameters, which would make the number of parameters increase linearly with the sequence length, would that increase the accuracy of the model? Would that model necessarily learn a better function? No, it wouldn't, not necessarily. I mean, we would have n position-specific feed-forward neural networks instead of just one, but it's questionable whether this would model anything better, or whether it would even converge with the data amounts we currently have, because we would have removed an important inductive bias of the Transformer, which is that every token is treated the same, modulo what the attention scores dynamically computed beforehand.

And yes, it does look like increasing the number of parameters also increases the model's accuracy and quality. But again, it is just an impression we get, because the increase in the number of parameters usually comes with an increase in training data and with some specific architecture changes. So no, it's not just about the number of parameters, but also about the architecture and its inductive biases, the data, again the data, and then all the training gimmicks.

Of course, other metrics like throughput and speed, which relate to how long a user has to wait for the model to give an answer, strongly depend on the hardware and sometimes simply on the skill of the programmer who implemented the whole thing, and do not necessarily say much about the architecture itself. So this measure, again, if used on its own, is not a fair basis for comparing models.

And let's follow the paper a little closer and dive into the discrepancy between training and inference scenarios, which adds a whole new dimension to the discussion. Depending on the context, training, inference, or even both can be more important to look at. For example, Ms. Coffee Bean might not have the data, the compute, or enough money to pay the electricity bill of the TPUs and GPUs that Microsoft, Google, and the like have for training a huge model on billions of data samples. But she does have enough VRAM to load the model and fine-tune it on more tasks. This is one of the appealing aspects of the pre-train and fine-tune paradigm: one general-purpose model is trained just once, and the environmental damage is done only once. And this training cost, we cite, can be very small compared to the inference cost when a model is deployed to be used by many, many users. But in cases where retraining has to be done very often, as is the case with recommender systems that have to be kept up to date with the user preferences and available content at any moment, the training cost can become very important again. On the other hand, if the inference efficiency is the bottleneck because the deep-learning-based application runs on mobile phones, then the inference speed is very important.

So we have seen now how different measures can disagree when it comes to describing efficiency for the same model, and that one measure of efficiency alone can be highly misleading. Also, the number of parameters alone is not a measure of much else than how much storage the model needs. And the more efficiency metrics a paper reports, the better, especially to be sure that nobody misses out on the unfavorable metrics.

And speaking of companies that have the means to crunch the numbers for really large models, let's introduce our sponsor, Aleph Alpha. Aleph Alpha's vision is quite simple: to be the leading European company researching and creating next-generation strong artificial intelligence. Think of it as Europe's OpenAI, but then forget this again, as it puts a European twist to it. I love this; a little bit of competition in AI from Europe will benefit everyone. Now, Aleph Alpha's European approach and data sets have an impact beyond pure language understanding. This means they're going multimodal. So what they do best at Aleph Alpha is research and development in AI with a European focus. They have a diverse team of currently more than 30 senior experts from all relevant fields working relentlessly to create truly transformative AI technology, which surely leads to new ways for human-machine collaboration. Going off script now: I know some of the people working there, and they are great. From research to development to implementation, AI must benefit all of society, and that's why Aleph Alpha strives to align modern, generalizable AI research sustainably with ethical values. And to achieve that, they search for the best talent and partners. So if you're looking for a job, make sure to check out Aleph Alpha. For more information, visit their website or follow them on Twitter; see the links in the description below.

Thanks for watching this episode, and if you enjoyed this, do not forget to tell your own coffee beans about this channel.
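The parameter-sharing argument from the transcript can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: the helper names (`ffn_params`, `ffn_flops_per_token`, `compare`) are hypothetical, the sizes `d_model=512` and `d_ff=2048` are illustrative defaults, a multiply-add is counted as 2 FLOPs, and biases and activations are ignored in the FLOP count. It compares one feed-forward block shared across all tokens with a hypothetical variant that gives each position its own block.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of a two-layer feed-forward block."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    """FLOPs for one token: two matrix-vector products, 2 FLOPs per multiply-add."""
    return 2 * d_model * d_ff + 2 * d_ff * d_model


def compare(seq_len: int, d_model: int = 512, d_ff: int = 2048) -> dict:
    """Contrast a shared block with a hypothetical per-position (unshared) variant."""
    per_block = ffn_params(d_model, d_ff)
    return {
        "seq_len": seq_len,
        "shared_params": per_block,              # independent of sequence length
        "unshared_params": seq_len * per_block,  # grows linearly with sequence length
        "flops_either_way": seq_len * ffn_flops_per_token(d_model, d_ff),
    }


for n in (128, 512, 2048):
    print(compare(n))
```

Under these assumptions the shared block keeps a constant parameter count while its FLOPs grow with the sequence length, and the unshared variant has many more parameters at exactly the same FLOPs. Neither number alone describes the cost.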

Original Description

How important is the number of parameters in deep learning models? But what about other measures like FLOPs or speed/throughput?

► Check out our sponsor Aleph Alpha 👉 https://www.aleph-alpha.de/ ! Follow them on Twitter: Aleph__Alpha

Paper 📜: Dehghani, Mostafa, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. "The Efficiency Misnomer." arXiv preprint arXiv:2110.12894 (2021). https://arxiv.org/abs/2110.12894
🔗 Megatron-Turing NLG 530B: https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏 donor, Dres. Trost GbR, Yannik Schneider
➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

Outline:
00:00 Model efficiency comparison
02:51 FLOPs
03:55 Number of parameters: means what?
06:31 Speed / throughput
09:39 Aleph Alpha (Sponsor)

🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak

🔗 Links:
AICoffeeBreakQuiz: https://www.youtube.com/c/AICoffeeBreak/community
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/
YouTube: https://www.youtube.com/AICoffeeBreak

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research



The video teaches the importance of considering multiple metrics when evaluating model efficiency, beyond just the number of parameters, and highlights the potential for misleading comparisons. It also introduces the concept of the efficiency misnomer and the need for more comprehensive understanding of model efficiency. The video is sponsored by Aleph Alpha, a company researching next-generation strong AI.

Key Takeaways

Read research papers critically to understand model efficiency metrics
Evaluate the number of parameters in relation to other metrics like FLOPs and speed/throughput
Design experiments to compare model efficiency using multiple metrics
Analyze results from multiple metrics to avoid the efficiency misnomer
Implement models with varying parameter counts to test efficiency
Consider the computational cost and floating point operations when evaluating model efficiency
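One way to act on these takeaways is to report several cost indicators side by side rather than a single number. The following is a minimal sketch, not a benchmarking tool: `CostReport`, `shared_layer`, and `report` are hypothetical names, the "model" is a single weight matrix applied to every token in deliberately slow pure Python, and the measured time depends entirely on the machine and interpreter it runs on.

```python
import time
from dataclasses import dataclass


@dataclass
class CostReport:
    params: int     # number of weights in the layer
    flops: int      # 2 FLOPs per multiply-add, summed over all tokens
    seconds: float  # measured wall-clock time for one forward pass


def shared_layer(weights, tokens):
    """Apply the same weight matrix to every token vector."""
    return [[sum(w * v for w, v in zip(row, tok)) for row in weights] for tok in tokens]


def report(d: int, seq_len: int) -> CostReport:
    """Build a toy d x d layer, run it on seq_len tokens, and report all three indicators."""
    weights = [[0.01] * d for _ in range(d)]
    tokens = [[1.0] * d for _ in range(seq_len)]
    start = time.perf_counter()
    shared_layer(weights, tokens)
    seconds = time.perf_counter() - start
    return CostReport(params=d * d, flops=2 * d * d * seq_len, seconds=seconds)


for seq_len in (1, 16, 64):
    print(seq_len, report(d=64, seq_len=seq_len))
```

The parameter count stays the same in every row while FLOPs and measured time change with the sequence length, which is why a single column of such a table would give an incomplete picture.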

💡 The number of parameters alone is not a reliable indicator of model efficiency, and more comprehensive metrics like FLOPs and speed/throughput should be considered to avoid the efficiency misnomer.
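The transcript's remark that FLOPs ignore the degree of parallelism can be illustrated with a toy cost model. This is an idealized simplification and not a measurement: `idealized_latency`, the notion of `lanes` (parallel units that each do one FLOP per time unit), and all numbers are assumptions made for the example, and memory access costs are ignored entirely.

```python
import math


def idealized_latency(total_flops: int, sequential_steps: int, lanes: int) -> int:
    """Time units needed when the work of each sequential step is spread over `lanes` units.

    Assumes total_flops divides evenly across the sequential steps.
    """
    flops_per_step = total_flops // sequential_steps
    return sequential_steps * math.ceil(flops_per_step / lanes)


total_flops = 1_024_000  # identical FLOP budget for both toy models

# Recurrent-style model: 1000 steps that must run one after another.
recurrent = idealized_latency(total_flops, sequential_steps=1000, lanes=4096)

# Fully parallel model: all work is available to the hardware at once.
parallel = idealized_latency(total_flops, sequential_steps=1, lanes=4096)

print(recurrent, parallel)  # 1000 250
```

With the same FLOP count, the sequential toy model needs four times as many time units here because most lanes sit idle at every step. Real hardware is far more complicated, but the direction of the effect matches the argument in the transcript.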


Chapters (5)

0:00 Model efficiency comparison

2:51 FLOPs

3:55 Number of parameters: means what?

6:31 Speed / throughput

9:39 Aleph Alpha (Sponsor)
