The efficiency misnomer | Size does not matter | What does the number of parameters mean in a model?
Key Takeaways
The video discusses what the number of parameters actually says about a deep learning model. It explains the efficiency misnomer and argues that parameter count should be reported together with other cost indicators such as FLOPs and speed/throughput, following the paper "The Efficiency Misnomer" by Dehghani et al. (2021).
Full Transcript
Hello there! Let's suppose you are Microsoft and you are releasing a huge and awesome model. How do you make it obvious it is the best, the biggest, the real thing? Correct: you compare your number of parameters against the literature and make a beautiful plot where you are top of the line. So congrats, now everybody is hailing you like you're the best of the best. And you might as well be; Ms. Coffee Bean is not here to claim otherwise. But what does the number of parameters tell us about a model anyway?
We came across a paper that tells us about a bigger picture. It argues that the number of parameters alone is not enough to compare against other models, and that reporting only one number about a model can be highly misleading when it comes to describing its efficiency. In other words, a high number of parameters does not necessarily mean that the model has high capacity, nor that it is slow. So get ready to get your mind enlarged as we welcome you to this AI Coffee Break! Thanks to Aleph Alpha for sponsoring this video. Stick around until the end of the video to learn more about them.
Model comparisons: what is this all about? Well, we need to measure scientific progress in deep learning. One way of doing so is to measure the model's quality on specific tasks or benchmarks, and to measure how sensible an output the model delivers. But we do not talk about this today, nor does this paper, because today is all about model efficiency, which relates to how much training a model costs, how much one has to pay to run inference for many users, and how long a user has to wait until the model delivers the answer.
To describe model efficiency, the literature uses things like, and we cite, the number of trainable parameters, the number of floating point operations, and speed/throughput. But the point of this paper is that if one chooses the right cost indicators, one can create a more favorable picture of the model's efficiency than is true in reality, and that model comparisons relying on only some of these metrics can result in unfair and incomplete comparisons. This is what the authors call the efficiency misnomer. In other words, to avoid the current situation where people just choose the metric they shine at and ignore the others, the authors propose that, well, all metrics should be reported when it comes to comparing architectures in general.
But what are these metrics, and why does one measure alone not suffice to describe the computational cost of a model? Why is Ms. Coffee Bean's secret model of 530 billion and one parameters not automatically the best of the best, and also not necessarily the slowest, since it reportedly (but don't fact-check this) has the most parameters in the world?
Well, let's look at the efficiency measure of floating point operations, or FLOPs for short, which is frequently used to measure computational cost in the literature. The more operations you have to do, the more computations need to be executed by a piece of hardware. But as the authors point out, and we cite, a model with low FLOPs may not actually be fast, given that FLOPs does not take into account information such as the degree of parallelism or hardware-related details like the cost of a memory access. So supposing that model A has a lower number of FLOPs than model B, it might well be that model B is still more efficient, because its floating point operations are not executed in sequence, like in a recurrent neural network, but in parallel, all at the same time, like in a transformer. Or even more: maybe model B is better suited for the hardware it runs on and, for example, memory access is faster for model B than for model A. So even though model A has lower FLOPs than B, model B could be much more computationally efficient.
Then let's look at the number of trainable parameters in a model, which is also a widely used measure of efficiency. But the authors here draw attention to the fact that a model with very few trainable parameters can still be very slow, especially when the same parameters are reused within the same computation, as is the case with convolutions in CNNs and with the feed-forward layers in transformers, where the same feed-forward network, so the same parameters, is used for each token, as we discussed in a previous video. So while the number of parameters of the fully connected layer does not increase with the sequence length for transformers, the compute and memory definitely do, because the outputs of the feed-forward layer for each token have to be computed and stored in memory. The authors highlight that the number of parameters of a model is a good proxy for knowing whether the model fits in memory, but it cannot be a cost indicator, or at least not alone.
Also, the number of parameters is often used to imply the capacity of the model, in the sense of its expressive power for capturing complex functions and dependencies between input and output. This is not necessarily true, because parameter sharing, which reduces the effective number of parameters, does not necessarily decrease the model's expressivity. Just think about it: if we removed the parameter sharing in transformers and each token had its own feed-forward neural network with its own parameters, which would make the number of parameters increase linearly with the sequence length, would that increase the accuracy of the model? And would that model necessarily learn a better function? No, it wouldn't, or not necessarily. I mean, we would have n position-specific feed-forward neural networks instead of just one, but it's questionable whether this would model anything better, or whether it would even converge with the data amounts we currently have, because we would have removed an important inductive bias of the transformer, which is that every token is treated the same, modulo what the attention scores dynamically computed beforehand.
And yes, it does look like increasing the number of parameters also increases model accuracy and quality. But again, this is just an impression we get, because the increase in the number of parameters usually comes with an increase in training data and with some specific architecture changes. So no, it's not just about the number of parameters, but also about the architecture and its inductive biases, the data (again, the data), and then all the training gimmicks, of course.
Other metrics like throughput and speed, which relate to how long a user has to wait for the model to give an answer, strongly depend on the hardware and sometimes simply on the skill of the programmer who implemented the whole thing, and do not necessarily say anything about the architecture itself. So this measure, again, if used on its own, is not a fair basis for comparing models.
And let's follow the paper a little closer and dive into the discrepancy between training and inference scenarios, which adds a whole new dimension to the discussion. Depending on context, training, inference, or even both can be more important to look at. For example, Ms. Coffee Bean might not have the data, the compute, or enough money to pay the electricity bill for the TPUs and GPUs that Microsoft, Google, and the like have to train a huge model on billions of data samples, but she does have enough VRAM to load the model and fine-tune it on more tasks. This is one of the appealing aspects of the pre-train and fine-tune paradigm: one general-purpose model is trained just once, and the environmental damage is done only once. And this training cost, we cite, can be very small compared to the inference cost when a model is deployed to be used by many, many users. But in cases where retraining has to be done very often, as is the case with recommender systems that have to be kept up to date with user preferences and available content at any moment, the training cost can become very important again. On the other hand, if inference efficiency is the bottleneck, because the deep-learning-based application runs on mobile phones, then the inference speed is very important.
So we have seen now how different measures can disagree when it comes to describing efficiency for the same model, and that one measure of efficiency alone can be highly misleading. Also, the number of parameters alone is not a measure of much else than how much storage the model needs. And the more efficiency metrics a paper reports, the better, especially to be sure that nobody leaves out the unfavorable metrics.
And speaking of companies that have the means to crunch the numbers for really large models, let's introduce our sponsor, Aleph Alpha. Aleph Alpha's vision is quite simple: to be the leading European company researching and creating next-generation strong artificial intelligence. Think of it as Europe's OpenAI, but then forget this again, as it puts a European twist on it. I love this: a little bit of competition in AI from Europe will benefit everyone, no? Aleph Alpha's European approach and datasets have an impact beyond pure language understanding. This means they're going multimodal. So what they do best at Aleph Alpha is research and development in AI with a European focus. They have a diverse team of currently more than 30 senior experts from all relevant fields, working relentlessly to create truly transformative AI technology, which surely leads to new ways of human-machine collaboration. Going off script now: I know some of the people working there, and they are great. From research to development to implementation, AI must benefit all of society, and that's why Aleph Alpha strives to align modern, generalizable AI research sustainably with ethical values. And to achieve that, they search for the best talent and partners. So if you're looking for a job, make sure to check out Aleph Alpha. For more information, visit their website or follow them on Twitter; see the links in the description below.
Thanks for watching this episode, and if you enjoyed this, do not forget to tell your own coffee beans about this channel.
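The transcript's point about the shared feed-forward layer can be made concrete with a little arithmetic. The sketch below is only an illustration and not taken from the paper: the function names are hypothetical, the sizes `d_model=512` and `d_ff=2048` are assumed example values, and the convention of counting one multiply-add as 2 FLOPs (ignoring biases and the activation) is a common approximation rather than a fixed standard. It shows that with parameter sharing the parameter count stays constant while the FLOPs grow linearly with sequence length, and that removing the sharing multiplies the parameters without changing the FLOPs.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of a two-layer feed-forward block (d_model -> d_ff -> d_model)."""
    first_layer = d_model * d_ff + d_ff
    second_layer = d_ff * d_model + d_model
    return first_layer + second_layer


def ffn_flops(d_model: int, d_ff: int, seq_len: int) -> int:
    """Approximate forward-pass FLOPs when the block is applied to every token.

    Assumption: one multiply-add counts as 2 FLOPs; bias additions and the
    activation function are ignored.
    """
    per_token = 2 * d_model * d_ff + 2 * d_ff * d_model
    return seq_len * per_token


def unshared_ffn_params(d_model: int, d_ff: int, seq_len: int) -> int:
    """Hypothetical variant from the transcript: one separate block per position."""
    return seq_len * ffn_params(d_model, d_ff)


# Illustrative sizes only (assumed, not from the video or the paper).
D_MODEL, D_FF = 512, 2048

for seq_len in (128, 512, 1024):
    shared = ffn_params(D_MODEL, D_FF)
    unshared = unshared_ffn_params(D_MODEL, D_FF, seq_len)
    flops = ffn_flops(D_MODEL, D_FF, seq_len)
    print(f"seq_len={seq_len:5d}  shared params={shared:,}  "
          f"unshared params={unshared:,}  FLOPs={flops:,}")
```

Both variants perform the same number of operations per token, so the parameter count alone says nothing about which one is cheaper to run; it only indicates how much memory the weights need.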
Original Description
How important is the number of parameters in deep learning models? But what about other measures like FLOPs or speed/throughput?
► Check out our sponsor Aleph Alpha 👉 https://www.aleph-alpha.de/ !
Follow them on Twitter: Aleph__Alpha
Paper 📜:
Dehghani, Mostafa, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. "The Efficiency Misnomer." arXiv preprint arXiv:2110.12894 (2021). https://arxiv.org/abs/2110.12894
🔗 Megatron-Turing NLG 530B: https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
donor, Dres. Trost GbR, Yannik Schneider
➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/
Outline:
00:00 Model efficiency comparison
02:51 FLOPs
03:55 Number of parameters: means what?
06:31 Speed / throughput
09:39 Aleph Alpha (Sponsor)
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: https://www.patreon.com/AICoffeeBreak
Ko-fi: https://ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: https://www.youtube.com/c/AICoffeeBreak/community
Twitter: https://twitter.com/AICoffeeBreak
Reddit: https://www.reddit.com/r/AICoffeeBreak/
YouTube: https://www.youtube.com/AICoffeeBreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research