LLM (Parameter Efficient) Fine Tuning - Explained!
Key Takeaways
The video explains parameter-efficient fine-tuning (PEFT) for large language models (LLMs), focusing on adapter layers that are added to a pre-trained Transformer such as BERT and fine-tuned on natural language processing tasks such as question answering. It shows how this reduces the number of trainable parameters and memory usage while maintaining performance comparable to full fine-tuning.
Full Transcript
Greetings, fellow learners. Before we get into this fantastical world of fine-tuning, I have a thought-provoking question for you: do you think AI has peaked in its capabilities, or is there still more to come? In my case, while I do think that AI is sometimes hyped for the wrong reasons, I am cautiously optimistic. I do think there is scope for AI to get better at learning, and hopefully we are going to reach a state where AI is more performant while also being safe to use. But that's my take. Now, turning this question over to you: do you think that AI has peaked in its capabilities, or is there still more to come? Comment your thoughts down below; I would love to hear them.

This video is going to be divided into a few passes where we illustrate the whats, the whys, and the hows of parameter-efficient fine-tuning. So let's get to it.

In order to explain why parameter-efficient fine-tuning exists, I'm going to rewind the clock and walk through the timeline of NLP progress. Let's start with recurrent neural networks. Around 2016, recurrent neural networks were the state of the art for language problems, specifically sequence-to-sequence problems. A sequence is an ordered set of tokens. For example, if we wanted to train a network for language modeling, we would sequentially pass in some tokens and it would generate the next set of tokens. This is language modeling. An issue with this, though, is that data processing is sequential: you have to pass in words one at a time, and because of that, recurrent networks don't make use of modern GPUs very well. Another issue is that training is very slow, so slow that a truncated version of backpropagation (through time) is used to train recurrent neural networks.

To solve these issues, the Transformer neural network was introduced. Transformers are encoder-decoder architectures that make use of attention. As an example of how one would solve language modeling, we pass all of the input words to the encoder simultaneously. The encoder generates a vector for each word with some meaning encoded into it; here the input is "I love going". These vectors are then passed into the decoder, along with the contextual words, in order to generate the next words one at a time. A pro of this architecture is that the input can be processed in parallel and hence can make use of GPUs. A con is that every task requires a lot of data if we want to train from scratch. We saw it for language modeling, but if you wanted to train for question answering, for example, you would need a lot of data from scratch.

So how do we deal with this data problem? I'll give you a second to guess. That's right, you guessed it: transfer learning. Transfer learning involves taking an untrained model and training it on a baseline language task. In this case that task is language modeling, where we feed the model multiple examples until it is trained. This model now becomes a pre-trained model on language modeling. Next, we take this pre-trained model and fine-tune it on a specific task, in this case question answering. We feed it a question and it generates a response to that question, and we feed it multiple examples until the model is finally fine-tuned to answer questions.

This kind of pre-training and fine-tuning setup using transfer learning is the basis for language models today, including BERT. There we have a pre-training phase where the model is trained on masked language modeling, where it predicts words in between the sentence instead of the next word, and also on next sentence prediction, which is predicting, given two sentences, whether sentence B follows sentence A. Once this model is trained, the model has weights, and these are then fine-tuned on specific tasks such as question answering. This is also used in ChatGPT, where a pre-trained language model then undergoes supervised fine-tuning on question answering so that it can better answer questions.

The pro of fine-tuning is that it requires less data than training a model from scratch, which is great. The cons are that it is time-consuming and expensive to train as LLMs get larger. For every single fine-tuning task that you have, you need to store all the model parameters again and again, which is a lot of storage, and it can take a lot of time to train. Then we have the issue of catastrophic forgetting, which can lead to overfitting. Catastrophic forgetting is when the model, while being fine-tuned, updates all of its parameters such that it kind of forgets what it learned during the pre-training phase. Both of these issues are caused by the fact that all model parameters are trainable during the fine-tuning phase.

So how do we deal with the fact that all model parameters are trainable during the fine-tuning phase? That's right, you guessed it: parameter-efficient fine-tuning. Taking the definition from a paper, parameter-efficient fine-tuning is an effective solution that reduces the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. So this is what parameter-efficient fine-tuning does, and I hope the historical context also made it clearer why it exists.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Parameter-efficient fine-tuning is a set of techniques (A) to ensure a model does not overfit during fine-tuning, (B) to reduce the number of trainable parameters during fine-tuning, (C) to increase the model performance during fine-tuning, or (D) to decrease the model training time during pre-training. I'll give you a few seconds to answer this question. The correct answer is B, but can you tell me why? Give your reason down in the comments below and let's have a discussion. And if you think I do deserve it at this point, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz one and pass one of this explanation, but keep paying attention, because I will be back to quiz you.

In this pass, let's go through how exactly parameter-efficient fine-tuning is implemented. This here is the architecture of the Transformer neural network that we looked at previously. I want you to pay attention to the encoder architecture, where we have multi-head attention, some layer normalization, feed-forward networks, and also a bunch of skip connections. What we're going to do is take the entire Transformer layer and insert two adapter layers per task. These adapter layers have a multi-layer perceptron type of architecture; essentially, each one is just two feed-forward layers. The top and bottom layers here have d neurons, where d is the dimension of the internal representation of words, or tokens in general, in the Transformer layer. For example, in BERT large this is 1,024. In between there is a bottleneck layer of size m, which is a lot less than d; this could be 8, 64, 256, or something like that.

Overall, the total number of parameters here is 2md + m + d. To see how this comes about: between the d layer of neurons and the m layer of neurons we have m × d weights, so that's md weights for the down-projection. Similarly, for the up-projection we have md weights too. That's 2md weights, which gives the first term. But we also have biases. You can imagine a bias neuron attached to every one of the m neurons, which gives m more weights, hence the plus m. Similarly, there is a bias attached to all of the d neurons, which gives d more weights, hence the plus d. That's how we end up with 2md + m + d additional parameters, and all of these are trainable parameters that we introduce when we want to train this model on, let's say, question answering. That's just for one adapter, but we have two such adapters per Transformer layer, so it is repeated twice. I hope you can see how the setup looks, architecturally speaking.

Now that you know how these adapters look, let's see this in action. Let's say that, without any adapters, we have the BERT architecture, which we first pre-train on masked language modeling. Masked language modeling means that we predict the words in between the sentence, kind of like filling in the blanks. So if the input is "I love [blank] to the [blank]", the output could be "going" and "park". We train it on masked language modeling and also pre-train it on next sentence prediction: given two sentences, like "I love toys" and "The sky is blue", the model determines whether sentence two follows sentence one in a semantic sense. In this case we say it's false, because these two do not follow each other.

Once the model is pre-trained on these two tasks, we fine-tune BERT with full fine-tuning. What this entails is the following. Let's say we are fine-tuning BERT on the question answering objective: we pass in a question, and we also pass in the label, which is the answer to this question. In the forward pass the model generates a response, and then in the backward pass we perform backpropagation: gradients propagate through the entire network and update every single weight that exists in the network. The issue with this is that all parameters updated during backpropagation now need to be stored. In the case of BERT large, which has about 345 million parameters, all 345 million parameters need to be stored somewhere when we fine-tune on question answering. This can take up a lot of storage, especially for very large language models with billions or trillions of parameters.

Now let's say that we also want to fine-tune BERT on another task, sentiment analysis. We take our pre-trained BERT and pass in, say, "The movie was enjoyable", where the sentiment is positive; that's how the training data looks. We have a forward pass where the data flows through, and in the backward pass, once again, all of the weights are updated. So we have 345 million new parameters that need to be stored somewhere else. Every time we have a new task, all of these parameters are updated and need to be stored somewhere, and this can take up a lot of space.

For context, this is a nice chart that shows a bunch of language models as circles, where the size of the circle indicates how many parameters the model has. You can see that BERT is teeny tiny with around 300 million parameters; we have GPT-2 with 1.5 billion parameters, GPT-3 with 175 billion parameters, and then some very large language models that have over a trillion parameters. So you can imagine that for these very large language models, every time we want to fine-tune on a specific task, it takes terabytes of storage just for that one task. This means that only some of the biggest players and the biggest companies can really fine-tune these models, and we want to make fine-tuning more accessible to everyone.

Now that we've looked at how BERT is fine-tuned, let's look at how BERT is fine-tuned with adapters. First of all, BERT is already pre-trained, without adapters, on masked language modeling and next sentence prediction. We then add the adapters, so this is basically BERT with adapters: two adapters per Transformer layer. Now we fine-tune this on question answering. We feed in the question and the label and train the model. In the forward pass everything flows as usual, but in the backward pass we freeze the weights of every single layer except for the adapter layers themselves. When I say we freeze these layers, I mean that during the backpropagation step the gradients are still calculated, but they do not update these weights. So the weights in the attention layers, the layer normalizations, and the feed-forward layers are not updated; only the adapter layers have their weights updated. And because only the adapter weights are updated during backpropagation, only these adapter weights need to be stored, and, as we calculated before, they are not that large.

If we wanted to fine-tune on another task, we would first replace these adapters so that it's a fresh start from the pre-trained language model, and then fine-tune on sentiment analysis, passing in the data and the labels. The forward pass occurs as usual, and in the backward pass, once again, the model layers are frozen: the gradients do propagate in the backward direction, but no weights are updated except for the adapter weights. And because only adapter weights are updated during backprop, only the adapter weights need to be stored. So I hope you have a clear picture of how these adapters look and how they reduce the number of stored parameters for every fine-tuned task.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. For a task, full fine-tuning increases the trainable model parameters typically by around (A) 1%, (B) 20%, (C) 100%, or (D) 200%. I'll give you a few seconds to answer this question. The correct answer is C, but can you tell me why? Comment your reasoning down below and let's have a discussion. That's going to do it for quiz two and pass two of this explanation, but keep paying attention, because I will be back to quiz you.

Now that we've looked at how fine-tuning with these adapters works, let's quantify what the savings really are. To do that, we use BERT large as an example. BERT large has 345 million or so trainable parameters. The number of trainable parameters in full fine-tuning per task is 345 million, because all of the parameters are updated when we fine-tune on a specific task. But what is the number of trainable parameters in parameter-efficient fine-tuning, or at least in the version of it that we've discussed in this video? Let's calculate it. The number of parameters per adapter is 2md + m + d, as we discussed previously. Here we take m, the bottleneck size, to be 64, which is around the ballpark that's typically chosen. d is 1,024 in the BERT large case; that is the number of dimensions of the internal vectors of BERT large. This means every word or token is represented in BERT large by 1,024 dimensions, and the adapter's input and output dimensions have to match that for data to flow sequentially through the network. Plugging those numbers into the equation, we get 132,160 parameters per adapter. There are two adapters per Transformer layer, and BERT large has 24 such Transformer layers stacked sequentially, so the number of trainable parameters for a given fine-tuning task is around 6.34 million. So instead of training 345 million parameters we are only training 6.34 million, which, if you divide these two numbers, is only about 1.8% of all of the parameters. That is a huge reduction in how much storage space is required for fine-tuning multiple tasks.

If we want a better sense of the numbers and of how performance looks, this table shows the performance of BERT large with and without adapters. The total number of tasks that we're training on is nine; this is the GLUE benchmark. GLUE is a benchmark for NLP tasks that lets us compare multiple models using the same core metrics and datasets. What's really cool here is that even with the adapter approaches, we achieve performance almost as good as full fine-tuning while using only a fraction, like 2.1% or 3.6%, of the total number of parameters, instead of adding 100% of the parameters per task. So that shows great savings.

I also want to make it clear that the version of parameter-efficient fine-tuning described in this video was introduced back in 2019, and it sprouted a bunch of other parameter-efficient fine-tuning methods, which you can see over here and which can be divided into many categories.

Quiz time! Oh, this is going to be a fun one. For a task, parameter-efficient fine-tuning increases the trainable model parameters typically by around (A) 1%, (B) 20%, (C) 100%, or (D) 200%. I'll give you a few seconds to answer this question. The correct answer is A, but can you tell me why? Comment your reasoning down below and let's have a discussion. And if you think I do deserve it at this point, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz three and pass three of this explanation. But before we go, let's generate a summary.

As a summary, we first saw how fine-tuning came about, but also its cons: it's pretty time-consuming and expensive to train as LLMs have gotten larger, and it has the problem of catastrophic forgetting. Both of these are linked to the fact that all model parameters are trainable during the fine-tuning phase. One way to solve this is parameter-efficient fine-tuning, which reduces the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. We then saw how we can add adapter layers to a BERT network in order to fine-tune the model while only needing to store the adapter weights, and these weights are a very small percentage of the total number of weights that we would see in full fine-tuning. And while the fine-tuning approach we described was just the first one, which came out in 2019, there have been many others since, which we can take a look at in a future video. Regardless, I'll link all of these resources in the description below so that you can check them out. Thank you all so much; that's all we have for today. If you think I do deserve it, please do consider giving this video a like; it'll help me out a lot. I will see you in the next one. Bye-bye.
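The adapter arithmetic walked through in the transcript can be reproduced with a few lines of Python. This is only a sketch of the counting argument, not code from the video: the function and constant names are made up for illustration, and the 345 million figure is the approximate BERT large size quoted by the speaker. With m = 64 and d = 1,024 the formula 2md + m + d evaluates to 132,160 parameters per adapter.

```python
def adapter_params(d: int, m: int) -> int:
    """Count the parameters of one bottleneck adapter (hypothetical helper).

    Down-projection: d*m weights plus m biases.
    Up-projection:   m*d weights plus d biases.
    """
    down = d * m + m
    up = m * d + d
    return down + up  # same as 2*m*d + m + d


# Values taken from the transcript; the names are illustrative assumptions.
D_MODEL = 1024                   # hidden size d of BERT large
BOTTLENECK = 64                  # bottleneck size m
NUM_LAYERS = 24                  # Transformer layers in BERT large
ADAPTERS_PER_LAYER = 2           # two adapters per Transformer layer
FULL_MODEL_PARAMS = 345_000_000  # approximate size quoted in the video

per_adapter = adapter_params(D_MODEL, BOTTLENECK)
total_adapter = per_adapter * ADAPTERS_PER_LAYER * NUM_LAYERS
fraction = total_adapter / FULL_MODEL_PARAMS

print(f"per adapter:         {per_adapter:,}")
print(f"all adapters:        {total_adapter:,}")
print(f"share of full model: {fraction:.1%}")
```

Changing BOTTLENECK to 8 or 256 shows how the bottleneck size trades stored parameters per task against adapter capacity, which is the knob the transcript refers to when it lists typical values of m.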
Original Description
Parameter efficient fine tuning is increasingly important in NLP and genAI. Let's talk about it.
RESOURCES
[1 📚] RNNs were the SOTA for sequence tasks: https://arxiv.org/pdf/1409.0473
[2 📚] Then transformers came on the scene: https://arxiv.org/pdf/1706.03762
[3 📚] Pretraining and Finetuning architectures like BERT came along: https://arxiv.org/pdf/1810.04805
[4 📚] But LLMs are huge: https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt/
[5 📚] Few shot learning by GPT-3 tries to address the issue: https://arxiv.org/pdf/2005.14165
[6 📚] Parameter Efficient Transfer Learning reduces the trainable parameters via additive adapters (the first PEFT technique): https://arxiv.org/pdf/1902.00751
[7 📚] Since 2019, there have been many PEFT techniques introduced: https://arxiv.org/pdf/2312.12148
[8 📚] Other notable techniques include prefix-tuning: https://arxiv.org/pdf/2101.00190
[9 📚] And LoRA: https://arxiv.org/pdf/2106.09685
[10 📚] And a quantized version of LoRA called QLoRA: https://arxiv.org/pdf/2305.14314
[11 📚] We see these adapters in use in LLMs today like Llama: https://arxiv.org/pdf/2303.16199
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
PLAYLISTS FROM MY CHANNEL
⭕ Deep Learning 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_NwyY_PeSYrYfsvHZnHGPU
⭕ Natural Language Processing 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ Reinforcement Learning 101: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing 101: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE