LLM (Parameter Efficient) Fine Tuning - Explained!

CodeEmporium · Advanced · 🧠 Large Language Models · 1y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 90%

Key Takeaways

The video discusses parameter efficient fine tuning for large language models (LLMs), specifically focusing on the Transformer neural network architecture and its application in natural language processing tasks, such as question answering. It highlights the importance of reducing the number of trainable parameters and memory usage while maintaining comparable performance to full fine tuning.

Full Transcript

Greetings, fellow learners. Now, before we get into this fantastical world of fine-tuning, I have a thought-provoking question for you: do you think AI has peaked in its capabilities, or is there still more to come? In my case, while I do think that AI is sometimes hyped for the wrong reasons, I am cautiously optimistic, and I do think that there is scope for AI to get better at learning. Hopefully we are going to reach a state where AI is more performant while also being safe to use. But that's my take. Now, turning this question over to you: do you think that AI has peaked in its capabilities, or is there still more to come? Comment your thoughts down below; I would love to hear them.

This video is going to be divided into a few passes, where we're going to illustrate the whats, the whys, and the hows of parameter-efficient fine-tuning. So let's get to it.

In order to explain why parameter-efficient fine-tuning exists, I'm going to rewind the clock and build the explanation from the timeline of NLP progress. So let's start with recurrent neural networks. Recurrent neural networks, in around 2016, were the state of the art for language problems, and specifically sequence-to-sequence problems. A sequence is an ordered set of tokens. For example, if we wanted to train a network for language modeling, we would sequentially pass in some tokens and it would generate the next set of tokens. This is language modeling.

An issue with this, though, is that data processing is sequential: you have to pass in words one at a time, and because of that, recurrent networks don't make good use of modern GPUs. Another issue is that training is very slow, so slow that recurrent neural networks are trained with a truncated version of backpropagation.

To solve these issues, the Transformer neural network was introduced. These Transformer neural networks are encoder-decoder architectures that make
use of attention. As an example of how it would solve language modeling, we pass in all of the input words to the encoder simultaneously. The encoder will generate a vector for each word that has some meaning encoded into it, so this is "I love going". These vectors are then passed into the decoder, along with the contextual words, in order to generate the next words one at a time. A pro of this architecture is that the input can be processed in parallel and hence can make use of GPUs. A con is that every task requires a lot of data if we want to train from scratch. We saw it for language modeling, but if you wanted to train for question answering, for example, you would again need a lot of data.

So how do we deal with this data problem? I'll give you a second to guess. That's right, you guessed it: it's transfer learning. Transfer learning involves taking an untrained model and training it on a baseline language task. In this case it's going to be language modeling, where we feed it multiple examples in order to finally train this model, so this model becomes a pre-trained model on language modeling. Next, we take this pre-trained model and fine-tune it on a specific task. In this case, we're going to fine-tune it on question answering: we'll feed it a question and it will generate a response to said question, and we're going to feed it multiple examples until the model is finally fine-tuned to answer questions.

This kind of pre-training and fine-tuning architecture using transfer learning is the basis for language models today, including BERT. We have a pre-training phase where the model is trained on masked language modeling, where it predicts words in between the sentence instead of the next word, and also on next sentence prediction, which is predicting, given two sentences, whether sentence B follows sentence A. Then, once this model is trained, the model has weights;
these are then fine-tuned on specific tasks, for example question answering. This is also used in ChatGPT, where we have a pre-trained language model which then undergoes supervised fine-tuning on question answering so that it can better answer questions.

Now, the pros of fine-tuning are that it requires less data than training a model from scratch, which is great. The cons are that it is time-consuming and expensive to train as LLMs get larger. For every single fine-tuning task that you have, you need to store all the model parameters again and again, which is a lot of storage, and it can take a lot of time to train. Then we have the issue of catastrophic forgetting, which can lead to overfitting. Catastrophic forgetting is when the model, while being fine-tuned, updates all of its parameters such that it kind of forgets what it learned during the pre-training phase. Both of these issues are caused by the fact that all model parameters are trainable during the fine-tuning phase.

So how do we deal with the fact that all model parameters are trainable during the fine-tuning phase? That's right, you guessed it: parameter-efficient fine-tuning. The definition of parameter-efficient fine-tuning, taken from a paper, is that it is an effective solution that reduces the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. So this is what parameter-efficient fine-tuning does, and I hope the historical context also explained why it exists.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Parameter-efficient fine-tuning is a set of techniques: A, to ensure a model does not overfit during fine-tuning; B, to reduce the number of trainable parameters during fine-tuning; C, to increase the model performance during fine-tuning; or D, to decrease the model training time during pre-training. I'll give you a few seconds to answer this question. The
correct answer is B, but can you tell me why? Give your reason down in the comments below and let's have a discussion. And if you think I deserve it at this point, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz one and pass one of this explanation, but keep paying attention, because I will be back to quiz you.

In this pass, let's go through how exactly parameter-efficient fine-tuning is implemented. This here is the architecture of the Transformer neural network that we took a look at previously. I want you to pay attention to the encoder architecture over here, where we have multi-head attention, some layer normalization, feed-forward networks, and also a bunch of skip connections. What we're going to do is take the entire Transformer layer and insert two adapter layers per task. These adapter layers have a multi-layer-perceptron type of architecture; essentially each is just two feed-forward layers. The top and bottom layers here are each going to have d neurons, where d is the size of the internal representation of words, or just general tokens, in the Transformer layer. For example, in BERT large this is 1,024. And there is going to be a bottleneck layer here of size m, which is a lot less than d; this could be 8, 64, 256, or something like that.

Overall, the total number of parameters right here is going to be 2md + m + d. If you want to see how exactly this comes about: between this layer of m neurons and this layer of d neurons we have m × d weights, so that's md weights here. Similarly, for this up-projection we have md weights too, so that's 2md weights, and that's how you get the first term. But we also have biases. So over here, you can imagine there's going to be
a bias neuron attached to every one of the m neurons, which gives m more weights, hence + m. Similarly, there's a bias over here attached to all of the d neurons, so we have d more weights, hence + d. That's how we end up with 2md + m + d additional parameters, and all of these are trainable parameters that we introduce when, let's say, we want to train this model on question answering. And that's just for one adapter; it's repeated twice, because we have two such adapters here. So I hope you can see how the setup looks architecturally.

Now that you know how these adapters look, let's try to see this in action. Let's say that, without any adapters, we have the BERT architecture, which we initially pre-train on masked language modeling. Masked language modeling means that we predict the words in between the sentence, kind of like filling in the blanks. So if the input is "I love [blank] to the [blank]", the output could be "going" and "park". We'll train it on masked language modeling and also pre-train it on next sentence prediction: given two sentences, like "I love toys" and "the sky is blue", the model determines whether yes, sentence two follows sentence one, or no, sentence two does not follow sentence one in a semantic sense. In this case we're going to say it's false, because these two do not follow each other.

Once the model is pre-trained on these two tasks, we are then going to fine-tune BERT with full fine-tuning. What this entails is, let's say we are fine-tuning BERT on the question-answering objective: we pass in a question, and we also pass in the label, which is the answer to this question. In the forward pass the model generates a response, and then in the backward pass we perform backpropagation, which is
going to flow through the entire network: gradients propagate through the network and update every single weight that exists in it. The issue with this is that all of the parameters updated during backpropagation now need to be stored. In the case of BERT large, where we are fine-tuning on question answering, we have about 345 million parameters, and all 345 million parameters, once we fine-tune on question answering, need to be stored somewhere. This can take up a lot of storage, especially for very large language models with billions or trillions of parameters.

Now let's say that we also want to fine-tune BERT on another task, which is sentiment analysis. We take our pre-trained BERT, and the input is "the movie was enjoyable" and the sentiment in this case is positive; that's how the training data looks. We have a forward pass where data flows through, and then in the backward pass, once again, all of the weights are updated, so we have 345 million new parameters that need to be stored somewhere else. Every time we have a new task, all of these parameters are updated and need to be stored somewhere, and this can take up a lot of space.

For context, this is a nice chart which shows a bunch of language models as circles, where the size of the circle indicates how many parameters the model has. You can see that BERT over here is teeny tiny with 300 million parameters, we have GPT-2 with 1.5 billion parameters, GPT-3 with 175 billion parameters, and then some very large language models which have over a trillion parameters. So you can imagine that for these very large language models, every single time we want to train on a specific task, it's going to take terabytes of data just to fine-tune on one task. This means that only some of the biggest players and the biggest companies
can really fine-tune these models, and we want to make AI more accessible for everyone to fine-tune.

So now that we took a look at how BERT is fully fine-tuned, let's take a look at how BERT is fine-tuned with adapters. First of all, BERT is already pre-trained, without adapters, on masked language modeling and next sentence prediction. We then add these adapters, so this is basically BERT with adapters: we have two adapters per Transformer layer. Now we fine-tune this on question answering. Let's say we feed in the question and also the label, and then we train the model. In the forward pass everything flows just fine, but in the backward pass we freeze the weights of every single layer except for the adapter layers themselves. When I say we freeze these layers, it means that during the backpropagation step the gradients are still calculated, but they don't update these weights. So the weights in the attention layers, the layer normalizations, and the feed-forward layers are not updated; only the adapter layers that we see here have their weights updated. And because only the adapter weights are updated during backpropagation, only these adapter weights need to be stored, and these adapter weights, like we calculated before, are not really that large.

If we wanted to fine-tune on another task, we can first replace these adapters so that it's a fresh start after pre-training the language model, and then fine-tune on sentiment analysis, passing in the data and the labels. The forward pass occurs as usual, and in the backward pass, once again, the model layers are frozen: the gradients do propagate in the backward direction, but no weights are updated except for these adapter weights, and
because only adapter weights are updated during backprop, only the adapter weights need to be stored. So I hope you have a clear picture of how these adapters look and how they reduce the number of stored parameters for every fine-tuned task.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. For a task, full fine-tuning increases the trainable model parameters typically by around: A, 1%; B, 20%; C, 100%; or D, 200%? I'll give you a few seconds to answer this question. The correct answer is C, but can you tell me why? Comment your reasoning down below and let's have a discussion. That's going to do it for quiz two and pass two of this explanation, but keep paying attention, because I will be back to quiz you.

Now that we looked at how fine-tuning with these adapters works, let's compute what the savings really are in some quantified way. To do that, we are going to use BERT large as an example. In BERT large we have 345 million or so trainable parameters. The number of trainable parameters in full fine-tuning, per task, is 345 million, because all of the parameters are updated when we fine-tune on a specific task. But what is the number of trainable parameters in parameter-efficient fine-tuning, or at least the version of it that we've discussed in this video? Let's try calculating this. The number of parameters per adapter is 2md + m + d, like we discussed previously. In this case we can take m, the bottleneck layer, to be of size 64; that's around the ballpark that's typically chosen. d is 1,024 in the BERT large case; 1,024 is the number of dimensions of the internal vectors of BERT large. This means that every word or token is represented in BERT large by 1,024 dimensions, and the adapter's dimensions have to match with
those dimensions for data to flow sequentially through the network, hence 1,024. Plugging those numbers into this equation, we get 132,160 parameters per adapter. There are two adapters per Transformer layer, and BERT large has 24 such Transformer layers stacked sequentially, so the number of trainable parameters for a given fine-tuning task ends up being around 6.34 million. So instead of training 345 million parameters, we are only training 6.34 million, which, if you divide these two numbers, is only 1.8% of all of these parameters. That is a huge reduction in how much storage space is required for fine-tuning multiple tasks.

If we want to get a better sense of even more numbers and how performance looks, this table shows the performance of BERT large with and without the adapters. For BERT large, the total number of tasks that we're training on is nine. This is the GLUE benchmark; GLUE is a benchmark for NLP tasks so that we can compare multiple models using the same core metrics and sets of data. What's really cool here is that even with the adapter approaches, we achieve performance that is almost as good as full fine-tuning while using only a fraction, like 2.1% or 3.6%, of the total number of parameters, instead of adding 100% of the parameters per task. So that shows great savings.

I also want to make it clear that the version of parameter-efficient fine-tuning that we described in this video was introduced back in 2019, and it sprouted a bunch of other parameter-efficient fine-tuning methods that you can see over here, which can be divided into many categories.

Quiz time! Ooh, this is going to be a fun one. For a task, parameter-efficient fine-tuning increases the trainable model parameters typically by around: A, 1%; B, 20%; C, 100%; or D, 200%? I'll give you a few seconds to answer this question. The correct answer is A, but can you tell me why? Comment
your reasoning down below and let's have a discussion. And if you think I deserve it at this point, please do consider giving this video a like, because it will help me out a lot. Now, that's going to do it for quiz three and pass three of this explanation, but before we go, let's generate a summary.

As a summary, we first saw how fine-tuning came about, but also its cons, in the sense that it's pretty time-consuming and expensive to train as LLMs have gotten larger, and it also has the problem of catastrophic forgetting. Both of these are linked to the fact that all model parameters are trainable during the fine-tuning phase. One way to solve this is parameter-efficient fine-tuning, which reduces the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. We then saw how we can add adapter layers to a BERT network in order to fine-tune the model while only needing to store the adapter weights, and these weights are a very small percentage of the total number of weights that we would see in full fine-tuning. And while the fine-tuning approach we described was just the first one, which came out in 2019, there have been many others that have also come out, which we can take a look at in a future video. Regardless, I'll link all of these resources down in the description below so that you can check them out.

Thank you all so much. That's all that we have for today, and if you think I deserve it, please do consider giving this video a like; it'll help me out a lot. I will see you in the next one. Bye-bye!
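
The adapter arithmetic from the transcript can be checked with a few lines of Python. This is a minimal sketch, not code from the video: the function name `adapter_params` and the constant names are illustrative assumptions, while the values (d = 1,024, m = 64, 24 layers, two adapters per layer, roughly 345 million parameters in BERT large) are the ones quoted by the speaker.

```python
def adapter_params(d: int, m: int) -> int:
    """Trainable parameters in one bottleneck adapter: 2md + m + d."""
    down_projection = d * m + m  # d -> m weights plus one bias per bottleneck neuron
    up_projection = m * d + d    # m -> d weights plus one bias per output neuron
    return down_projection + up_projection


# Values quoted in the transcript for BERT large.
D_MODEL = 1024                   # size of each token vector (d)
BOTTLENECK = 64                  # adapter bottleneck size (m)
NUM_LAYERS = 24                  # Transformer layers in BERT large
ADAPTERS_PER_LAYER = 2           # two adapters are inserted into every layer
FULL_MODEL_PARAMS = 345_000_000  # approximate figure used in the video

per_adapter = adapter_params(D_MODEL, BOTTLENECK)
per_task = per_adapter * ADAPTERS_PER_LAYER * NUM_LAYERS
fraction = per_task / FULL_MODEL_PARAMS

print(f"parameters per adapter: {per_adapter:,}")
print(f"trainable parameters per task: {per_task:,}")
print(f"share of full fine-tuning: {fraction:.1%}")
```

With these inputs the script reports 132,160 parameters per adapter and 6,343,680 per task, about 1.8% of the full model, in line with the 6.34 million and 1.8% figures quoted in the video.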

Original Description

Parameter efficient fine tuning is increasingly important in NLP and genAI. Let's talk about it.

RESOURCES
[1 📚] RNNs were the SOTA for sequence tasks: https://arxiv.org/pdf/1409.0473
[2 📚] Then transformers came on the scene: https://arxiv.org/pdf/1706.03762
[3 📚] Pretraining and Finetuning architectures like BERT came along: https://arxiv.org/pdf/1810.04805
[4 📚] But LLMs are huge: https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt/
[5 📚] Few shot learning by GPT-3 tries to address the issue: https://arxiv.org/pdf/2005.14165
[6 📚] Parameter Efficient Transfer Learning reduces the trainable parameters via additive adapters (the first PEFT technique): https://arxiv.org/pdf/1902.00751
[7 📚] Since 2019, there have been many PEFT techniques introduced: https://arxiv.org/pdf/2312.12148
[8 📚] Other notable techniques include prefix-tuning: https://arxiv.org/pdf/2101.00190
[9 📚] And LoRA: https://arxiv.org/pdf/2106.09685
[10 📚] And a quantized version of LoRA called QLoRA: https://arxiv.org/pdf/2305.14314
[11 📚] We see these adapters in use in LLMs today like Llama: https://arxiv.org/pdf/2303.16199

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

PLAYLISTS FROM MY CHANNEL
⭕ Deep Learning 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_NwyY_PeSYrYfsvHZnHGPU
⭕ Natural Language Processing 101: https://www.youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ Reinforcement Learning 101: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
⭕ Natural Language Processing 101: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE

This video teaches the importance of parameter efficient fine tuning for large language models and how to implement adapter layers in Transformer neural networks to reduce trainable parameters and memory usage. It provides a comprehensive overview of the concepts and techniques involved in fine-tuning LLMs.

Key Takeaways

Train a recurrent neural network on a sequence to sequence problem
Pass input words to the encoder simultaneously in a Transformer neural network
Generate vectors for each word in the encoder
Pass vectors to the decoder to generate the next words
Fine-tune a pre-trained model on a specific task, such as question answering
Add adapter layers to a BERT network to fine-tune only the adapter weights
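
The last takeaway, updating only the adapter weights while the pre-trained weights stay frozen, can be sketched in dependency-free Python. This is an illustration under stated assumptions rather than the video's implementation: the names `Adapter`, `linear`, and `sgd_step` and the parameter names are hypothetical, the ReLU nonlinearity and the skip connection around the adapter are not spelled out in the transcript, and a real setup would use a deep learning framework instead of plain lists.

```python
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows (out x in)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class Adapter:
    """Bottleneck adapter: project d -> m, apply ReLU, project m -> d, add the input."""

    def __init__(self, d, m, rng):
        def init(rows, cols):
            # Small random weights so the adapter starts close to an identity map.
            return [[rng.uniform(-0.01, 0.01) for _ in range(cols)]
                    for _ in range(rows)]
        self.params = {
            "down.weight": init(m, d), "down.bias": [0.0] * m,
            "up.weight": init(d, m), "up.bias": [0.0] * d,
        }

    def forward(self, x):
        p = self.params
        hidden = [max(0.0, h) for h in linear(x, p["down.weight"], p["down.bias"])]
        update = linear(hidden, p["up.weight"], p["up.bias"])
        return [xi + ui for xi, ui in zip(x, update)]  # skip connection (assumption)

    def num_params(self):
        total = 0
        for value in self.params.values():
            if isinstance(value[0], list):   # weight matrix
                total += sum(len(row) for row in value)
            else:                            # bias vector
                total += len(value)
        return total


def sgd_step(params, grads, trainable_prefix, lr=0.1):
    """Gradients exist for every parameter, but only names with the prefix are updated."""
    for name, grad in grads.items():
        if name.startswith(trainable_prefix):
            params[name] = [p - lr * g for p, g in zip(params[name], grad)]


rng = random.Random(0)
adapter = Adapter(d=8, m=2, rng=rng)
token = [rng.uniform(-1.0, 1.0) for _ in range(8)]
output = adapter.forward(token)

# Toy flat parameter store: one frozen pre-trained vector and one adapter vector.
model_params = {"attention.weight": [0.5, -0.5], "adapter.up.bias": [0.0, 0.0]}
model_grads = {"attention.weight": [1.0, 1.0], "adapter.up.bias": [1.0, 1.0]}
sgd_step(model_params, model_grads, trainable_prefix="adapter.")

print(len(output), adapter.num_params())
print(model_params)
```

The toy adapter with d = 8 and m = 2 holds 2md + m + d = 42 parameters, and after the update step only the `adapter.` entry has changed, so only that entry would need to be saved for the task.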

💡 Parameter efficient fine tuning with adapter layers can reduce the number of trainable parameters and memory usage while achieving comparable performance to full fine tuning, making it a crucial technique for optimizing LLMs.
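
To make the storage argument concrete, the sketch below compares how many parameters must be kept when the same base model is adapted to several tasks. It is a simplified illustration: the function names are hypothetical, it counts parameters rather than bytes, it ignores any task-specific output layers, and it assumes the frozen base model is stored once and shared. The inputs reuse figures from the video (345 million parameters for BERT large, about 6.34 million adapter parameters per task, nine GLUE tasks).

```python
def stored_full_finetuning(base_params: int, num_tasks: int) -> int:
    """Every task keeps its own complete copy of the fine-tuned model."""
    return base_params * num_tasks


def stored_with_adapters(base_params: int, adapter_params_per_task: int,
                         num_tasks: int) -> int:
    """One shared frozen base model plus a small set of adapter weights per task."""
    return base_params + adapter_params_per_task * num_tasks


BASE_PARAMS = 345_000_000            # BERT large, figure used in the video
ADAPTER_PARAMS_PER_TASK = 6_343_680  # 24 layers x 2 adapters x (2md + m + d), d=1024, m=64
NUM_TASKS = 9                        # number of GLUE tasks mentioned in the video

full = stored_full_finetuning(BASE_PARAMS, NUM_TASKS)
adapters = stored_with_adapters(BASE_PARAMS, ADAPTER_PARAMS_PER_TASK, NUM_TASKS)

print(f"full fine-tuning: {full:,} parameters ({full / BASE_PARAMS:.2f}x the base model)")
print(f"adapters:         {adapters:,} parameters ({adapters / BASE_PARAMS:.2f}x the base model)")
```

Under these assumptions, nine fully fine-tuned copies require nine times the base model's parameters, while the adapter setup stays below 1.2 times the base model.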
