Fine-tuning LLMs with PEFT and LoRA

Sam Witteveen · Beginner · 📄 Research Papers Explained · 3y ago

Key Takeaways

This video demonstrates fine-tuning large language models using PEFT (parameter-efficient fine-tuning) and LoRA (low-rank adaptation). Because only a small set of added weights is trained while the original weights stay frozen, these techniques help reduce catastrophic forgetting and can generalize well from a small amount of data. The video covers the basics of PEFT, including configuring LoRA adapters for fine-tuning and training with gradient accumulation.
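To make "fine-tuning only a small number of extra weights" concrete, here is a minimal NumPy sketch of one LoRA-adapted linear layer, following the update h = W0·x + (α/r)·B·A·x described in the LoRA paper. The layer size, rank r = 8, α = 16 and the initialization scale are arbitrary illustrative values, not the video's settings, and the sketch shows only the forward pass (no training loop).

```python
import numpy as np


class LoRALinear:
    """A frozen linear layer W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        # Pretrained weight: stays frozen during fine-tuning.
        self.W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # LoRA factors: A starts small and random, B starts at zero, so B @ A == 0
        # and the adapted layer initially behaves exactly like the original one.
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))
        self.scaling = alpha / r

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out). Only A and B would receive gradients.
        base = x @ self.W0.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W0.size

    def merged_weight(self):
        # After training, the low-rank update can be folded back into W0.
        return self.W0 + self.scaling * (self.B @ self.A)


layer = LoRALinear(d_in=1024, d_out=1024, r=8)
x = np.random.default_rng(1).standard_normal((2, 1024))

print(layer.frozen_params())     # 1048576
print(layer.trainable_params())  # 16384
print(np.allclose(layer(x), x @ layer.W0.T))  # True, since B starts at zero
```

With r = 8 the adapter holds about 1.6% as many parameters as the frozen 1024×1024 weight. Applied to only some layers of a multi-billion-parameter model, the trainable share is far smaller still, which is consistent with the megabyte-sized adapter checkpoints discussed in the video.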

Full Transcript

So what's the problem with training large language models and fine-tuning them? The key thing here is that we end up with really big weights, and this raises two main problems. One, you need a lot more compute to train. As the models get larger and larger, you find that you need much bigger GPUs, or multiple GPUs, just to be able to fine-tune some of these models. The second problem is that, in addition to needing the compute, the file sizes become huge. The T5 XXL checkpoint is around 40 GB in size, not to mention the 20-billion-parameter models that are coming out now, and they're getting bigger all the time.

This is where the idea of parameter-efficient fine-tuning comes in. I'm going to call it PEFT going forward. PEFT uses a variety of different techniques. The one we're going to look at today is LoRA, which stands for Low-Rank Adaptation, and it comes from a paper all about doing this for large language models. But PEFT also has some other cool techniques, like prefix tuning, P-tuning and prompt tuning, that we'll look at in the future, along with when to use them and how they can be really useful. These are some of the techniques that are actually being used by companies like NVIDIA to allow people to fine-tune these models in the cloud, so that's something really interesting to look at.

What PEFT does, and LoRA in particular, is allow you to fine-tune only a small number of extra weights in the model while you freeze most of the parameters of the pre-trained network. So we're not actually training the original weights; we're adding some extra weights and fine-tuning those. One of the advantages of this is that we still have the original weights, which tends to help with catastrophic forgetting. If you don't know, catastrophic forgetting is where a model forgets what it was originally trained on when you fine-tune it: if you fine-tune too much, you cause it to forget some of the things from its original training data. PEFT largely avoids that problem because it's just adding extra weights and tuning those while it freezes the original ones. PEFT also lets you get really good fine-tuning results when you've only got a small amount of data, and it allows the model to generalize better to other scenarios as well.

So all in all, this is a huge win for fine-tuning large language models, and even models like Stable Diffusion; a lot of the AI models we're seeing currently are starting to use this as well. One of the best things is that you end up with tiny checkpoints. In one of my recent videos I showed fine-tuning the LLaMA model to create the Alpaca model, and I think the final checkpoint for just the add-on part was around 12 megabytes. Now, you still need the original weights, so it's not like you're getting away from everything, but you've got something that's much smaller. In general, the PEFT approaches let you get performance similar to fine-tuning a full model just by tuning these add-on weights that you put into it.

Hugging Face has released a whole library around this. They've taken a number of papers and implemented them to work with the Transformers library and the Accelerate library. This lets us take off-the-shelf pre-trained models on Hugging Face, made by Google, by Meta and by a variety of other companies, and fine-tune them really well. So let's jump into the code and look at how to do a LoRA fine-tuning of a model.

All right, in this notebook we're going to look at fine-tuning a model using PEFT and bitsandbytes and producing a LoRA checkpoint. Remember, the idea with LoRA is that we're training adapters: we're not training the actual weights, we're adding weights to the model at various points and fine-tuning those to get the results we want.

First you install your libraries. I always like to set up the Hugging Face Hub early, because if you leave this running and it gets to the end of training, you want to save your model weights up to the Hugging Face Hub as quickly as possible so that your Colab doesn't stop and you lose all your work. So I tend to put this up front: click here and get your Hugging Face token. You'll obviously need a token to do this.

I've run this Colab on an A100, but you should be able to do it with a T4 if you change the model to a smaller version of BLOOM. The model I'm fine-tuning here is the BLOOM 7-billion-parameter model, and there's also a 760 million version, I think, and a 1.3 billion version, etc., that you could try out.

Now we load in the model. We're bringing in bitsandbytes, which is going to handle turning our model into 8-bit so that it doesn't take up so much GPU RAM; it also makes it easier to store things later on. From Transformers we've got AutoTokenizer and AutoModelForCausalLM. When we call from_pretrained, we pass in the name of the BLOOM 7B model, and all we have to do is pass in load_in_8bit=True, and Transformers will take care of the 8-bit conversion using the bitsandbytes library. If you're using a GPU at home, perhaps a 3090 or something like that, and you've got multiple GPUs, you can use a device_map to map parts of the model across them. In this case we're just using "auto", and I suggest you try auto at the start anyway.

So we've got our model and our tokenizer in. The next thing we want to do is freeze the original weights. You can see we're going through and freezing these weights with a few exceptions: the layer norm layers we want to keep, and we actually want to keep them in float32, and we also want to keep the outputs in float32. This is just some standard code for doing that.

Next up is setting up the actual LoRA adapters, and this all comes down to the config. Remember, up here we've got the full-size model, with no LoRA added to it yet. Here we make this config, pass in the model we had, and get back the PEFT model, which has the original model with the LoRA adapters on it. The config is key. You're setting the rank r of the LoRA update matrices, the alpha scaling, and, if you know them, the target modules in your model. I don't find a lot of documentation about target modules in the library at the moment, but my guess is that, going forward, people will work out which modules in a large language model are best to put LoRA adapters on. You also set the dropout for LoRA, and another key one is the task type: is it a causal language model, meaning a decoder-only, GPT-style model, or is it a seq2seq model, more like the T5 or Flan models? I'll perhaps make another video going through fine-tuning a seq2seq model so you can see the differences.

Playing around with these settings changes the number of trainable parameters quite a lot, so you can try out some different ideas here. But you'll see that we've got 7 billion parameters in total, while the trainable parameters are a really tiny fraction of that.

All right, for data I've picked a really simple little task. There's this dataset of English quotes. Rather than doing what most people seem to do, which is to use it to finish a quote (someone starts a quote and the model finishes it), I noticed when looking at the dataset that there are actually a bunch of tags for the quotes, and I thought it would be cool to make a model where you can input your own quote and it will generate tags for that quote. So I've merged some of the columns so that we've got the quote and then these three characters. Those three characters were chosen because they're probably not going to appear in that order very often in the pre-training data, so we're trying to teach the model that any time it sees these three characters, it should condition on the input before them and generate the tags after them.

Looking at the dataset we've made, we've got "Be yourself; everyone else is already taken." and then its tags: be yourself, honesty, inspirational, misattributed Oscar Wilde. Now, predicting whether a quote was misattributed to someone is probably not going to be easy for the model to learn, especially if you're making up the quotes, but certainly the key words in the quote should be appearing in the tags. You can see things like "So many books, so little time" tagged books and humor. That's a good one to try out; let me just take that and we can try it later on.

So we've got the data there, and we run it through the tokenizer to get the input IDs and attention masks.

Now we want to set up our training. It's just going to use the Hugging Face Transformers Trainer. We pass in the model, we pass in the train dataset, and then we pass in the arguments, so let's go through some of them. First, there's the per-device batch size and the gradient accumulation steps, and these are the things you'd change if you're trying to run on a smaller GPU. Here we do four examples per forward pass, and we accumulate the gradients from four of those passes before the weights are updated. Normally, if you were training this with a lot of GPUs, you would just use a batch size of 128 or a lot more; in the LLaMA paper they use batches of 4 million tokens because they're using so many GPUs. Unfortunately we don't have that budget. This is probably underutilizing the A100, and we could actually make the batches bigger, but we're doing four examples at a time, collecting those gradients and accumulating them for four steps, and that makes one batch, so it's the equivalent of a batch size of 16.

Next up we set the warm-up steps. We don't want to start with the learning rate at the full amount and shake everything around, so we start with the learning rate extremely low and build up to the learning rate we've set over a certain number of steps. Then we set the max steps. I've set it very small here, because this is more of a toy project to show you how to get something working. We're using 16-bit floating point (fp16), and we set the output directory where we'll be checking things. Then we just kick off training.

It will tell you how long it's going to train. In this case it trained very quickly, but for your particular task you might find it trains for a lot longer. Over time we can see that, sure enough, the loss is going down, so the model is starting to learn something. You could experiment with doing a lot more training than I've done here.

The next part is sharing this to the Hugging Face Hub. I've just put my Hugging Face Hub username, a slash, and then the model name, which in this case is the BLOOM 7B LoRA tagger. I can put in a commit message, and I can set it to be private or public; I'll make this checkpoint public afterwards so that you can play with it. This uploads just the LoRA weights, not the full BLOOM model plus the LoRA weights, so on the Hugging Face Hub this is going to be a tiny file, megabytes rather than gigabytes. In fact, you can see it's going to be 31 MB or so when it's fully uploaded.

Next, if you just want to do inference, this is how you would bring it in. You load the adapter, and it also brings in the full base model: it works out that it needs the BLOOM 7B model and its tokenizer, and it will go off and download those. Then you're left with a model you can run inference with. Here we pass in a quote followed by our magic three characters, and it predicts something.

You can see that, because I haven't trained it that long, it does seem to go into a loop. We could look at putting an end-of-sequence token or something like that into the data as well. But we can see that for "the world is your oyster" it's worked out the keywords world and oyster. Let's try the one I picked out earlier, "So many books, so little time." We could also change the max tokens here, etc. For "So many books, so little time" it's generated books, reading, time, reading, writing, time, writing, and it's gone again: you can see it's going into a sort of repeat mode. Training it a lot more would probably help.

Let's put in something else: "Training models with PEFT and LoRA is cool." Let's see what it picks out for that. You'll find that for some inputs it will pick out the keywords, but for others it will pick out other things. It's interesting: it's got training and teaching here, but it hasn't really worked out PEFT and LoRA, which is to be expected. You can also see that some of its previous training is still in there; it looks like there are some things related to training models that it's bouncing off. You'd want to train this for longer if you really wanted to use it as a model.

But this gives you a good example of how to fine-tune a bigger causal language model with PEFT and LoRA, and then use it for something you particularly want. It's very easy to play with your own dataset and put the whole thing together in here. As always, if there are any questions, please put them in the comments. If you found this useful, please click like and subscribe, and feel free to let me know what videos you would like to see going forward. Bye for now.
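The data preparation described in the transcript (quote, then a rare separator, then the tags) and the suggested fix for the repetition loop (an end-of-sequence marker in the training data) can be sketched in plain Python. The separator " ->: ", the "</s>" marker, the comma-joined tag format and the quote/tags field names are assumptions for illustration; the transcript does not give the exact characters or format used in the notebook.

```python
# Hypothetical values: the video does not spell out its three separator
# characters, and the real end-of-sequence marker comes from the tokenizer
# (e.g. tokenizer.eos_token).
SEPARATOR = " ->: "
EOS = "</s>"


def merge_columns(example):
    """Build one training string: quote, separator, comma-joined tags, EOS."""
    tags = ", ".join(example["tags"])
    return {"prediction": example["quote"] + SEPARATOR + tags + EOS}


def build_prompt(quote):
    """At inference time, end the prompt at the separator so the model writes tags."""
    return quote + SEPARATOR


def parse_tags(generated_text):
    """Recover tags from generated text, cutting at EOS and dropping repeats."""
    completion = generated_text.split(SEPARATOR, 1)[-1]
    completion = completion.split(EOS, 1)[0]
    tags = []
    for tag in completion.split(","):
        tag = tag.strip()
        # A model stuck in a loop repeats tags; keep only the first occurrence.
        if tag and tag not in tags:
            tags.append(tag)
    return tags


example = {"quote": "So many books, so little time.", "tags": ["books", "humor"]}
print(merge_columns(example)["prediction"])
# So many books, so little time. ->: books, humor</s>
print(parse_tags(build_prompt(example["quote"]) + "books, reading, time, reading, writing"))
# ['books', 'reading', 'time', 'writing']
```

Because the separator is split off first, commas inside the quote itself never leak into the parsed tags; and once EOS appears in every training example, the model has a learned place to stop instead of looping.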

Original Description

LoRA Colab: https://colab.research.google.com/drive/14xo6sj4dARk8lXZbOifHEn1f_70qNAwy?usp=sharing
Blog Post: https://huggingface.co/blog/peft
LoRA Paper: https://arxiv.org/abs/2106.09685

In this video I look at how to use PEFT to fine-tune any decoder-style GPT model. This goes through the basics of LoRA fine-tuning and how to upload the result to the Hugging Face Hub.

For more tutorials on using LLMs and building Agents, check out my Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen

My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/
Github: https://github.com/samwit/langchain-tutorials
https://github.com/samwit/llm-tutorials

00:00 - Intro
00:04 - Problems with fine-tuning
00:48 - Introducing PEFT
01:11 - PEFT other cool techniques
01:51 - LoRA Diagram
03:25 - Hugging Face PEFT Library
04:06 - Code Walkthrough

This video teaches how to fine-tune large language models using PEFT and LoRA, covering 8-bit model loading with bitsandbytes, gradient accumulation, and uploading the trained adapter to the Hugging Face Hub. The video is practical and hands-on, with a step-by-step code walkthrough in a Colab notebook.
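The batch-size arithmetic and warm-up idea described in the video can be illustrated on a toy problem. The sketch below uses a small linear regression in NumPy (a stand-in, not a language model) to check that averaging gradients over four passes of four examples gives the same gradient as one batch of 16, and shows a linear warm-up ramp. The linear shape of the ramp is an assumption; in the Hugging Face Trainer, the default schedule warms up linearly and then decays linearly.

```python
import numpy as np


def grad_mse(w, X, y):
    """Gradient of the mean squared error of a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def accumulated_grad(w, X, y, micro_batch_size, accumulation_steps):
    """Sum gradients from several small passes before a single optimizer step."""
    total = np.zeros_like(w)
    for step in range(accumulation_steps):
        rows = slice(step * micro_batch_size, (step + 1) * micro_batch_size)
        # Scaling each micro-batch by 1 / accumulation_steps makes the sum equal
        # to the gradient of one batch of micro_batch_size * accumulation_steps.
        total += grad_mse(w, X[rows], y[rows]) / accumulation_steps
    return total


def warmup_lr(step, base_lr, warmup_steps):
    """Linear warm-up from 0 to base_lr; any later decay is omitted here."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


rng = np.random.default_rng(0)
X, y, w = rng.standard_normal((16, 3)), rng.standard_normal(16), rng.standard_normal(3)

micro_batch, accum_steps = 4, 4          # the settings described in the video
print(micro_batch * accum_steps)         # 16 examples per optimizer step
print(np.allclose(accumulated_grad(w, X, y, micro_batch, accum_steps), grad_mse(w, X, y)))  # True
print([warmup_lr(s, 1.0, 4) for s in range(6)])  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
```

The trade-off is time rather than memory: only four examples ever sit on the GPU at once, but each optimizer step now costs four forward and backward passes.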

Key Takeaways
  1. Install necessary libraries
  2. Log in to the Hugging Face Hub
  3. Load in a pre-trained model
  4. Load the model in 8-bit using bitsandbytes
  5. Freeze the original weights
  6. Set up LoRA adapters with config
  7. Merge columns to create a new dataset
  8. Choose a separator sequence to condition on for generating tags
  9. Run data through to get input IDs and attention masks
  10. Set up training with gradient accumulation steps
💡 Using PEFT and LoRA for fine-tuning large language models can help reduce catastrophic forgetting and achieve good generalization with a small amount of data.
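Putting the listed steps together, the whole pipeline might look roughly like the untested outline below, built on the Hugging Face transformers, peft, datasets and bitsandbytes libraries. It deliberately differs from the video in two places: it passes a `BitsAndBytesConfig` instead of the older `load_in_8bit=True` argument, and it calls `peft.prepare_model_for_kbit_training` in place of the hand-written freezing loop. The dataset id, LoRA hyperparameters, step counts, separator and Hub repo name are illustrative assumptions, not values confirmed by the video, and running it requires a CUDA GPU with enough memory for the chosen BLOOM checkpoint.

```python
# Sketch only: needs a CUDA GPU plus the transformers, peft, datasets,
# accelerate and bitsandbytes packages. Heavy imports live inside the
# functions so the settings below can be inspected without them.
MODEL_NAME = "bigscience/bloom-7b1"           # smaller BLOOM checkpoints suit smaller GPUs
HUB_REPO = "your-username/bloom-lora-tagger"  # hypothetical Hub repo id
DATASET = "Abirate/english_quotes"            # assumed id of the English quotes dataset
SEPARATOR = " ->: "                           # assumed separator between quote and tags

LORA_KWARGS = dict(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,                       # update is scaled by lora_alpha / r
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",               # decoder-only; "SEQ_2_SEQ_LM" for T5-style models
)

TRAIN_KWARGS = dict(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch of 16 examples
    warmup_steps=100,
    max_steps=200,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
)


def build_model():
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    # Freezes the base weights, upcasts non-quantized weights such as layer
    # norms to float32, and enables gradient checkpointing.
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**LORA_KWARGS))
    model.print_trainable_parameters()
    return model, tokenizer


def train(model, tokenizer):
    from datasets import load_dataset
    from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

    data = load_dataset(DATASET)
    data = data.map(lambda ex: {
        "prediction": ex["quote"] + SEPARATOR + ", ".join(ex["tags"]) + tokenizer.eos_token
    })
    data = data.map(lambda batch: tokenizer(batch["prediction"]), batched=True)

    trainer = Trainer(
        model=model,
        train_dataset=data["train"],
        args=TrainingArguments(**TRAIN_KWARGS),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False  # the KV cache is not needed during training
    trainer.train()
    # Uploads only the adapter weights, not the full base model.
    model.push_to_hub(HUB_REPO, commit_message="LoRA tagger adapter", private=True)


def generate_tags(quote, max_new_tokens=50):
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Reads the adapter config, downloads the base model it points to, then
    # attaches the LoRA weights on top.
    model = AutoPeftModelForCausalLM.from_pretrained(
        HUB_REPO, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    batch = tokenizer(quote + SEPARATOR, return_tensors="pt").to(model.device)
    output = model.generate(**batch, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A typical run would call `model, tokenizer = build_model()`, then `train(model, tokenizer)`, and finally `generate_tags("So many books, so little time.")` once the adapter is on the Hub. Keeping every hyperparameter in `LORA_KWARGS` and `TRAIN_KWARGS` makes it easy to try the smaller-GPU variations mentioned in the video, such as a smaller BLOOM model or a smaller per-device batch with more accumulation steps.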

Chapters (7)

0:00 Intro
0:04 Problems with fine-tuning
0:48 Introducing PEFT
1:11 PEFT other cool techniques
1:51 LoRA Diagram
3:25 Hugging Face PEFT Library
4:06 Code Walkthrough