Fine-tuning LLMs with PEFT and LoRA

Sam Witteveen · Beginner · 📄 Research Papers Explained · 3y ago


Key Takeaways

This video demonstrates fine-tuning large language models using PEFT and LoRA, showcasing techniques to prevent catastrophic forgetting and achieve good generalization with a small amount of data. The video covers the basics of PEFT, including using LoRA for fine-tuning and training models with gradient accumulation steps.
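
As a rough illustration of the gradient accumulation mentioned above: in the video, the notebook processes four examples per pass and accumulates gradients over four passes, which is described as equivalent to a batch of 16. The sketch below checks that equivalence on a toy one-parameter model. The model, the data, and every name in it are assumptions made for illustration; none of it comes from the notebook.

```python
# Gradient accumulation sketch: 4 micro-batches of 4 examples give the same
# gradient as one batch of 16. The toy model (y_hat = w * x), the data, and
# all names here are illustrative assumptions.

MICRO_BATCH_SIZE = 4      # examples per forward/backward pass
ACCUMULATION_STEPS = 4    # passes accumulated before one optimizer step
EFFECTIVE_BATCH_SIZE = MICRO_BATCH_SIZE * ACCUMULATION_STEPS


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [float(i) for i in range(1, EFFECTIVE_BATCH_SIZE + 1)]
ys = [3.0 * x for x in xs]

# Gradient computed from one large batch of 16 examples.
full_grad = mse_grad(w, xs, ys)

# The same gradient built from 4 micro-batches. Each micro-batch gradient is
# divided by ACCUMULATION_STEPS so the accumulated sum is a mean over all 16.
accumulated_grad = 0.0
for step in range(ACCUMULATION_STEPS):
    start = step * MICRO_BATCH_SIZE
    chunk_x = xs[start:start + MICRO_BATCH_SIZE]
    chunk_y = ys[start:start + MICRO_BATCH_SIZE]
    accumulated_grad += mse_grad(w, chunk_x, chunk_y) / ACCUMULATION_STEPS

print("effective batch size:", EFFECTIVE_BATCH_SIZE)
print("gradients match:", abs(full_grad - accumulated_grad) < 1e-9)
```

The point of the trade is memory: only four examples need to fit on the GPU at once, at the cost of four passes per weight update.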

Full Transcript

So what's the problem with training large language models and fine-tuning them? The key thing is that we end up with really big weights, and that raises two main problems. First, you need a lot more compute to train: as the models get larger and larger, you need much bigger GPUs, or multiple GPUs, just to be able to fine-tune some of these models. Second, in addition to needing the compute, the file sizes become huge. The T5 XXL checkpoint is around 40 GB in size, not to mention the 20-billion-parameter models that are coming out now, which are getting bigger all the time.

This is where the idea of parameter-efficient fine-tuning comes in. I'm just going to refer to it as PEFT going forward. PEFT uses a variety of different techniques. The one we're going to look at today is LoRA, which stands for Low-Rank Adaptation, and it comes from a paper all about doing this for large language models. PEFT also has some other cool techniques, like prefix tuning, P-tuning, and prompt tuning, that we'll look at in the future, along with when to use them and how they can be really useful. Some of these techniques are actually being used by companies like Nvidia to allow people to fine-tune models in the cloud, so that's something really interesting to look at.

What PEFT does, and LoRA in particular, is allow you to fine-tune only a small number of extra weights in the model while you freeze most of the parameters of the pre-trained network. So we're not actually training the original weights; we're adding some extra weights and fine-tuning those. One advantage is that we still have the original weights, which tends to help with stopping catastrophic forgetting. If you don't know, catastrophic forgetting is where models tend to forget what they were originally trained on when you fine-tune them: if you do the fine-tuning too much, you end up causing the model to forget some of the things from the original data it was trained on. PEFT largely avoids that problem because it just adds extra weights and tunes those while it freezes the original ones.

PEFT also allows you to get really good fine-tuning when you've only got a small amount of data, and it allows the model to generalize better to other scenarios as well. So all in all, this is a huge win for fine-tuning large language models, and even models like Stable Diffusion; a lot of the AI models we're seeing currently are starting to use this as well.

One of the best things is that you end up with just tiny checkpoints. In one of my recent videos I showed fine-tuning the LLaMA model to create the Alpaca model, and I think the final checkpoint for just the add-on part was something around 12 megabytes, so it's tiny. You still need the original weights, so it's not like you're getting away from everything totally, but you've got something that's much smaller. In general, the PEFT approaches allow you to get similar performance to fine-tuning a full model just by tuning these add-on weights that you put into it.

Hugging Face has released a whole library around this. They've taken a number of papers and implemented them to work with the Transformers library and the Accelerate library. This allows us to take off-the-shelf Hugging Face pre-trained models, from Google, from Meta, from a variety of different companies, and fine-tune them really well with this approach. So we're going to jump into the code and look at how to use it to do a LoRA fine-tuning of a model.

All right, in this notebook we're going to go through fine-tuning a model using PEFT and bitsandbytes, and producing a LoRA checkpoint. If you remember, the idea with LoRA is that we're training adapters: we're not training the actual weights, we're adding weights at various points in the model and fine-tuning those to get the results we want.

First you install your libraries. I always like to set up the Hugging Face Hub early, because if you leave this running and it gets to the end of training, you want to save your weights up to the Hugging Face Hub as quickly as possible, so that your Colab doesn't stop and lose all your work. That's why I tend to put this up front. Just click here to get your Hugging Face token; you'll obviously need a token to do this.

I've run this Colab on an A100, but you should be able to do it with a T4 if you change the model to a smaller version of the BLOOM model. The model I'm fine-tuning here is the BLOOM 7-billion-parameter model; there's also a 760M version, I think, and a 1.3 billion version, and so on, that you could try out.

Now we're loading in the model. We're bringing in bitsandbytes, which handles turning our model into 8-bit so that it won't take up so much GPU RAM; that makes things easier and quicker, and makes it easier to store things later on too. From Transformers we've got our AutoTokenizer and AutoModelForCausalLM. When we call from_pretrained, we pass in the name for the BLOOM 7-billion model, and all we have to do is pass in load_in_8bit=True; Transformers will take care of the 8-bit conversion using the bitsandbytes library.

If you're using a GPU at home, where you've perhaps got a 3090 or something like that, and you want to try it there, or if you've got multiple GPUs, you can set a device map to map parts of the model across them. In this case we're just using auto, and I suggest you try auto at the start anyway.

So we've got our model and our tokenizer. The next thing we want to do is freeze the original weights. You can see here that we're just going through and freezing these weights, with a few exceptions: the layer norm layers we want to keep in float32, and the outputs we also want to keep as float32. This is just some standard code for doing that.

Next up is setting up the actual LoRA adapters, and this all comes down to the config. Remember, up here we've got our model, the full-size model, but there's no LoRA added to it yet. We make this config, pass it in along with the model we had, and get the PEFT model, which has the original model with the LoRA adapters on it.

The config is key. You're setting what's described here as the number of attention heads (the r value, which in the LoRA paper is the rank of the update matrices) and the alpha scaling. If you know that your model has certain target modules, you can set those; I don't find a lot of documentation about this in the library at the moment, but my guess is that going forward people will work out which modules in a large model are the best ones to put LoRA adapters on. You also set your dropout for LoRA, and another key one is the task type: is it a causal language model, meaning a decoder-only, GPT-style model, or is it a seq2seq model, more like the T5 models, the Flan models, and so on? I'll perhaps make another video on fine-tuning a seq2seq model so you can see the differences.

Playing around with these two settings up here determines the number of trainable parameters quite a lot, so you can try out some different ideas. You'll see that we've got around 7 billion parameters in total, but the trainable parameters are just a tiny fraction of that.

For data, I've just picked a really simple little task. There's this dataset of English quotes. Most people seem to use it to finish a quote, so that if someone starts a quote the model can complete it. Looking at the dataset, I saw that there are actually a bunch of tags for the quotes, and I thought it would be cool to make a model where you can input your own quote and it generates tags for that quote.

So what I've done is merge some of the columns, so that we've got the quote and then these three characters. Those three characters are chosen because they're probably not going to appear in that order very often in the pre-training data. We're trying to teach the model that any time it sees these three characters, it should condition on the input before them and generate the tags after them.

Looking at the dataset we've made, we've got "Be yourself; everyone else is already taken" and then the tags: be yourself, honesty, inspirational, misattributed to Oscar Wilde, these kinds of things. Now, being able to predict whether a quote was misattributed to someone is probably not going to be easy for the model to learn, especially if you're making up the quotes. But certainly some of the key words in the quote should be appearing in the tags, as you see here with things like "So many books, so little time": books, humor. That's a good one to try out, so let me take that and we can try it later on.

So we've got the data there, and we're just running it through to get the input IDs and the attention masks.

Now we want to set up our training. The training just uses the Hugging Face Transformers Trainer. We pass in the model, then the train dataset, and then the arguments. Let's go through some of them.

The first ones are the batch size and the gradient accumulation steps, and these are the things you would change if you're trying to run on a smaller GPU. Here we're going to do four examples per forward pass, and we're going to do four of those passes before we apply the accumulated gradients. Normally, if you were training this with a lot of GPUs, you would just use a batch size of 128 or a lot more. In the LLaMA paper they're using batch sizes of 4 million tokens; they're using so many GPUs. Unfortunately we don't have that budget. What I'm trying to show you here, and this is probably underutilizing the A100, since we could actually make the batches bigger, is that we do four examples at a time, collect those gradients, accumulate them for four steps, and that makes one batch. So it's the equivalent of doing a batch of 16.

Next up we set the warm-up steps. We don't want to just go in and start with the learning rate at the full amount and shake everything around, so we start with the learning rate extremely low and then build up to the learning rate that we've set, which takes a certain number of steps. Then we can set the max steps; here I've set it very small, since this is more of a toy project to show you getting something running. We're using 16-bit floating point, we've got the outputs directory where we're going to be checkpointing things, and then we just kick off our training.

You can see it tells us how long it's going to train. In this case it trained very quickly, but you might find that yours trains for a lot longer. And we can see over time that, sure enough, our loss is going down, so the model is starting to learn something. You could go through and experiment with doing a lot more training than what I've done here.

The next part is sharing this on the Hugging Face Hub. Here I've just put my Hugging Face Hub username, a slash, then the model name I'm going to call it; I've called this one the BLOOM 7-billion LoRA tagger. I can put some info in for the commit message, and I can set it to be private or public. I will come back and make this checkpoint public afterwards so that you can play with it. That will then upload it, and it's only going to upload the LoRA weights; it's not uploading the full BLOOM model plus the LoRA weights. So you'll find that on the Hugging Face Hub this is a tiny file: we're talking megabytes here, not gigabytes. In fact, you can see that this is going to be 31 MB or so when it's fully uploaded.

Next, if you just want to do inference, this is how you would bring it in. You load this in, and it puts together the adapter you've trained but also brings in the actual full model as well. It works out from the config that it needs the BLOOM 7-billion model and the tokenizer for it, and it goes off and downloads those.

Finally, you can do some inference. Here we're passing in a quote followed by our magic three characters, and then it's going to predict something. You can see that I haven't trained it that long, so it does seem to go into a loop; we could even look at putting an end-of-sentence tag or something like that into the data as well. But we can see that for "the world is your oyster" it's worked out the keywords "world" and "oyster".

Let's try the one I picked out earlier, "So many books, so little time". We could obviously change the max tokens and so on here. For "So many books, so little time" it's generated books, reading, time, reading, writing, time, writing, and again you can see that it's going into repeat mode. It would probably help if we trained this a lot more.

Let's put in just something: "Training models with PEFT and LoRA is cool". Let's see what it picks out for that. You'll find that for some inputs it will obviously pick out keywords, but for some it will pick out other things too. Now that's interesting: it's got training and teaching here, but it hasn't really worked out PEFT and LoRA, which is to be expected. You can see that it's got some of its previous training still in there; it looks like there are some things related to training models that it's bouncing off. You'd want to train this for longer if you really wanted to use it as a model.

But this gives you a good example of how to fine-tune a bigger causal language model with PEFT and LoRA, which you can then use for something you particularly want. It's very easy to play with your dataset and put the whole thing together. As always, if there are any questions, please put them in the comments. If you found this useful, please click like and subscribe, and feel free to let me know what videos you would like to see going forward. Bye for now.
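
The adapter idea described in the transcript (frozen original weights plus a small set of trainable add-on weights) can be sketched in plain Python. This is a minimal illustration of the low-rank update from the LoRA paper, not the PEFT library's implementation. The layer sizes, the random values, and all function names are assumptions chosen for the example.

```python
# LoRA sketch in plain Python: a frozen weight matrix W plus a trainable
# low-rank update B @ A, scaled by alpha / r. Sizes, values, and names are
# illustrative assumptions, not the settings used in the video.
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    """Return (frozen, trainable) parameter counts for one adapted layer."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 8, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero

x = [1.0] * d_in
# With B initialised to zero, the adapted layer starts out identical to the
# frozen layer; training then moves only A and B.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))

frozen, trainable = lora_param_counts(4096, 4096, 16)
print(f"frozen={frozen:,} trainable={trainable:,} ({100 * trainable / frozen:.2f}%)")
```

Because only A and B are saved, the checkpoint size scales with r * (d_in + d_out) per adapted layer rather than d_in * d_out, which is why the uploaded adapter is megabytes rather than gigabytes.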

Original Description

LoRA Colab: https://colab.research.google.com/drive/14xo6sj4dARk8lXZbOifHEn1f_70qNAwy?usp=sharing
Blog Post: https://huggingface.co/blog/peft
LoRA Paper: https://arxiv.org/abs/2106.09685

In this video I look at how to use PEFT to fine-tune any decoder-style GPT model. This goes through the basics of LoRA fine-tuning and how to upload it to the Hugging Face Hub.

For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen

My Links:
LinkedIn: https://www.linkedin.com/in/samwitteveen/
GitHub: https://github.com/samwit/langchain-tutorials
https://github.com/samwit/llm-tutorials

00:00 - Intro
00:04 - Problems with fine-tuning
00:48 - Introducing PEFT
01:11 - PEFT other cool techniques
01:51 - LoRA Diagram
03:25 - Hugging Face PEFT Library
04:06 - Code Walkthrough



This video teaches how to fine-tune large language models using PEFT and LoRA, covering techniques such as 8-bit conversion, gradient accumulation steps, and uploading models to Hugging Face Hub. The video is practical and hands-on, with code examples and step-by-step instructions.
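
To give a feel for why the 8-bit conversion mentioned above matters, here is a small sketch of the memory arithmetic together with a simplified absmax quantization round trip. This is an assumption-laden illustration only: bitsandbytes uses a more involved scheme than the single-scale version shown here, and the function names and sample values are made up for the example.

```python
# 8-bit sketch: memory arithmetic plus a simplified absmax int8 round trip.
# This is NOT how bitsandbytes is implemented; the names and values are
# illustrative assumptions.


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def absmax_quantize(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


params = 7_000_000_000  # roughly the size of the model used in the video
for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_memory_gb(params, nbytes):.0f} GB")

weights = [0.5, -1.27, 0.02, 1.0]
quantized, scale = absmax_quantize(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is the precision given up in exchange for storing each weight in one byte.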

Key Takeaways

Install necessary libraries
Set up Hugging Face Hub
Load in a pre-trained model
Convert model to 8-bit using bitsandbytes
Freeze original weights
Set up LoRA adapters with config
Merge columns to create a new dataset
Choose characters to condition on for generating tags
Run data through to get input IDs and attention masks
Set up training with gradient accumulation steps
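
The column-merging and separator steps in the list above can be sketched as follows. The transcript does not name the three separator characters, so the `" ->: "` string here is a placeholder assumption, as are the function names; the sample quote and tags are the ones read out in the video.

```python
# Data formatting sketch: merge a quote and its tags into one training string
# joined by a rarely seen separator. SEPARATOR is an assumed placeholder; the
# transcript does not say which three characters were used.

SEPARATOR = " ->: "


def build_training_example(quote, tags):
    """Training text: the quote, the separator, then the tags to be learned."""
    return f"{quote}{SEPARATOR}{tags}"


def build_inference_prompt(quote):
    """Inference prompt: stop after the separator so the model generates tags."""
    return f"{quote}{SEPARATOR}"


example = build_training_example("So many books, so little time.", ["books", "humor"])
prompt = build_inference_prompt("So many books, so little time.")

print(example)
print(prompt)
```

Using the same separator at training and inference time is what lets the model treat everything before it as the input and everything after it as the tags to produce.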

💡 Using PEFT and LoRA for fine-tuning large language models can help prevent catastrophic forgetting and achieve good generalization with a small amount of data.


Chapters (7)

0:00 Intro

0:04 Problems with fine-tuning

0:48 Introducing PEFT

1:11 PEFT other cool techniques

1:51 LoRA Diagram

3:25 Hugging Face PEFT Library

4:06 Code Walkthrough
