A Custom Fine-Tuned LLM in Action (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 5)
Key Takeaways
This video, Part 5 of the series, walks through fine-tuning a custom LLM: a QLoRA fine-tune of the 7-billion-parameter Llama 2 model using Hugging Face APIs, adapted from a Google Colab notebook and trained on the Dolly 15k instruction-tuning dataset. It contrasts fine-tuning with full pre-training and with retrieval-augmented generation (RAG), and notes that fine-tuning can also be combined with RAG. The training script runs inside a Metaflow workflow that uses the @kubernetes and @pypi decorators, pushes the trained model to S3, and then serves it on a Triton inference server.
Full Transcript
Let's do it, man. Let's just go quickly through the slides; they tell the story a little bit, and then we can pop into the fine-tuning code and look at more Hugging Face APIs. And this is the last part of the workshop as well. Yes, and this part is going to be a little hands-off. We'll do full-on fine-tuning workshops in the future; this one is very hand-wavy, to be honest.
Cool, so: a fully custom LLM. Now we're changing the dynamic of this picture by making everything dark purple. What we mean is that if you have enough relevant data and enough compute capacity, you can make your own model and do the pre-training part yourself. The idea is that, hopefully, if you had enough relevant data within the distribution you care about, the one your users will be in when they interact with the model and your product, that would produce the highest-quality, most relevant responses. In reality, there are only a few organizations in the world that can get that much data and train a model that size, and they're training on datasets that are like the entire internet. So it's kind of suspect, in my opinion, to say that that's going to be super relevant to anyone when it's that general. It's almost a built-in tradeoff.
Now there's this other world of fine-tuning, where you have a dataset, and maybe it's quite a bit bigger, five to ten times bigger, maybe even a couple of orders of magnitude bigger, than the kind of datasets we're talking about with RAG. Or maybe not; for some models you don't need that much data to do fine-tuning. It's really a case-by-case basis that's worth studying as a data scientist, if you're into this stuff these days. But what you can do is actually keep training the LLM. So you're not pre-training; it's a variation on transfer learning.
You're taking a fork of the model's state and saying, okay, now here's a new dataset. Maybe it has a slightly different shape, or there's this trend called instruction tuning, where you format the labeled examples that you give the LLM very specifically. Then you use that to continue training the model, so you're actually updating the weights, the parameters of the neural network. Of course, you could also combine this with RAG; there's a lot of interesting research around this. I've even seen RAG being combined into pre-training, in a recent project from... so there's a whole bunch of different combinations, is the point. But the core idea of what we're talking about here is more this case of fine-tuning, where I actually want to change this blob and make it more dark purple by training on my data, training on a specific shape of data that I think is relevant to our apps and use cases, things like this.
So how do we actually do this? There are a ton of good examples on the internet these days, but one way is you can define your model scripts. This is not too different from a Google Colab notebook I found on the internet, just with slight refactoring and a few extra optimizations added. It should be runnable on a single V100, if you have a V100 GPU. And what it's doing is fine-tuning a QLoRA of the seven-billion-parameter Llama 2 model. So that was a bunch of jargon. I think maybe we should talk about Llama a little bit. Hugo, do you want to say a few things about the Llama model family, or anything that comes to mind there?
Well, yeah. A short introduction to everything you just said: Llama is a family of models that was open-sourced by Meta AI. Pretty hefty, big models. I mean, there's a range, but did you say 7B? Yeah, this one's the small one; the 7B is the... So, but it's still big, right? And then what LoRA does,
it's a form of dimensionality reduction. It uses linear algebra, essentially, to reduce the size of it in very clever ways. The Q is "quantized", which is a technical detail I don't think we need to get into at the moment. But that's my very TL;DR on what QLoRA on Llama 2 7B would mean. Is there anything else you'd add to that?
No, I think that's great. So in addition to adding a little bit of stuff from the Colab notebook I found, like the framework, but also... I'm sorry. Yeah, to your point, though: even though we have a small model and we're reducing it even more, we still need GPUs, right? Right. This one's not very feasible to run on a CPU. It would take way too long for the model to learn anything for that to be worth your time, or honestly it might even be more expensive, because you need the computer to be running longer.
Okay, anyways. Basically what's going on here is we're refactoring our custom dataset. We're pulling this nice instruction-tuning dataset that the Databricks team put together, called Dolly 15k, and we're packaging it in a slightly different way. This is what I meant when I was saying you can change the shape of how the model continues its responses by fine-tuning. What you can do is unpack the data in a specific format that matches how you would want the model to respond at runtime, and then you can curate your fine-tuning dataset to be in a shape like this, and it directs the model about how to respond to things. This could be another good use case: if I want this specific structure only as the output, you just make a really good labeled instruction-tuning dataset, and this could work.
What else is interesting here? We see the same API for importing our model, just like we were doing
before in the tinier-model use case, right in your sandbox notebook with no GPUs or anything. It's the same thing: use this from_pretrained API. We have our quantization config, and then we're just picking a specific flavor of the Llama model. A few other Hugging Face bells and whistles here: of course there's the tokenizer piece, and then the API we're choosing to use, the highest-level API for model training in Hugging Face, is this object called the Trainer, which takes in another object called TrainingArguments. This is where we specify a lot of stuff, like how to use the GPUs and in what way you accumulate gradients. You pass it an optimizer, so you can get into the more PyTorch layer as well in the training arguments. There's a lot of stuff going on here, and then it's a simple trainer.train().
Now the interesting thing, story-wise (we won't really go too much into the code), is that we're running this whole script inside of a Metaflow workflow. One of the reasons we're doing that is because then we can take this model after it's been trained, zip up the whole resulting QLoRA, and push it to S3. Metaflow makes it very easy to wire this up when we're running it inside of our workflow. Specifically, look at how few lines of code are inside our Metaflow structure, where we're using the @kubernetes decorator and the @pypi decorator that Hugo mentioned earlier. And yeah, we're just making it all happen so that we can then unpack that model on the server and actually run this on a Triton inference server, in this case, which is what the example shows here. That's a super high-level introduction to the fine-tuning and serving aspect of this workshop. Hugo, I think we should probably do a full workshop on each of those pieces independently at some point.
I was just going to say that. And the reason we've
decoupled it like that is that all the code we've run so far you've been able to do in the sandbox, with relatively limited compute and time, whereas for this you'd need pretty significant hardware as well. So that's why we wanted to do it this way. But hey, it looks like we're going back to the Argo Workflows.
Yeah, quickly, before my battery actually dies: we can see that this first workflow ran. We can go back and see that our second workflow also ran to completion, again automatically triggered, and our third workflow is now in flight. It finished the start process, and right now it's doing the heavy compute step, which is actually computing the embeddings on all the Metaflow documentation, and then it's going to push those into the Pinecone index. So that's pretty much the story, and we're out of slides and out of code in the sandbox so far. Any thoughts, Hugo, to wrap up the session?
Look, I'd actually just like to go back. I'm going to share my screen again and go back to our... can you see my slides? Our slides, yeah. I'm going to quickly move and make sure I can go get a charger before my thing dies as you're doing this. Sure, sure.
What I wanted to do is just go back to remind you of everything we've done and, if you coded along, everything you did, for which everyone should be super congratulated. You've learned and played around with hitting LLM APIs. You've gone through and thought through with us how you can be more flexible with open-source software models. Thinking about RAG [inaudible], and thinking about better relevancy there, going through the Metaflow docs, which was super fun. Then productionizing RAG, and then also going through (this is the code we didn't execute) fine-tuning a custom LLM. So all in just over a couple of hours, and if you did this at 1.5x, under a couple of hours, you've
managed to learn a huge amount about the space. As I said, in the description we'll put the link to our Slack community, where we love chatting about all of this type of stuff. We very much welcome feedback. Anything else you'd like to see in the sandbox, we'd love to show you and produce at some point as well. But I think with that, that's a really nice note to wrap up on. Eddie, what do you think?
Sounds great. It was super fun, Hugo. We should definitely do this more often.
That was an incredible amount of fun. Thank you all for watching and sticking around until the end, and we'll see you on Slack. All right, ciao. Yeah.
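The dataset-reshaping step described in the transcript (packaging Dolly 15k records into the shape you want the model to respond in) can be illustrated with a minimal sketch. This is not the workshop's code. The field names instruction, context, and response follow the published Dolly 15k records as far as I know; the prompt template, the function name format_example, and the sample records are assumptions made up for illustration.

```python
"""Sketch: reshape instruction-tuning records into single training strings.

Assumptions (not from the talk): the prompt template, the helper name
format_example, and the sample records below are illustrative only.
"""

# Hypothetical templates; the workshop's actual prompt layout may differ.
PROMPT_WITH_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n{response}"
)
PROMPT_NO_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def format_example(record: dict) -> str:
    """Turn one record (instruction, optional context, response) into one string."""
    context = (record.get("context") or "").strip()
    # Records without context use the shorter template.
    template = PROMPT_WITH_CONTEXT if context else PROMPT_NO_CONTEXT
    # str.format ignores unused keyword arguments, so passing context is safe.
    return template.format(
        instruction=record["instruction"].strip(),
        context=context,
        response=record["response"].strip(),
    )


# Made-up records in the same shape as the dataset rows.
SAMPLE_RECORDS = [
    {
        "instruction": "What is Metaflow?",
        "context": "",
        "response": "A framework for building ML workflows.",
    },
    {
        "instruction": "Summarize the passage.",
        "context": "Llama 2 is a family of open models released by Meta.",
        "response": "Llama 2 is Meta's open model family.",
    },
]

if __name__ == "__main__":
    for record in SAMPLE_RECORDS:
        print(format_example(record))
        print("-" * 40)
```

Each formatted string would then be tokenized and used as one fine-tuning example. Keeping the template in one place makes it easier to use the same structure when prompting the model at inference time.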
Original Description
This is a six-video series: an interactive guided tour of LLMs, RAG, & Fine-Tuning.
The playlist is here: https://youtube.com/playlist?list=PLUsOvkBBnJBcZglk6QQyKGZsgEzClGnv-&si=66stnfv3-HXa60m9
You can also watch the full workshop here: https://youtu.be/uDBGwQ7JAzQ
In this workshop, attendees will learn about methods for working with LLMs. Our stories will be guided by examples you can run on your laptop or in a (free) hosted cloud environment provided to attendees. Developers will expand their awareness of how researchers and product designers are working with LLMs, with emphasis on connecting high-level concepts such as fine-tuning and vector databases to the fundamental math and APIs data scientists should understand. Business-minded executives can either get hands-on or follow the higher-level stories to deepen their sense of what is possible with LLMs, the technicalities behind risks they introduce, and how they fit into the arc of ML. The primary value of this workshop will be as a guide to help teams set reasonable goals in the complex and fast-moving world of LLMs, and understand what you need to successfully support your team’s next LLM projects.
What You’ll Learn:
There are cheap (e.g., APIs) and expensive (e.g., fine-tuning, training) ways to build on top of LLMs. The methods you choose have consequences in apps you can build and how your dev team works. We will learn how to think about these choices as we develop basic apps you can use as templates for future genAI projects. Learners have the option to follow along in a provided dev environment where we will unpack these choices and make the tradeoffs and decision space concrete.
The GitHub repository is here: https://github.com/outerbounds/generative-ai-summit-austin-2023