A Custom Fine-Tuned LLM in Action (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 5)

Outerbounds · Beginner ·🔍 RAG & Vector Search ·2y ago

Key Takeaways

This video, the last part of the workshop, walks through fine-tuning a custom LLM: a QLoRA fine-tune of the 7-billion-parameter Llama 2 model using Hugging Face APIs, adapted from a Google Colab notebook. It also notes that fine-tuning can be combined with retrieval-augmented generation (RAG). The training script runs inside a Metaflow workflow that uses the @kubernetes and @pypi decorators, and the trained model is then served on a Triton inference server.

Full Transcript

Let's do it, man. Let's just go quickly through the slides; they tell the story a little bit, and then we can pop into the fine-tuning code and look at more Hugging Face APIs.

And this is the last part of the workshop as well.

Yes, yes. And this part is going to be a little hands-off. We'll do full-on fine-tuning workshops in the future, whereas this is very hand-wavy, to be honest. Cool, so: fully custom LLM. Now we're changing the dynamic of this picture by making everything dark purple. What we mean is, basically, if you have enough relevant data and enough compute capacity, you can make your own model and just do the pre-training part yourself. The idea is that, hopefully, if you had enough relevant data within the distribution you care about, when your users are interacting with the model and your product, that would produce the highest-quality, most relevant responses. Now, in reality, getting that much data and training a model that size, there are only a few organizations in the world that can do it, and they're training on datasets that are like the entire internet. So it's kind of suspect, in my opinion, to say that that's going to be super relevant to anyone when it's that general. It's almost a built-in tradeoff.

Now there's this other world of fine-tuning, where you have a dataset. Maybe your dataset is quite a bit bigger, five to ten times bigger, maybe even a couple of orders of magnitude bigger, than the kind of datasets we're talking about with RAG. Or maybe not; for some models you don't need that much data to do fine-tuning. It's really a case-by-case basis that's worth studying as a data scientist, if you're into this stuff these days. But what you can do is actually keep training the LLM. So you're not pre-training; it's like a variation on transfer learning. You're taking a fork of the model's state and saying, okay, now here's a new dataset. Maybe it has a slightly different shape, or there's this trend called instruction tuning, where you format the labeled examples that you give the LLM very specifically, and then you use that to continue training the model. So you're actually updating the weights, the parameters of the neural network.

Of course, you could also combine this with RAG, so there's a lot of interesting research around this. I've even seen RAG being combined into pre-training, as a recent project from... so there's a whole bunch of different combinations, is the point. But the core idea of what we're talking about here is more this case of fine-tuning, where I actually want to change this blob and make it more dark purple by training on my data, training on a specific shape of data that I think is relevant to our apps and use cases, things like this.

So how do we actually do this? There are a ton of good examples on the internet these days, but one way is you can define your model scripts. This is not too different from a Google Colab notebook I found on the internet, just with slight refactoring and a few extra optimizations added. It should be runnable on a single V100, if you have a V100 GPU. And what it's doing is fine-tuning a QLoRA of the 7-billion-parameter Llama 2 model. So that was a bunch of jargon. I think maybe we should talk about Llama a little bit. Hugo, do you want to say a few things about the Llama model family, or anything that comes to mind there?

Well, yeah. A short introduction to everything you just said: Llama is a family of models that was open-sourced by Meta AI. Pretty hefty, big models. I mean, there's a range, but did you say 7B?

Yeah, this one's the small one. The 7B is the...

But it's still big, right? And then what LoRA does: it's a form of dimensionality reduction. It uses linear algebra, essentially, to reduce the size of it in very clever ways. The Q is "quantized," which is a technical detail I don't think we need to get into at the moment. That's, I suppose, my very TL;DR on what QLoRA on Llama 2 7B would mean. Is there anything else you'd add to that?

No, I think that's great. So in addition to adding a little bit of stuff from the Colab notebook I found, there's the framework, but also...

I'm sorry. To your point, though, even though we have a small model and we're reducing it even more, we still need GPUs, right?

Right. This one's not very feasible to run on a CPU. It would take way too long for the model to learn anything for that to be really worth your time, and honestly it might even be more expensive, because you'd need the computer to be running for so long.

Okay, anyways. Basically what's going on here is we're refactoring our custom dataset. We're pulling this nice instruction-tuning dataset that the Databricks team put together, called Dolly 15K, and we're packaging it in a slightly different way. This is what I meant when I was saying you can change the shape of how the model continues its responses by fine-tuning. What you can do is unpack the data in a specific format that matches how you would want the model to respond at runtime, and then curate your fine-tuning dataset to be in a shape like this, and it just directs the model about how to respond to things. This could be another good use case: if I want only this specific structure as the output, making a really good labeled instruction-tuning dataset like this could work.

What else is interesting here? We see the same API for importing our model, just like we were doing before in the tinier-model use case, right in your sandbox notebook with no GPUs or anything. It's the same thing: use this from_pretrained API. We have our quantization config, and then we're just picking a specific flavor of the Llama model. A few other Hugging Face bells and whistles here: of course there's the tokenizer piece, and then the API we're choosing to use, the highest-level API for model training in Hugging Face, is this object called the Trainer, which takes in another object called TrainingArguments. This is where we specify a lot of stuff, like how to use the GPUs and in what way you accumulate gradients. You pass it an optimizer, so you can get into the more PyTorch layer as well in the training arguments. There's a lot going on here, and then it's a simple trainer.train().

Now, the interesting thing story-wise here (we won't really go too much into the code) is that we're running this whole script inside of a Metaflow workflow. One of the reasons we're doing that is because then we can take this model after it's been trained, zip up the whole result, and push it to S3, and Metaflow makes it very easy to wire this up when we're running it inside of our workflow. Specifically, look at how few lines of code are inside of our Metaflow structure, where we're using the @kubernetes decorator and the @pypi decorator that Hugo mentioned earlier. We're just making it all happen so that we can then unpack that model on the server and actually run this on a Triton inference server, which is what the example shows here. So, a super high-level introduction to the fine-tuning and serving aspect of this workshop. Hugo, I think we should probably do a full workshop on each of those pieces independently at some point.

I was just going to say that. And the reason we've decoupled it like that is that all the code we've run so far you've been able to do in the sandbox, with relatively limited compute and time, whereas for this you'd need pretty significant hardware as well. That's why we wanted to do it this way. But hey, it looks like we're going back to the Argo Workflows.

Yeah, quickly, before my battery actually dies. We can see that this first workflow ran. We can go back and see that our second workflow also ran to completion, again automatically triggered, and our third workflow is now in flight. It finished the start process, and right now it's doing the heavy compute step, which is actually computing the embeddings for all the Metaflow documentation, and then it's going to push those into the Pinecone index. So that's pretty much the story, and we're out of slides and out of code in the sandbox so far. Any thoughts, Hugo, to wrap up the session?

Look, I'd actually just like to go back. I'm going to share my screen again and go back to our... can you see our slides?

Yeah. I'm going to quickly move and make sure I can go get a charger before my thing dies as you're doing this.

Sure, sure. What I wanted to do is just go back to remind you of everything we've done, and, if you coded along, everything you did, for which everyone should be super congratulated. You've learned about and played around with hitting LLM APIs. You've gone through and thought through with us how you can be more flexible with open-source models. Thinking about RAG, how to create one, and thinking about better relevancy there, going through the Metaflow docs, which was super fun. Then productionizing RAG. Then also going through fine-tuning a custom LLM; this is the code we didn't execute. So all in just over a couple of hours (and if you did this at 1.5x, under a couple of hours) you've managed to learn a huge amount about the space. As I said, in the description we'll put the link to our Slack community, where we love chatting about all of this type of stuff. We very much welcome feedback, and anything else you'd like to see in the sandbox we'd love to show you and produce at some point as well. But I think with that, that's a really nice note to wrap up on. Eddie, what do you think?

Sounds great. It was super fun, Hugo. We should definitely do this more often.

That was an incredible amount of fun. Thank you all for watching and sticking around until the end, and we'll see you on Slack. All right, ciao.

Yeah.
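
To make the QLoRA jargon from the transcript a little more concrete, here is a rough back-of-the-envelope sketch in Python. It is not the workshop's code. It assumes the nominal 7 billion parameter count, a single square 4096 x 4096 projection matrix (4096 being the hidden size of Llama 2 7B), and an illustrative LoRA rank of 8; the function names are made up for this example. It counts weight storage only and ignores activations, optimizer state, and quantization overhead, so real memory use will be higher.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns an update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


N_PARAMS = 7_000_000_000  # nominal "7B"; an assumption for illustration
HIDDEN = 4096             # hidden size of Llama 2 7B
RANK = 8                  # illustrative LoRA rank, not taken from the workshop

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # weights stored at 16 bits each
int4_gb = weight_memory_gb(N_PARAMS, 4)   # weights quantized to 4 bits each

full_matrix = HIDDEN * HIDDEN
lora_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"weights at 16-bit: {fp16_gb:.1f} GB")
print(f"weights at 4-bit:  {int4_gb:.1f} GB")
print(f"one {HIDDEN}x{HIDDEN} matrix: {full_matrix:,} params")
print(f"its rank-{RANK} LoRA update: {lora_matrix:,} params "
      f"({lora_matrix / full_matrix:.2%} of the full matrix)")
```

Under these assumptions the quantized weights take about a quarter of the 16-bit footprint, and the low-rank update trains well under one percent of the parameters of the matrix it adapts, which is the kind of saving that makes a single-GPU run plausible.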

Original Description

This is a 6 video series interactive guided tour to LLMs, RAG, & Fine-Tuning. The playlist is here: https://youtube.com/playlist?list=PLUsOvkBBnJBcZglk6QQyKGZsgEzClGnv-&si=66stnfv3-HXa60m9 You can also watch the full workshop here: https://youtu.be/uDBGwQ7JAzQ In this workshop, attendees will learn about methods for working with LLMs. Our stories will be guided by examples you can run on your laptop or in a (free) hosted cloud environment provided to attendees. Developers will expand their awareness of how researchers and product designers are working with LLMs, with emphasis on connecting high-level concepts such as fine-tuning and vector databases to the fundamental math and APIs data scientists should understand. Business-minded executives can either get hands-on or follow the higher-level stories to deepen their sense of what is possible with LLMs, the technicalities behind risks they introduce, and how they fit into the arc of ML. The primary value of this workshop will be as a guide to help teams set reasonable goals in the complex and fast-moving world of LLMs, and understand what you need to successfully support your team’s next LLM projects. What You’ll Learn: There are cheap (e.g., APIs) and expensive (e.g., fine-tuning, training) ways to build on top of LLMs. The methods you choose have consequences in apps you can build and how your dev team works. We will learn how to think about these choices as we develop basic apps you can use as templates for future genAI projects. Learners have the option to follow along in a provided dev environment where we will unpack these choices and make the tradeoffs and decision space concrete. The Github repository is here: https://github.com/outerbounds/generative-ai-summit-austin-2023

This video gives a high-level introduction to fine-tuning a custom LLM based on the Llama 2 model and notes that fine-tuning can be combined with RAG. It touches on preparing an instruction-tuning dataset, training with Hugging Face APIs, and managing the workflow with Metaflow on Kubernetes. The fine-tuning code is shown but not executed in the session, because it needs significant GPU hardware.

Key Takeaways
  1. Refactor a custom dataset to change the shape of the model's responses
  2. Curate the fine-tuning dataset in a specific format (here, the Dolly 15K instruction-tuning dataset from Databricks)
  3. Use the Hugging Face Trainer API to train the model, together with a quantization config and a tokenizer
  4. Run the training script inside a Metaflow workflow, then package the trained model and push it to S3
  5. Serve the fine-tuned model on a Triton inference server
  6. In a separate workflow, compute embeddings of the Metaflow documentation and push them into the Pinecone index
  7. Optionally combine fine-tuning with RAG
💡 Pre-training your own model on enough relevant data would in principle give the most relevant responses, but only a few organizations can do it. Fine-tuning an existing model on your own data is the practical alternative, and it can be combined with RAG.
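
As a rough illustration of the first two takeaways, here is a minimal Python sketch of reshaping instruction-tuning records into a single training string. It assumes records with the `instruction`, `context`, and `response` fields used by the Dolly 15K dataset; the template text and the `format_example` helper are hypothetical and are not the workshop's actual code.

```python
from typing import Dict

# Hypothetical prompt templates; the workshop's real template may differ.
TEMPLATE_WITH_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n{response}"
)
TEMPLATE_NO_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def format_example(record: Dict[str, str]) -> str:
    """Turn one Dolly-style record into the text the model is trained to continue."""
    instruction = record["instruction"].strip()
    response = record["response"].strip()
    context = record.get("context", "").strip()
    if context:
        return TEMPLATE_WITH_CONTEXT.format(
            instruction=instruction, context=context, response=response
        )
    # Many records have no context, so that section is dropped entirely.
    return TEMPLATE_NO_CONTEXT.format(instruction=instruction, response=response)


# Made-up record, for illustration only.
example = {
    "instruction": "What is Metaflow?",
    "context": "",
    "response": "A framework for building and managing ML workflows.",
}
print(format_example(example))
```

Because every training example ends with the same response section, a model fine-tuned on text like this tends to answer in that structure at inference time, provided prompts are formatted the same way up to the response header.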
