A Custom Fine-Tuned LLM in Action (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 5)

Outerbounds · Beginner ·🔍 RAG & Vector Search ·2y ago

Key Takeaways

This part of the video series walks through fine-tuning a custom LLM: a QLoRA fine-tune of the 7-billion-parameter Llama 2 model using Hugging Face APIs, with a training script adapted from a Google Colab notebook. It also touches on combining fine-tuning with retrieval-augmented generation (RAG). The training script runs inside a Metaflow workflow that uses the @kubernetes and @pypi decorators, and the resulting model is served on a Triton inference server.

Full Transcript

Let's do it, man. Let's just go quickly through the slides. They tell the story a little bit, and then we can pop into the fine-tuning code and look at more Hugging Face APIs.

And this is the last part of the workshop as well?

Yes, yes. And this part is going to be a little hands-off. We'll do full-on fine-tuning workshops in the future; this one is very hand-wavy, to be honest.

Cool. So, fully custom LLM. Now we're changing this picture by making everything dark purple. What we mean is basically: if you have enough relevant data and you have enough compute capacity, you can make your own model and just do the pre-training part yourself. The idea is that, hopefully, if you had enough relevant data that's within the distribution you care about when your users are interacting with the model and your product, that would produce the highest-quality, most relevant responses. Now, in reality, getting that much data and training a model that size, there are only a few organizations in the world that can do it, and they're training on data sets that are like the entire internet. So it's kind of suspect, in my opinion, to say that that's going to be super relevant to anyone when it's that general. It's almost a built-in tradeoff.

Now there's this other world of fine-tuning, where you have a data set. Maybe your data set is quite a bit bigger, five to ten times bigger, maybe even a couple of orders of magnitude bigger than the kind of data sets we're talking about with RAG. Or maybe not; with some models you don't need that much data to do fine-tuning. It's really a case-by-case basis that's worth studying as a data scientist if you're into this stuff these days. But what you can do is actually keep training the LLM. So you're not pre-training; you're taking a fork (it's like a variation on transfer learning). You're taking a fork of the model's state and saying, "Okay, now here's a new data set." Maybe it has a slightly different shape, or there's this trend called instruction tuning, where you format the labeled examples that you give the LLM very specifically, and then you use that to continue training the model. So you're actually updating the weights, the parameters of the neural network.

Of course, you could also combine this with RAG, so there's a lot of interesting research around this. I've even seen RAG being combined into pre-training as a recent project from, um... so there's a whole bunch of different combinations, is the point. But the core idea of what we're talking about here is more this case of fine-tuning, where I actually want to change this blob and make it more dark purple by training on my data, training on a specific shape of data that I think is relevant to our apps and use cases, things like this.

So how do we actually do this? There are a ton of good examples on the internet these days, but one way is you can define your model scripts. This is not too different from a Google Colab notebook I found on the internet, just with slight refactoring and a few extra optimizations added. It should be runnable on a single V100, if you have a V100 GPU, and what it's doing is a QLoRA fine-tune of the seven-billion-parameter Llama 2 model. So that was a bunch of jargon. I think maybe we should talk about Llama a little bit. Hugo, do you want to say a few things about the Llama model family, or anything that comes to mind there?

Well, yeah. A short introduction to everything you just said: Llama is a family that was open-sourced by Meta AI. Pretty hefty, big models. I mean, there's a range, you know, but did you say 7B?

Yeah, this one's the small one. The 7B is the...

So, but it's still big, right? And then what LoRA does: it's a form of dimensionality reduction. It uses linear algebra, essentially, to reduce the size of it in very clever ways. The Q is "quantized," which is a technical detail I don't think we need to get into at the moment. But that's, I suppose, my very TL;DR on what QLoRA on Llama 2 7B would mean. Is there anything else you'd add to that?

No, I think that's great. So, in addition to adding a little bit of stuff to the Colab notebook I found as the framework, but also...

I'm sorry, yeah. To your point, though: even though we have a small model and we're reducing it even more, we still need GPUs, right?

Right. This one's not very feasible to run on a CPU. It would take way too long for the model to learn anything for that to be really worth your time, or honestly it might even be more expensive, because you need the computer to be running so long.

Okay, anyways. Basically what's going on here is we're refactoring our custom data set. We're pulling this nice instruction-tuning data set that the Databricks team put together, called Dolly 15K, and we're packaging it in a slightly different way. This is what I meant when I was saying you can change the shape of how the model continues its responses by fine-tuning. What you can do is unpack the data in a specific format that's how you would want the model to respond at runtime, and then you can curate your fine-tuning data set to be in a shape like this, and it just directs the model about how to respond to things. This could be another good use case: "I want this specific structure only as the output." If you make a really good labeled instruction-tuning data set, this could work.

What else is interesting here? We see the same API for importing our model. Just like we were doing before in the tinier model use case, right in your sandbox notebook with no GPUs or anything, it's the same thing: use this from_pretrained API. We have our quantization config, and then we're just picking a specific flavor of the Llama model. A few other Hugging Face bells and whistles here: of course there's the tokenizer piece. And then the API that we're choosing to use, the highest-level API for model training in Hugging Face, is this object called the Trainer, which takes in another object called TrainingArguments. This is where we specify a lot of stuff, like how to use the GPUs and in what way you accumulate gradients. You pass it an optimizer, so you can get into the more PyTorch layer as well in the training arguments. There's a lot of stuff going on here, and then it's a simple trainer.train().

Now, the interesting thing story-wise here (we won't really go too much into the code) is that we're running this whole script inside of a Metaflow workflow. One of the reasons we're doing that is because then we can take this model after it's been trained, zip up the whole result, and push it to S3, and Metaflow makes it very easy to wire this up when we're running it inside of our workflow. Specifically, look at how few lines of code are inside of our Metaflow structure, where we're using the @kubernetes decorator and the @pypi decorator that Hugo mentioned earlier. We're just making it all happen so that we can then unpack that model on the server and actually run this on a Triton inference server, in this case, which is what the example shows here.

So that's a super high-level introduction to the fine-tuning and serving aspect of this workshop. Hugo, I think we should probably do a full workshop on each of those pieces independently at some point.

I was just going to say that. And the reason we've decoupled it like that is that all the code we've run so far you've been able to do in the sandbox, with relatively limited compute and time, whereas for this you'd need pretty significant hardware as well. So that's why we wanted to do it this way. But hey, it looks like we're going back to the Argo Workflows?

Yeah, quickly, before my battery actually dies. We can see that this first workflow ran. We can go back and see that our second workflow also ran to completion, again automatically triggered, and our third workflow is now in flight. It finished the start process; right now it's doing the heavy compute step, which is actually computing the embeddings on all the Metaflow documentation, and then it's going to push those into the Pinecone index. So that's pretty much the story, and we're out of slides and out of code in the sandbox so far. Any thoughts, Hugo, to wrap up the session?

Look, I'd actually just like to go back. I'm going to share my screen again and just go back to our... can you see my slides? Our slides?

Yeah. I'm going to quickly move and just make sure I can go get a charger before my thing dies, as you're doing this.

Sure, sure. What I wanted to do is just go back to remind you of everything we've done, and, if you coded along, everything you did, for which everyone should be super congratulated. You've learned and played around with hitting LLM APIs. You've gone through and thought through with us how you can be more flexible with open-source software models. Thinking about RAG [unclear], and thinking about better relevancy there, going through the Metaflow docs, which was super fun. Then productionizing RAG. Then also going through (this is the code we didn't execute) fine-tuning a custom LLM. So, all in just over a couple of hours, and if you did this at 1.5x, under a couple of hours, you've managed to learn a huge amount about the space.

As I said, in the description we'll put the link to our Slack community, where we love chatting about all of this type of stuff. We very much welcome feedback. Anything else you'd like to see in the sandbox, we would love to show you and produce at some point as well. But I think with that, that's a really nice note to wrap up on. Eddie, what do you think?

Sounds great. It was super fun, Hugo. We should definitely do this more often.

That was an incredible amount of fun. Thank you all for watching and sticking around until the end, and we'll see you on Slack.

All right, ciao. Yeah.
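The transcript describes repackaging Dolly 15K records into the shape you want the model to respond in. The code shown in the video is not reproduced on this page, so the following is only a minimal sketch of the idea. The field names (instruction, context, response) follow the published Dolly 15K schema; the prompt template and the function name format_record are hypothetical choices made for illustration.

```python
# Hypothetical prompt templates: the workshop's actual template is not shown
# in the transcript, so these section markers are an assumption.
PROMPT_WITH_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n{response}"
)
PROMPT_NO_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def format_record(record: dict) -> str:
    """Turn one Dolly-style record into a single training string."""
    instruction = record["instruction"].strip()
    context = record.get("context", "").strip()
    response = record["response"].strip()
    if context:
        # Records with supporting text get an extra Context section.
        return PROMPT_WITH_CONTEXT.format(
            instruction=instruction, context=context, response=response
        )
    return PROMPT_NO_CONTEXT.format(instruction=instruction, response=response)


# A made-up record in the Dolly 15K layout.
example = {
    "instruction": "What is Metaflow?",
    "context": "",
    "response": "A framework for building data science workflows.",
    "category": "open_qa",
}
training_text = format_record(example)
```

Because every training example ends with the same "### Response:" section, a model fine-tuned on strings like these tends to continue prompts in that structure, which is the "shape" effect the speakers describe.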

Original Description

This is a 6 video series interactive guided tour to LLMs, RAG, & Fine-Tuning. The playlist is here: https://youtube.com/playlist?list=PLUsOvkBBnJBcZglk6QQyKGZsgEzClGnv-&si=66stnfv3-HXa60m9 You can also watch the full workshop here: https://youtu.be/uDBGwQ7JAzQ In this workshop, attendees will learn about methods for working with LLMs. Our stories will be guided by examples you can run on your laptop or in a (free) hosted cloud environment provided to attendees. Developers will expand their awareness of how researchers and product designers are working with LLMs, with emphasis on connecting high-level concepts such as fine-tuning and vector databases to the fundamental math and APIs data scientists should understand. Business-minded executives can either get hands-on or follow the higher-level stories to deepen their sense of what is possible with LLMs, the technicalities behind risks they introduce, and how they fit into the arc of ML. The primary value of this workshop will be as a guide to help teams set reasonable goals in the complex and fast-moving world of LLMs, and understand what you need to successfully support your team’s next LLM projects. What You’ll Learn: There are cheap (e.g., APIs) and expensive (e.g., fine-tuning, training) ways to build on top of LLMs. The methods you choose have consequences in apps you can build and how your dev team works. We will learn how to think about these choices as we develop basic apps you can use as templates for future genAI projects. Learners have the option to follow along in a provided dev environment where we will unpack these choices and make the tradeoffs and decision space concrete. The Github repository is here: https://github.com/outerbounds/generative-ai-summit-austin-2023

This video series provides a comprehensive guide to fine-tuning a custom LLM using the Llama 2 model and combining it with RAG for improved performance. The series covers the basics of fine-tuning, RAG, and workflow management using Metaflow and Kubernetes. By following this series, viewers can learn how to fine-tune a custom LLM and improve its performance using RAG.

Key Takeaways

Refactor a custom dataset to change the shape of the model's responses
Curate the fine-tuning data set in a specific format
Use the Hugging Face Trainer API to train the model with a custom quantization config and tokenizer
Run the training script inside a Metaflow workflow, then package the trained model and push it to S3
Serve the fine-tuned model on a Triton inference server
Compute embeddings of the Metaflow documentation in a separate workflow and push them into the Pinecone index
Combine fine-tuning of an open-source model with RAG where it helps
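
In the session, the trained model is zipped up before being pushed to S3 and later unpacked on the server. The packaging half of that step can be sketched with the Python standard library alone. This is an illustration, not the workshop's code: the function name package_model_dir and the file name adapter_config.json are placeholders, and the upload itself (for example with Metaflow's S3 client) is left out.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(model_dir: str, archive_path: str) -> list[str]:
    """Write model_dir into a gzipped tarball and return the archived file names."""
    src = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths relative, so the archive unpacks cleanly elsewhere.
        tar.add(src, arcname=src.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Usage with a throwaway directory standing in for real training output.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    (model_dir / "adapter_config.json").write_text("{}")
    archived = package_model_dir(str(model_dir), str(Path(tmp) / "model.tar.gz"))
```

Shipping a single archive means the serving side only has to fetch and unpack one object, rather than track a directory of individual files.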

💡 Pre-training a model from scratch on enough relevant data would in principle give the most relevant responses, but only a few organizations can do it. Fine-tuning an existing model on your own data is the practical alternative, and it can be combined with RAG.
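
Part of what makes fine-tuning feasible on a single GPU is LoRA, which the hosts describe as a form of dimensionality reduction using linear algebra. A rough way to see the saving: instead of updating a full weight matrix, LoRA trains two thin matrices whose product has the same shape. The arithmetic below is a sketch under stated assumptions. The hidden size 4096 matches Llama 2 7B, but the rank of 8 is an illustrative choice, not a value taken from the workshop.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the weight matrix W directly
    lora = rank * (d_out + d_in)   # training B (d_out x r) and A (r x d_in) instead
    return full, lora


# One square projection matrix at Llama 2 7B's hidden size, with an assumed rank of 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
ratio = lora / full  # fraction of the matrix's parameters that LoRA trains
```

With these numbers the low-rank update trains well under one percent of the parameters of that matrix. The "Q" in QLoRA additionally stores the frozen base weights in a quantized form to save memory.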
