Serving LLaMA2 with Replicate

Sam Witteveen · Beginner · 🧠 Large Language Models · 2y ago

Key Takeaways

The video demonstrates serving LLaMA2 with Replicate, a cloud platform for fine-tuning and deploying LLMs, and walks through its API tokens, LangChain integration, and pricing based on hardware usage. It shows Replicate hosting LLaMA2 models, including the 70 billion parameter model, and discusses the costs and response times of using the platform.

Full Transcript

Okay, in this video I'm going to look at serving the Llama 2 70 billion model in the cloud, and I'll probably do a few videos on different ways to do this. This is one that I came across which I think is kind of interesting. I'll show you where you can play with this for free, but then also a service where you can use their API and pay per second of prediction, so we'll have a look at that.

So if you look here, this is llama2.ai. The domain llama2.ai allows us to play around with the 70 billion parameter model, and it's basically sponsored by the a16z venture capital firm. One of the things that you can do here that I really like is you can come in and actually play with the system prompt. Now, I don't think the Hugging Face one lets you do this. The cool thing here is that the bigger models pay more attention to the system prompt. So you can see here I'm saying to it, "You are a helpful but totally drunk assistant. You slur your words and spell badly a lot." So let's see, when we say "Morning, how are you?" to it, how does it actually go with this?

Okay, we can see that our model is now starting to return our drunk assistant, and sure enough it seems to be slurring a lot of its words, and also not hugely misspelling but just sort of slurring its spelling, where it's basically using repeated characters and stuff. Again, when I ask it, "Can you tell me about the Olympics?", just to show you this: you will find that at different times of the day the reply takes longer to come through. I don't think it's actually taking longer to compute; I think you're just waiting for it to reply and come back. So you can see, okay, our thing has replied. We've got this pretty unhelpful assistant here that we've been chatting to.

Now what I want to do is jump in and look at the startup that is serving that model behind the scenes. So this is replicate.com, and you can see that they're serving a whole bunch of different models. You can serve private models here, but they also have public APIs for models that you can try out. I think there are a number of companies doing this kind of thing. One of them was Mosaic, and they got bought. I don't know what's happened with them now, and they never seemed to actually open up their inference API for people to use. But here we've basically got Replicate, where we can go through and see they're serving a bunch of the different image models, we've got audio generation models, but the thing that we're really after is the Llama 2 language models.

And sure enough, we've got a number of different Llamas here. If we click in and have a look at this model, we can see that it's got an API that you can use. We can see that it's running the system prompt like we tried on the Andreessen Horowitz one. My guess is that Andreessen Horowitz probably is an investor in this company and that's why they're using this. But it has everything that we want to be able to run this in the cloud and even stream our responses back.

So if we jump in and have a look at the pricing, the pricing here is determined basically by what hardware you use. We know that the Llama 2 model is running on an Nvidia A100 GPU, so that is basically costing 0.32 cents per second, or 19 cents per minute. Now, the difference here is you're not paying for just endless uptime; you're only paying for when it's actually making predictions. So only when you call it, and it's running, making your prediction, and sending it back, are you paying for that time of the model. This is quite different from, say, serving a model on Hugging Face inference or on a lot of other things, where you're paying for the actual time that the server is up and the GPU is up.

Now, I'm not going to say that this is always going to be cheaper. I think if you're putting this into production, you'd probably be better off looking at something else and serving it yourself. But compare this to something like AWS, where people are serving these models and it's often costing them over thirty dollars an hour to have that infrastructure up and the model running. In this case we don't need to do any of that. There's not even a cold start problem here of waiting for it, because we're using the public hosted version of this. Now, we could host our own custom models in here as well; then we would also have to pay for the startup time that was going on for that.

Another good thing is that when you sign up you can get your API token, and you don't need to put in a credit card straight away. They will actually give you some amount of credits to try out the service. I encourage you, even if you just want to see, "Okay, what's Llama 2 70 billion like, and if I mess with this, what will it be like?": rather than just going to the llama2.ai website and playing with it there, if you want to actually play with it yourself, you can come along here and have a look at that.

They've got a whole docs section where you can use it with different kinds of services. There are lots of examples here. They've even got a Colab here. I'm actually not going to go through that one; it seems to focus more on their image model stuff. What I'm going to do is go through a notebook of using this with LangChain and look at how you could use it with LangChain.

So we're now in the Colab to look at using Llama 2 with Replicate and using it with LangChain. You'll get your Replicate API token and put it in here, and then you just import the LLM as Replicate and set it up something like this. You basically go to Replicate and get the key for the model. You can see here this is using the Llama 13 billion; this is in one of their examples. You run it through and then you will get this back. You can also stream it out. If we're streaming something out, we can run through and see that the streaming comes out quite nicely and it's quite quick. So this is using the 13 billion. Even with the 70 billion you will see that the streaming is pretty decent. So this is the 70 billion model, and obviously the streaming is slower, but we are getting the streaming coming through. On Colab it tends to go very wide here, but we can see that we've got streaming going along nicely, and this is a Llama 2 model running at full resolution.

So I've just taken the notebook from the previous video and converted that across. It actually doesn't require much conversion at all. You can see that here we've basically just gotten rid of the pipeline for when we were running the 7B, and we've now swapped it out with the LLM so that we're running the 70B. So we've got the summarization and stuff like that that we did in the previous video. You'll see here that I've got streaming coming back, and I've also got it printing out at the end. The streaming went on top, and it doesn't wrap it, whereas with this one it is wrapping it. We've got a simple chatbot that's the same as what I did in the previous video. In future we'll have a look at using this with some tools and some other stuff as well.

You can see that here it's able to go through a sort of conversation, and I've put in the time so we can see the wall time of roughly how long these are taking to predict. You will see that when I asked the one about the Olympics, it does actually take 62 seconds to come back, so that's costing us around 19 cents. It's not cheap to run these models in the cloud, and this is what I often think about when people give OpenAI a hard time for the cost there. If you're going to run these things in production, you often find that running your own models can be very expensive. But the advantage that you have here, obviously, is that you can fully fine-tune this yourself and set it up the way that you want it, as opposed to OpenAI, where currently we can't do that. You will find that most of the responses come back in pretty decent times for a 70 billion full resolution model.

So this just gives you one advantage of using Replicate to serve this kind of model. You can actually do the fine-tuning on Replicate as well. I'm not going to be looking at that in this video; we just wanted to look at getting the full 70B model up so that we can then use it from some other things. I'm also going to look at getting the 4-bit one going. I know a lot of people are really eager for that, so I'm trying a bunch of different 4-bit ones to work out what I think is going to be the best for that currently.

Anyway, as always, if you've got questions, please feel free to put them in the comments below. If you're interested in seeing more videos of this kind of stuff, please click and subscribe, and I will talk to you in the next video. Bye for now. Thank you.
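The pricing arithmetic and the wall-time measurement described in the transcript can be sketched in a few lines of Python. This is a minimal illustration, not the notebook from the video: the rate of 0.32 cents per second and the 62-second reply come from the transcript, while `prediction_cost`, `timed_call`, and `fake_llm` are hypothetical names introduced here, and `fake_llm` is only a stand-in for a real model call.

```python
import time

# Rate quoted in the transcript for the A100: "0.32 cents per second".
A100_USD_PER_SECOND = 0.0032


def prediction_cost(seconds, usd_per_second=A100_USD_PER_SECOND):
    """Cost in USD of a prediction billed per second of run time."""
    return seconds * usd_per_second


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_llm(prompt):
    """Hypothetical stand-in for a hosted model call (assumption, not a real API)."""
    time.sleep(0.05)
    return f"(reply to: {prompt})"


print(f"one minute of prediction: ${prediction_cost(60):.3f}")
print(f"62-second reply:          ${prediction_cost(62):.4f}")

reply, seconds = timed_call(fake_llm, "Can you tell me about the Olympics?")
print(f"{reply} took {seconds:.2f}s, about ${prediction_cost(seconds):.6f}")
```

At the quoted rate, one minute works out to about 19 cents and the 62-second reply to just under 20 cents, in line with the figures mentioned in the video. Wall time measured on the client also includes network and any waiting time, so it is only a rough proxy for the seconds actually billed.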

Original Description

Serving LLaMA2 with Replicate

Colab: https://drp.li/SBO4S
Replicate site: https://replicate.com/
#this video is not sponsored by them

For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen

My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/

Github:
https://github.com/samwit/langchain-tutorials
(updated) https://github.com/samwit/llm-tutorials

00:00 Intro
00:26 Play around with LLaMA 2 Chatbot
01:51 Replicate.com
02:44 Replicate LLaMA 2 70B Chatbot
03:12 Replicate Pricing
05:21 Replicate Docs
05:44 Code Time

This video shows how to serve LLaMA2 with Replicate, a cloud platform for fine-tuning and deploying LLMs. It covers API tokens, LangChain integration, and pricing based on hardware usage. By following along, viewers can learn how to call the hosted LLaMA2 models from a notebook; fine-tuning on Replicate is mentioned but not demonstrated.

Key Takeaways
  1. Get a Replicate API token and set it in the notebook
  2. Set up the LLaMA2 model as an LLM in LangChain
  3. Run the 13 billion parameter LLaMA2 model
  4. Run the 70 billion parameter LLaMA2 model
  5. Stream the output of the LLaMA2 model
  6. Fine-tuning LLaMA2 on Replicate is possible but is not covered in this video
  7. Record the wall time of each response to estimate its per-second cost
💡 Replicate bills per second of prediction rather than for server uptime, and the model can be fine-tuned and set up the way you want. The video cautions, however, that running a 70B model in the cloud is not cheap: a 62-second response cost around 19 cents.
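Steps 1 to 5 above can be sketched roughly as follows. This is a hedged sketch rather than the notebook from the video: it assumes the `langchain-community` and `replicate` packages are installed, it uses the current `langchain_community.llms.Replicate` import (a 2023 notebook would likely have used an older import path), and the generation settings in `llama2_kwargs` are assumed names that vary between model versions, so check the model's API page on replicate.com. `llama2_kwargs`, `build_llm`, and the `LLAMA2_MODEL_ID` environment variable are hypothetical names introduced for illustration; the model string itself has to be copied from Replicate.

```python
import os


def llama2_kwargs(temperature=0.75, max_length=500, top_p=1.0):
    """Generation settings passed to the hosted model (parameter names are assumptions)."""
    return {"temperature": temperature, "max_length": max_length, "top_p": top_p}


def build_llm(model_id):
    """Create a LangChain LLM backed by a Replicate-hosted model."""
    # Imported lazily so the helper above can be used without the packages installed.
    from langchain_community.llms import Replicate

    # The wrapper reads the API token from the REPLICATE_API_TOKEN environment variable.
    return Replicate(model=model_id, model_kwargs=llama2_kwargs())


# Only call the API when both a token and a model string are available.
# LLAMA2_MODEL_ID is a hypothetical variable holding the "owner/name:version"
# string copied from the model's page on replicate.com.
if os.environ.get("REPLICATE_API_TOKEN") and os.environ.get("LLAMA2_MODEL_ID"):
    llm = build_llm(os.environ["LLAMA2_MODEL_ID"])

    # Single blocking call.
    print(llm.invoke("Can you tell me about the Olympics?"))

    # Streaming: print chunks as they arrive.
    for chunk in llm.stream("Morning, how are you?"):
        print(chunk, end="", flush=True)
```

Switching between the 13 billion and 70 billion models should only require changing the model string, which matches the transcript's remark that the notebook needed very little conversion.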

Chapters (7)

0:00 Intro
0:26 Play around with LLaMA 2 Chatbot
1:51 Replicate.com
2:44 Replicate LLaMA 2 70B Chatbot
3:12 Replicate Pricing
5:21 Replicate Docs
5:44 Code Time