Serving LLaMA2 with Replicate
Key Takeaways
The video demonstrates serving LLaMA2 with Replicate, a cloud platform for fine-tuning and deploying LLMs, and covers its API tokens, LangChain integration, and pricing based on hardware usage. It shows Replicate hosting LLaMA2 models, including the 70 billion parameter model, and discusses the costs and response times of using the platform.
Full Transcript
Okay, in this video I'm going to look at serving the Llama 2 70 billion model in the cloud. I'll probably do a few videos on different ways to do this; this is one I came across that I think is kind of interesting. I'll show you where you can play with this for free, and then also a service where you can use their API and pay per second of prediction.

So if you look here, this is llama2.ai. The domain llama2.ai lets us play around with the 70 billion parameter model, and it's basically sponsored by a16z, the venture capital firm. One of the things you can do here that I really like is that you can come in and actually play with the system prompt. I don't think the Hugging Face one lets you do this. The cool thing is that the bigger models pay more attention to the system prompt. You can see here I'm saying to it, "You are a helpful but totally drunk assistant. You slur your words and spell badly a lot." So let's see how it goes when we say "Morning, how are you?" to it.

Okay, we can see that our model is now starting to return our drunk assistant, and sure enough it seems to be slurring a lot of its words. It's not hugely misspelling, but sort of slurring the spelling, where it's basically using repeated characters and so on. Again, when I ask it, "Can you tell me about the Olympics?", just to show you this: you will find that at different times of the day the reply takes longer to come through. I don't think it's actually computing for longer; I think you're just waiting for it to reply and come back. So you can see our thing has replied, and we've got this pretty unhelpful assistant that we've been chatting to.

Now what I want to do is jump in and look at the startup that is serving that model behind the scenes. This is replicate.com, and you can see that they're serving a whole bunch of different models. You can serve private models here, but they also have public APIs for models that you can try out. I think there are a number of companies doing this kind of thing. One of them was Mosaic, and they got bought; I don't know what's happened with them now, and they never seemed to actually open up their inference API for people to use. But here we've got Replicate, where we can go through and see they're serving a bunch of different image models, and we've got audio generation models, but the thing we're really after is the Llama 2 language models. Sure enough, we've got a number of different Llamas here.

If we click in and have a look at this model, we can see that it's got an API you can use. We can see that it's running the system prompt like we tried on the Andreessen Horowitz one. My guess is that Andreessen Horowitz is probably an investor in this company, and that's why they're using it. But it has everything we want to be able to run this in the cloud and even stream our responses back.

If we jump in and have a look at the pricing, the pricing here is determined basically by what hardware you use. We know that the Llama 2 model is running on an Nvidia A100 GPU, so that is costing 0.32 cents per second, or about 19 cents per minute. Now the difference here is that you're not paying for endless uptime; you're only paying for when it's actually making predictions. Only when you call it, and it's running, making your prediction and sending it back, are you paying for the model's time. This is quite different from, say, serving a model on Hugging Face inference or on a lot of other services, where you're paying for the actual time that the server and the GPU are up.

I'm not going to say that this is always going to be cheaper. I think if you're putting this into production, you'd probably be better off looking at something else and serving it yourself. But compare this to something like AWS, where people are serving these models and it's often costing them over thirty dollars an hour to have that infrastructure up and the model running. In this case we don't need to do any of that. There's not even a cold start problem of waiting for it, because we're using the public hosted version. We could host our own custom models here as well; then we would also have to pay for the startup time.

Another good thing is that when you sign up you can get your API token. You don't need to put in a credit card straight away; they will give you some amount of credits to try out the service. I encourage you, even if you just want to see what Llama 2 70 billion is like when you mess with it, rather than just going to the llama2.ai website, to come along here and play with it yourself.

They've got a whole docs section where you can use it with different kinds of services, and there are lots of examples. They've even got a Colab here. I'm actually not going to go through that one; it seems to focus more on their image model stuff. What I'm going to do is go through a notebook on using this with LangChain.

So we're now in the Colab, looking at using Llama 2 with Replicate and LangChain. You'll get your Replicate API token and put it in here, and then you just import the LLM as Replicate and set it up something like this. You basically go to Replicate and get the key for the model. You can see this is using the Llama 13 billion; this is from one of their examples. You run it through and you get this back. You can also stream it out. If we're streaming something, we can run through and see that the streaming comes out quite nicely, and it's quite quick. This is using the 13 billion. Even with the 70 billion you'll see that the streaming is pretty decent. So this is the 70 billion model, and obviously the streaming is slower, but we are getting the streaming coming through. On Colab it tends to go very wide here, but we can see that we've got streaming going along nicely, and this is a Llama 2 model running at full resolution.

I've just taken the notebook from the previous video and converted it across. It actually doesn't require much conversion at all. You can see that here we've just gotten rid of the pipeline from when we were running the 7B, and we've swapped it out with the LLM so that we're running the 70B. We've got the summarization and so on that we did in the previous video. You'll see that I've got streaming coming back, and at the end it also prints it out. The streamed output on top doesn't wrap, whereas this one is wrapped.

We've got a simple chatbot that's the same as what I did in the previous video. In future we'll have a look at using this with some tools and some other stuff as well. You can see that here it's able to go through a sort of conversation, and I've put in the timing so we can see the wall time of roughly how long these are taking to predict. You'll see that when I asked the one about the Olympics, it actually takes 62 seconds to come back, so that's costing us around 19 cents.

It's not cheap to run these models in the cloud, and this is what I often think about when people give OpenAI a hard time for their costs: if you're going to run these things in production, you often find that running your own models can be very expensive. But the advantage you have here, obviously, is that you can fully fine-tune this yourself and set it up the way you want, as opposed to OpenAI, where currently we can't do that.

You will find that most of the responses come back in pretty decent times for a 70 billion full-resolution model. So this just gives you one advantage of using Replicate to serve this kind of model. You can actually do the fine-tuning on Replicate as well. I'm not going to look at that in this video; we just wanted to get the full 70B model up so that we can then use it for some other things. I'm also going to look at getting the 4-bit one going. I know a lot of people are really eager for that, so I'm trying a bunch of different 4-bit ones to work out which I think is going to be the best.

Anyway, as always, if you've got questions, please feel free to put them in the comments below. If you're interested in seeing more videos on this kind of stuff, please click and subscribe, and I will talk to you in the next video. Bye for now.
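The pricing figures quoted in the transcript can be checked with a short calculation. What follows is a minimal sketch, assuming billing is strictly linear in seconds of run time. Only the 0.32 cents per second rate and the 62 second wall time come from the video; the function and constant names are hypothetical and are not part of any Replicate API.

```python
# Rate quoted in the video for an Nvidia A100 on Replicate (cents per second).
# Assumption: cost scales linearly with run time, with no minimum charge.
A100_CENTS_PER_SECOND = 0.32


def prediction_cost_cents(seconds, cents_per_second=A100_CENTS_PER_SECOND):
    """Return the cost in cents of a prediction that runs for `seconds`."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return seconds * cents_per_second


def hourly_cost_dollars(cents_per_second=A100_CENTS_PER_SECOND):
    """Return the cost in dollars of one hour of continuous prediction."""
    return prediction_cost_cents(3600, cents_per_second) / 100


if __name__ == "__main__":
    # One minute of prediction time, matching the per-minute figure quoted.
    print(f"60 s: {prediction_cost_cents(60):.2f} cents")
    # The Olympics question from the notebook, which took 62 s of wall time.
    print(f"62 s: {prediction_cost_cents(62):.2f} cents")
    # A full hour of back-to-back predictions, for comparison with an
    # always-on server billed by the hour.
    print(f"1 h:  {hourly_cost_dollars():.2f} dollars")
```

At this rate the 62 second reply works out to 19.84 cents, so the "around 19 cents" in the transcript is roughly the one-minute figure. An hour of continuous prediction comes to 11.52 dollars, which is only a rough point of comparison with the thirty dollars an hour mentioned for self-hosted infrastructure, since it ignores concurrency and any cold start charges for custom models.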
Original Description
Serving LLaMA2 with Replicate
Colab: https://drp.li/SBO4S
Replicate site: https://replicate.com/ #this video is not sponsored by them
For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen
My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/
Github:
https://github.com/samwit/langchain-tutorials (updated)
https://github.com/samwit/llm-tutorials
00:00 Intro
00:26 Play around with LLaMA 2 Chatbot
01:51 Replicate.com
02:44 Replicate LLaMA 2 70B Chatbot
03:12 Replicate Pricing
05:21 Replicate Docs
05:44 Code Time