Serving LLaMA2 with Replicate

Sam Witteveen · Beginner · 🧠 Large Language Models · 2y ago

Skills: LLM Foundations 90% · LLM Engineering 80% · Fine-tuning LLMs 70%

Key Takeaways

The video demonstrates serving LLaMA2 with Replicate, a cloud platform for deploying and fine-tuning LLMs. It walks through getting an API token, calling the hosted models through LangChain, and Replicate's pricing, which is billed per second of prediction and depends on the hardware used. It shows Replicate hosting LLaMA2 models, including the 70 billion parameter model, and discusses the costs and response times observed when using the platform.

Full Transcript

Okay, in this video I'm going to look at serving the Llama 2 70 billion model in the cloud. I'll probably do a few videos on different ways to do this. This is one that I came across which I think is kind of interesting. I'll show you where you can play with this for free, and then also a service where you can use their API and pay per second of prediction. So we'll have a look at that.

If you look here, this is llama2.ai. The domain llama2.ai lets us play around with the 70 billion parameter model, and it's basically sponsored by a16z, the venture capital firm. One of the things you can do here that I really like is you can come in and actually play with the system prompt. I don't think the Hugging Face one lets you do this. The cool thing is that the bigger models pay more attention to the system prompt. You can see here I'm saying to it, "You are a helpful but totally drunk assistant. You slur your words and spell badly a lot." So let's see, when we say "Morning, how are you?" to it, how it actually goes with this.

Okay, we can see that our model is now returning our drunk assistant, and sure enough it seems to be slurring a lot of its words. It's not hugely misspelling, more slurring the spelling, where it's using repeated characters and so on. Again, when I ask it, "Can you tell me about the Olympics?" just to show you this, you will find that at different times of the day the reply takes longer to come through. I don't think this is longer actually computing; I think you're just waiting for it to reply and come back. So you can see our thing has replied. We've got this pretty unhelpful assistant here that we've been chatting to.

Now what I want to do is jump in and look at the startup that is serving that model behind the scenes. This is replicate.com, and you can see that they're serving a whole bunch of different models. You can serve private models here, but they also have public APIs for models that you can try out. I think there are a number of companies doing this kind of thing. One of them was Mosaic, and they got bought. I don't know what's happened with them now, and they never seemed to actually open up their inference API for people to use. But here we've got Replicate, where you can see they're serving a bunch of the different image models and audio generation models, but the thing we're really after is the Llama 2 language models. Sure enough, we've got a number of different Llamas here. If we click in and have a look at this model, we can see that it's got an API that you can use. We can see that it's running the system prompt like we tried on the Andreessen Horowitz one. My guess is that Andreessen Horowitz is probably an investor in this company and that's why they're using this. But it has everything we want to be able to run this in the cloud and even stream our responses back.

If we jump in and have a look at the pricing, the pricing here is determined basically by what hardware you use. We know that the Llama 2 model is running on an Nvidia A100 GPU, so that is costing 0.32 cents per second, or about 19 cents per minute. The difference here is that you're not paying for endless uptime. You're only paying for when it's actually making predictions, so only when you call it, it runs your prediction, and the result is sent back. You're just paying for that time of the model. This is quite different from, say, serving a model on Hugging Face inference or on a lot of other services, where you're paying for the actual time that the server and the GPU are up.

Now, I'm not going to say that this is always going to be cheaper. I think if you're putting this into production, you'd probably be better off looking at something else and serving it yourself. But compare this to something like AWS, where people are serving these models and it's often costing them over thirty dollars an hour to have that infrastructure up and the model running. In this case we don't need to do any of that. There's not even a cold start problem of waiting for it, because we're using the public hosted version. We could host our own custom models here as well, but then we would also have to pay for the startup time for that.

Another good thing is that when you sign up you can get your API token. You don't need to put in a credit card straight away; they will give you some amount of credits to try out the service. I encourage you, even if you just want to see what Llama 2 70 billion is like and what happens if you mess with it, rather than just going to the llama2.ai website, to come along here and play with it yourself. They've got a whole docs section where you can use it with different kinds of services, and there are lots of examples. They've even got a Colab here. I'm not going to go through that one, since it seems to focus more on their image model stuff. What I'm going to do is go through a notebook using this with LangChain.

So we're now in the Colab to look at using Llama 2 with Replicate and LangChain. You get your Replicate API token and put it in here, then you just import the LLM as Replicate and set it up something like this. You basically go to Replicate and get the key for the model. You can see here this is using the Llama 13 billion, which is in one of their examples. You run it through and you get this back. You can also stream it out. If we're streaming something out, we can run through and see that the streaming comes out quite nicely and it's quite quick. That's using the 13 billion. Even with the 70 billion you will see that the streaming is pretty decent. So this is the 70 billion model, and obviously the streaming is slower, but we are getting the streaming coming through. On Colab it tends to go very wide here, but we can see we've got streaming going along nicely, and this is a Llama 2 model running at full resolution.

I've just taken the notebook from the previous video and converted that across. It doesn't require much conversion at all. You can see we've gotten rid of the pipeline from when we were running the 7B, and we've swapped it out with the LLM so that we're running the 70B. We've got the summarization and so on that we did in the previous video. You'll see here that I've got streaming coming back, and at the end it also prints it out. The streaming on top doesn't wrap, whereas this one does wrap. We've got a simple chatbot that's the same as what I did in the previous video. In future we'll have a look at using this with some tools and some other stuff as well.

You can see that it's able to go through a conversation, and I've put in the timing so we can see the wall time of roughly how long these take to predict. When I asked the one about the Olympics, it actually took 62 seconds to come back, so that's costing us around 19 cents. It's not cheap to run these models in the cloud. This is what I often think about when people give OpenAI a hard time for its cost: if you're going to run these things in production, you often find that running your own models can be very expensive. The advantage you have here, obviously, is that you can fully fine-tune this yourself and set it up the way you want, as opposed to OpenAI, where currently we can't do that. You will find that most of the responses come back in pretty decent times for a 70 billion full resolution model.

So this just shows you one advantage of using Replicate to serve this kind of model. You can actually do the fine-tuning on Replicate as well. I'm not going to look at that in this video; we just wanted to get the full 70B model up so that we can then use it for some other things. I'm also going to look at getting this going in 4-bit. I know a lot of people are really eager for the 4-bit one, so I'm trying a bunch of different 4-bit ones to work out what I think is going to be the best for that currently.

Anyway, as always, if you've got questions, please feel free to put them in the comments below. If you're interested in seeing more videos on this kind of stuff, please click and subscribe, and I will talk to you in the next video. Bye for now. Thank you.
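
The per-second pricing discussed in the transcript can be checked with a little arithmetic. The sketch below is only an illustration: the 0.32 cents per second rate and the 62 second response come from the video, current Replicate pricing may differ, and the function names are made up for this example.

```python
# Rate quoted in the video for the A100-backed Llama 2 model.
# Treat it as a historical figure; current pricing may differ.
A100_CENTS_PER_SECOND = 0.32


def prediction_cost_cents(seconds: float, rate: float = A100_CENTS_PER_SECOND) -> float:
    """Cost in cents of one prediction billed per second of run time."""
    return seconds * rate


def hourly_cost_dollars(busy_seconds_per_hour: float, rate: float = A100_CENTS_PER_SECOND) -> float:
    """Dollars spent in an hour when the model is busy for the given number of seconds."""
    return busy_seconds_per_hour * rate / 100


# One minute of prediction, the 62 second Olympics reply, and a fully busy hour.
print(f"60 s of prediction: {prediction_cost_cents(60):.2f} cents")
print(f"62 s of prediction: {prediction_cost_cents(62):.2f} cents")
print(f"3600 busy s in an hour: ${hourly_cost_dollars(3600):.2f}")
```

Because billing stops when the model is idle, the hourly figure scales with how busy the model actually is, which is the contrast the video draws with paying for a server that stays up all the time.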

Original Description

Serving LLaMA2 with Replicate
Colab: https://drp.li/SBO4S
Replicate site: https://replicate.com/ #this video is not sponsored by them

For more tutorials on using LLMs and building Agents, check out my Patreon:
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://twitter.com/Sam_Witteveen

My Links:
Linkedin: https://www.linkedin.com/in/samwitteveen/
Github:
https://github.com/samwit/langchain-tutorials (updated)
https://github.com/samwit/llm-tutorials

00:00 Intro
00:26 Play around with LLaMA 2 Chatbot
01:51 Replicate.com
02:44 Replicate LLaMA 2 70B Chatbot
03:12 Replicate Pricing
05:21 Replicate Docs
05:44 Code Time

This video shows how to serve LLaMA2 with Replicate, a cloud platform for deploying and fine-tuning LLMs. It covers getting an API token, LangChain integration, and pricing based on hardware usage. Following along, viewers can call the hosted 13 billion and 70 billion parameter LLaMA2 models from a Colab notebook. Fine-tuning on Replicate is mentioned but not demonstrated.

Key Takeaways

Get a Replicate API token and import the Replicate LLM wrapper
Set up the LLaMA2 model with LangChain
Run the 13 billion parameter LLaMA2 model
Run the 70 billion parameter LLaMA2 model
Stream the output of the LLaMA2 model
Time each response to estimate its per-second cost
Fine-tuning on Replicate is available but not covered in this video
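
The setup steps listed above might look roughly like the following. This is a hedged sketch, not the notebook from the video: the model identifier and the sampling parameter names are assumptions to be checked against the model's page on replicate.com, and the import paths are those of recent LangChain releases (older releases exposed the class as `langchain.llms.Replicate`).

```python
import os

# Assumed model identifier. Copy the exact string (possibly with a pinned
# ":version" suffix) from the model page on replicate.com.
MODEL_ID = "meta/llama-2-70b-chat"


def build_model_kwargs(temperature: float = 0.75, max_length: int = 500, top_p: float = 1.0) -> dict:
    """Sampling settings forwarded to the hosted model (parameter names are assumptions)."""
    return {"temperature": temperature, "max_length": max_length, "top_p": top_p}


def make_llm(streaming: bool = False):
    """Create a LangChain LLM backed by Replicate.

    Imports are done lazily so this file can be loaded without LangChain installed.
    """
    from langchain_community.llms import Replicate
    from langchain_core.callbacks import StreamingStdOutCallbackHandler

    # When streaming, print tokens to stdout as they arrive.
    callbacks = [StreamingStdOutCallbackHandler()] if streaming else None
    return Replicate(
        model=MODEL_ID,
        model_kwargs=build_model_kwargs(),
        streaming=streaming,
        callbacks=callbacks,
    )


if __name__ == "__main__":
    # The Replicate client reads the token from this environment variable.
    if os.environ.get("REPLICATE_API_TOKEN"):
        llm = make_llm(streaming=True)
        llm.invoke("Can you tell me about the Olympics?")
    else:
        print("Set REPLICATE_API_TOKEN to run this example.")
```

Swapping `MODEL_ID` for a 13 billion parameter model is the only change needed to compare the two sizes, which mirrors how little conversion the video says the notebook required.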

💡 Replicate bills per second of prediction rather than for server uptime and lets you host and fine-tune your own models, which makes it a convenient way to try LLaMA2 70B. The video notes that it is not cheap and may not be the best choice for production use.

Chapters (7)

0:00 Intro

0:26 Play around with LLaMA 2 Chatbot

1:51 Replicate.com

2:44 Replicate LLaMA 2 70B Chatbot

3:12 Replicate Pricing

5:21 Replicate Docs

5:44 Code Time
