Harnessing the Power of LLMs Locally: Mithun Hunsur

AI Engineer · Intermediate ·🧠 Large Language Models ·2y ago


Key Takeaways

The talk covers running LLMs locally with the Rust library llm (llm.rs), which lets developers run large language models on standard hardware. It explores the benefits and limitations of local models, including fine-tuning, quantization, and knowledge retrieval. It also touches on related tools such as llama.cpp, and walks through sample code for loading models, creating sessions, and passing prompts for inference.
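As a rough illustration of the quantization idea mentioned above, here is a minimal Rust sketch of block-wise symmetric 8-bit quantization. It is a simplified stand-in and not the actual formats used by llama.cpp or llm; the names `QuantBlock`, `quantize`, and `dequantize` are invented for this example.

```rust
/// One block of quantized weights: a shared scale plus one i8 per weight.
/// (Hypothetical type for illustration; real model formats differ in detail.)
#[derive(Debug, Clone, PartialEq)]
pub struct QuantBlock {
    pub scale: f32,
    pub values: Vec<i8>,
}

/// Quantize `weights` in blocks of `block_size` using symmetric 8-bit rounding.
pub fn quantize(weights: &[f32], block_size: usize) -> Vec<QuantBlock> {
    assert!(block_size > 0, "block_size must be non-zero");
    weights
        .chunks(block_size)
        .map(|chunk| {
            // The largest magnitude in the block determines the scale.
            let max_abs = chunk.iter().fold(0.0f32, |m, w| m.max(w.abs()));
            // Avoid dividing by zero for an all-zero block.
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let values = chunk
                .iter()
                .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
                .collect();
            QuantBlock { scale, values }
        })
        .collect()
}

/// Reconstruct approximate f32 weights from the quantized blocks.
pub fn dequantize(blocks: &[QuantBlock]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.values.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

With a block size of 32, the stored data is one byte per weight plus a four-byte scale per block (36 bytes) instead of 128 bytes of f32 values, roughly a 3.5x reduction, at the price of a small rounding error per weight.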

Full Transcript

[Music] Right, good day everyone, good to see you all. Today I'm here to tell you how to harness the power of local LLMs using our Rust library.

Quick intro: I'm Mithun, as you just heard, but I go by Philpax online. I hail from Australia, hence the accent, but I live in Sweden. I do a lot of things with computers, but my day job is at Ambient, where I build a game engine of the future. Today, though, I'm here to talk to you about llm.rs, a Rust library that I maintain.

So, llm.rs, or llm between friends (I realized that needed disambiguating when I signed up for Simon's newsletter). It's an all-in-one solution for local inference of LLMs. But what does that actually mean? Well, most of the models we've discussed at this conference have been cloud models: your ChatGPTs, your Claudes, your Bards. Local models offer another way, where you own the model and it runs on your computer. So let's quickly go over what that actually means.

First up, size. Model size can be used as a rough proxy for the intelligence of the model. Most cloud models are really, really big; you can see them dominating the right-hand side of the chart there. You have your GPT-3, your GPT-4 (we'll get back to that), your PaLM 2. These are all insanely big in comparison to the open-source models we have. We're beginning to see some bigger models thanks to LLaMA and Falcon, but even they pale in comparison to what the bigger players can do. This means local models don't have the same capacity for intelligence. However, a smaller, more focused model may be able to solve problems better than a large general model. By the way, we don't actually know what size GPT-4 is. That's rumors; only OpenAI knows.

Next, let's talk about speed and capacity. Cloud models run on specialized hardware with special configuration. Local models run on whatever hardware you can scrounge up, including rented hardware. The further up the axis you go, the more speed and/or parallel inference you can do, but the more inaccessible it becomes: this end, a few hundred; that end, a few
hundred million.

Next up, latency. Cloud models need the full prompt before they can start inference, and you have to wait for the message to go back and forth. Local models can give you a response immediately, and you can feed the prompt as you go along. This is very important for conversations, where you want the model to be able to process what you're saying as you say it.

And of course, you can't escape talking about cost. The cloud vendors will charge you a per-token price. When running locally, it's entirely up to you how much it costs you to run the machine. If the running cost of your model is less than the cost of running your workload through the cloud, you're going to make a profit. And if you're running on a machine you already own, well, that's basically free, right?

With the cloud, you have to use the models they offer you. Some vendors offer fine-tuning, but they often charge more than for just using the regular model, and they often charge you for the process of actually fine-tuning. This means it's not often cost-effective to actually do that. With local models, the sky's the limit. There are hundreds, potentially thousands, of custom models that can suit any need you have: knowledge retrieval, storytelling, conversation, tool use, you name it. Someone's probably already done it, and if they haven't, fine-tuning an existing model for your own use is easy enough. Special shout-out to Axolotl over there, which makes it easy to fine-tune models of any architecture.

And of course, privacy. There are some questions you don't want to ask the internet. Local models let you privately embarrass yourself.

Now, you might be wondering how it's actually possible to run these models locally. That, my friends, is possible with the power of quantization. If each model is billions of parameters, and those parameters are like individual numbers, how could you possibly run them on consumer hardware, when there's only so much memory available for a given performance level?
Well, we can use quantization. Quantization lets you lossily compress a model while maintaining the majority of its smarts. We can take the original model, here in blue, and squish it down to something much smaller using one of these green formats. This is the secret sauce that makes it viable to run models locally. Small models aren't just easier to store; they can also run faster, as your computer can process more of the model at any given moment. But that's enough about local models. You've probably heard much of that already. Let's talk about the actual library.

It all started with this man, who built something you may have heard of. Of course, I'm referring to llama.cpp, and that's what it looked like on day one. Look at the mere 98 stars; how pedestrian. Compare that to today, where it's at 42,000 stars. But let's go back to March, when I first saw it. When I saw it, I had but one idea: it's time to rewrite it in Rust, both for the meme and because I wanted to use it for other things. Well, I said I wanted to do it, and I did. But, to be brief here, setzer22 was also working on the same problem, and, well, there was just one catch: he beat me to it. Completely beat me to it; I'm not afraid to admit it. Luckily, we came together, merged our projects, and I ended up as a maintainer of the resulting project. And that's how llm was born.

So you might be wondering: if llama.cpp exists, why use llm.rs? Well, with llm.rs I had six principles in mind.

It must be a library. When I first started in March, llama.cpp was not a library; it was an application, and that made it impossible to reuse. It must not be coupled to an application: you must be able to customize its behavior. You must be able to go in and change every little bit of it to make it work for your application, and we shouldn't make any assumptions about how it's going to be used. It should support a multitude of model architectures. Of course, llama.cpp supports LLaMA and now Falcon, but clearly there are more out there. Next up, it should be Rust-native. It should feel like using a Rust library; it shouldn't feel like using a library with bindings, and it should work how you expect a Rust library to work. Next up, backends. It should support all possible kinds of backends: you can run on your CPU, your GPU, or of course your ML-powered toaster. I'm sure that's going to be a thing; we're going to see it coming, I swear. And finally, platforms. It should work the same whether it's on Windows, Linux, macOS, or something else. You shouldn't have to change it significantly to make it work, because deployment has always been an issue.

Today, I'm proud to say we support a myriad of architectures, including the darlings of the movement, LLaMA and Falcon. These architectures all use the same interface, so you don't have to worry about changing your code to use a different model. This is made possible by the concerted efforts of co-contributors Lucas and Dan, who I couldn't have done this without, as well as many others.

Here's some sample code for the library. I won't go too much into it because it's quite dense, but the idea is that you load a model, right there at the top; you can see it's actually quite small. With that model, you create sessions, which are an ongoing use of the model. You can have as many of these as you would like, but they do have a memory cost, so you want to be
careful. Once you have a session, you can pass a prompt in and infer with the model to determine what comes next. You can keep reusing the same session, which is very useful for conversation: you don't need to keep re-feeding the context. The last argument of the function call is the callback; that's where you actually get the tokens out. It's worth noting that the function itself is actually a helper. All it does is call the model in a loop with some boundary conditions, so if you want to change the logic in some significant way, you can. We're not going to stop you from doing that. One last thing about this, though: you see all the calls to default there? Those are all customization points. You can change pretty much anything about this: how the model is loaded, how it does the inference, how it samples. The entire point is that you have the control you need to make the thing you need work.

Here's a quick demo of the library working with LLaMA 7B on my MacBook's CPU. It's reasonably fast, but it could be faster, right? Well, thanks to the power of GPU acceleration, we have something that's much more usable. And believe me, it's even faster on NVIDIA GPUs. AMD and Intel support pending.

Now let's talk about what you can actually do with the library. Let's start with three community projects. First, we've got local.ai. local.ai is a simple app that you can install to do inference locally. There's nothing magical about it; it's just exactly what it says. I think that's really wonderful, because it means anyone can download this app and be able to use local models without thinking about it. Next up, llm-chain. It's LangChain, but for Rust, and of course it supports inference with our library. And finally, we have Floneum, which is a flowchart-based application where you can build your own workflow. I think we've seen a few of those at this conference. You can combine and create nodes to build the
workflow you need, and of course it supports the library as an inference engine.

Now, I wouldn't be a very good library author if I didn't actually test my own library, so I'm going to go through three applications. The first two are proofs of concept.

The first is llmcord. It's a Discord bot, and you can see it's exactly what you'd expect: you give it a prompt, it'll give you a response. Any hitches you see come from Discord's limits, not from the actual inferencing itself. You can see, bam, all there. When a user issues a request for generation, it goes through this process here: the request goes through a generation thread with a channel, that channel is then used to create a response task, and that response task is responsible for sending the responses to the user. Now, the interesting thing is that these sessions are created and thrown away immediately with each query, but you don't need to do that. If you keep them around, you can actually use them for conversation. And just to illustrate, this is just like the request-response workflow you would use for anything. If I take what I had there, drop the Discord bit, and add in HTTP, you can see: request, generation, response. Easy.

Next up, Alpa. I love using GitHub Copilot, but it's only available in my code editor, and it requires an internet connection. Alpa is my attempt to solve this. It is autocomplete anywhere in your system, just by taking what's left of your cursor and passing it to a model to type out. And of course, you can use any model, including a model fine-tuned on your own writing. Ask me how I know. Alpa is also quite simple. In fact, it's so simple I don't really need to cover it: listen for input, copy the input into a prompt, start generating, type out the response. Easy.

Now, the first two examples were pretty simple; they're proofs of concept. But now I want to talk about an actual use case. This is a real-world data extraction task. Over the last few years, I've been working on a project to make a timeline from the
dates on Wikipedia, because there are millions of pages, they all have dates, and you can build a world history from them. However, these dates are often unstructured and more or less impossible to parse with traditional means. Yes, you can try using regex to extract the dates, but you can't get the context out in any meaningful sense, and there are some dates here that just don't make any sense at all. So that's why, as is the theme of this conference, I threw a large language model at it. However, GPT-3 and 4 aren't perfect, even after rounds of prompt engineering (you can see I tried here), and handling millions of dates is just too expensive and slow.

So I decided I'd fine-tune my own model. I generated a representative dataset using GPT-3 and built a tool to go through the dataset, pick out any data point, fix it up, correct the errors, build a new dataset, and train a new model. I did that using Axolotl, which I mentioned earlier. Again, check out Axolotl for all your fine-tuning needs; highly recommended. I now have a small, fast, and consistent model that I can pass any date to and get back a structured representation, which I can of course immediately parse using Rust. And I can treat that as a black box, so I have a function there, fn parse: pass some dates, get some dates back. Simple.

Now let's quickly talk about the benefits of using local models and the library. First off, deployment. Show of hands: who's had to deal with Python deployment hell? Dependency hell, even? Yeah, yeah, I know, it's awful. You spend hours just trying to sort out your conda, your pip, your pipenv. With the library, you inherit Rust's excellent cross-platform support and build system, making it easy to ship self-contained binaries to your platform. No more making your users install Torch. As you might imagine, this unlocks desktop applications with models.

Next up, the ecosystem. Rust has one of the strongest ecosystems of any native language. You can combine these
libraries with LLMs to build all kinds of things. It's let me build a Discord bot, a systemwide completion utility, a data ingest pipeline, and a dataset explorer utility, all in the same language, and I think if you use llm.rs you can do the same thing with your task as well.

Of course, you also have control over how the model generates. I alluded to this earlier, but you can choose exactly how it samples tokens. Normally, when you use a cloud model, you have to get back the logits, the probabilities, but those probabilities are limited, and you have to keep going back and forth, which is slow and expensive. With this, you can directly control what you are sampling.

Finally, let's talk about the innovation in the space. If you're here, you probably know there's a paper almost every single day. It's impossible to keep up with; trust me, I've tried. But the use of local models means you can try these out before anyone else can. You can go through some of these papers and be like, oh wow, that's actually a worthwhile improvement. Eventually the cloud providers will provide them, but in the meantime, the control remains with you.

However, it's time to talk about the problems. There ain't no such thing as a free lunch, except if you're at a conference, of course.

Let's talk about hardware again. I mentioned earlier that you can run these things on pretty much any hardware, but that's kind of a lie. You still need some kind of power. You can only get so much out of your 10-year-old computer, your smartphone, or your Raspberry Pi. We're finding clever ways to improve this, like smaller models and better inferencing, but it's still something to be aware of.

Next, as with all things, the fast, cheap, good triangle applies. You can make all kinds of trade-offs here, and you see I've listed a couple of them, but fundamentally you have to choose: what are you willing to sacrifice in order to serve your application? Are you willing to go for a bigger model to get better quality
results at the cost of speed? These are all decisions you have to make, and they're not always obvious. It's something you have to think about.

Next, and there's no other way of putting this, the ecosystem churns. Innovation is a double-edged sword: when those changes come in, they can often break your existing workflows. I've helped alleviate this to some extent with the GGUF file format, which helps standardize things, but it's still a problem. Some days you will just wake up, try your application with a new model, and it just won't work. There's nothing you can do except deal with it.

Finally, a lot of the models in this space are open source. They're free to use personally, but they have very strange clauses and exceptions. For most of us this doesn't matter, as you can just use the model personally, but it's a reminder that even though these models are free, they're not capital-F Free. Luckily, there's been some recent change in the space, with Mistral and StableLM giving you strong performance at a small size while being completely unburdened. But it's still a problem, and they're still much smaller than the big ones like LLaMA and Falcon.

Unfortunately, I've got to wrap things up here; there's only so much you can talk about in 18 minutes, I'm afraid. Local models are great, and I'd like to think our library is too. They're getting easier to run day by day, with smaller, more powerful models. However, the situation isn't perfect, and there isn't always one obvious solution for your problem.

Thanks for listening. You can contact me by email or on Mastodon. The library can be found at, you guessed it, llm.rs, or by scanning the QR code. Finally, we're always looking for contributors. If you're interested in LLMs or Rust, feel free to reach out. Sponsorships are also very welcome, because they help me try out new hardware, which is always necessary. And if you want to chat in person, I'll be hanging around the conference. See you later. [Music] [Applause]
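The request, generation thread, channel, and response task flow described for the Discord bot can be sketched with the Rust standard library alone. This is only an assumed shape, not the bot's actual code: `Request`, `spawn_generation_thread`, and `generate` are hypothetical names, and the "inference" step is a placeholder that upper-cases the prompt's words instead of calling a model.

```rust
use std::sync::mpsc;
use std::thread;

/// A generation request: a prompt plus a channel for streaming tokens back.
/// (Hypothetical type for illustration.)
pub struct Request {
    pub prompt: String,
    pub reply: mpsc::Sender<String>,
}

/// Spawn the generation thread. Returns the sender used to submit requests
/// and the thread's join handle.
pub fn spawn_generation_thread() -> (mpsc::Sender<Request>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        // The loop ends once every request sender has been dropped.
        for request in rx {
            // Placeholder for inference: emit each word, upper-cased, as a "token".
            for word in request.prompt.split_whitespace() {
                if request.reply.send(word.to_uppercase()).is_err() {
                    break; // The response side went away; stop generating.
                }
            }
            // `request.reply` is dropped here, which closes the token stream.
        }
    });
    (tx, handle)
}

/// The response side: submit a prompt and collect the streamed tokens.
pub fn generate(requests: &mpsc::Sender<Request>, prompt: &str) -> Vec<String> {
    let (reply, tokens) = mpsc::channel();
    requests
        .send(Request { prompt: prompt.to_string(), reply })
        .expect("generation thread has stopped");
    // Blocks until the generation thread drops its sender.
    tokens.iter().collect()
}
```

Keeping all model work on one thread and handing each request its own reply channel means the front end (Discord, HTTP, or anything else) only ever deals with a stream of tokens, which is why swapping the transport leaves the rest unchanged.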

Original Description

Discover llm, a revolutionary Rust library that enables developers to harness the potential of large language models (LLMs) locally. By seamlessly integrating with the Rust ecosystem, llm empowers developers to leverage LLMs on standard hardware, reducing the need for cloud-based APIs and services. In this talk, I'll explore llm's key features, including its high-speed inference, support for popular LLM architectures, and its lightweight design. Through practical examples, I'll showcase how llm can be applied in content generation, code completion, and language understanding tasks. Additionally, I'll discuss the challenges of deploying and maintaining LLMs locally, along with best practices and real-world experiences from early adopters.

Recorded live in San Francisco at the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Mithun Hunsur

Mithun is a seasoned polyglot programmer and engineer with a passion for exploring the depths of computer science. With experience spanning from the low-level to the highest reaches of software development, Mithun has worked on a diverse range of projects across various industries. During the day, he works on Ambient, an open-source runtime/engine for building high-performance multiplayer games and 3D applications. In his free time, he's a tinkerer at heart, diving into the world of game reverse engineering and modification, low-level and embedded programming, virtual and augmented reality, compiler and language hacking, human-computer interface research, and computer architecture and design. Beyond his work in the tech industry, Mithun also has a creative side, dabbling in photography, writing, and AI art. He brings a unique perspective to his work, combining his passion for technology with his artistic sensibilities to build projects that are both innovative and visuall

This video teaches developers how to harness the power of LLMs locally using the Rust library llm (llm.rs), and explores the benefits and limitations of local models. It covers various tools and libraries, and walks through sample code for loading models, creating sessions, and passing prompts for inference. It also describes how the speaker fine-tuned a small model for a real-world data extraction task.

Key Takeaways

Load a model using llm (llm.rs)
Create sessions for ongoing use of the model
Pass prompts to the model for inference
Reuse sessions for conversation
Use GPU acceleration for faster inference
Customize model loading and inference
Generate a representative dataset using GPT-3
Build a tool to correct errors and train a new model using Axolotl
Deploy a small, fast, and consistent model using the library
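The load, session, prompt, and callback steps listed above can be pictured with a toy stand-in. Everything in this sketch (`ToyModel`, `Session`, `Feedback`, `infer`) is invented for illustration and is not the llm crate's API; the "model" simply cycles through a fixed vocabulary so that the control flow can be run without any weights.

```rust
/// Hypothetical stand-in for a loaded model; a real one would hold weights.
pub struct ToyModel {
    vocab: Vec<String>,
}

/// Per-conversation state. Each session keeps its own context, which is why
/// holding many sessions costs memory.
pub struct Session {
    context: Vec<String>,
}

/// What the callback tells the inference loop to do next.
pub enum Feedback {
    Continue,
    Halt,
}

impl ToyModel {
    pub fn new(vocab: &[&str]) -> Self {
        assert!(!vocab.is_empty(), "vocabulary must not be empty");
        Self { vocab: vocab.iter().map(|s| s.to_string()).collect() }
    }

    pub fn start_session(&self) -> Session {
        Session { context: Vec::new() }
    }

    /// Toy "prediction": choose a word from the current context length.
    fn next_token(&self, context: &[String]) -> String {
        self.vocab[context.len() % self.vocab.len()].clone()
    }
}

impl Session {
    /// Feed the prompt into the context, then generate up to `max_tokens`
    /// tokens, handing each one to `callback` as it is produced.
    pub fn infer<F>(&mut self, model: &ToyModel, prompt: &str, max_tokens: usize, mut callback: F)
    where
        F: FnMut(&str) -> Feedback,
    {
        self.context.extend(prompt.split_whitespace().map(String::from));
        for _ in 0..max_tokens {
            let token = model.next_token(&self.context);
            self.context.push(token.clone());
            if let Feedback::Halt = callback(&token) {
                break;
            }
        }
    }

    /// Number of tokens currently held in this session's context.
    pub fn context_len(&self) -> usize {
        self.context.len()
    }
}
```

Calling `infer` a second time on the same session continues from the accumulated context, which is the property that makes session reuse suitable for conversation; the helper is just a loop with boundary conditions, so custom logic can replace it.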

💡 Running LLMs locally with the Rust library llm (llm.rs) gives developers more control over their models, reduces the need for cloud-based APIs and services, and can be more cost-effective when the cost of running the machine is lower than the cloud's per-token price.
