LLM Routers Explained!!!

1littlecoder · Beginner · 🧠 Large Language Models · 1y ago

Skills: LLM Foundations 90% · Prompt Craft 80% · LLM Engineering 70% · Fine-tuning LLMs 60% · Prompt Systems Engineering 50%

Key Takeaways

The video explains LLM routing: directing each query to the most suitable LLM so that cost is minimized while quality is maintained. It covers RouteLLM, the open-source routing framework from LMSYS, which routes between a strong model such as GPT-4 and a cheaper one, and its four router techniques: similarity-weighted (SW) ranking, matrix factorization, a BERT classifier, and a causal LLM classifier.
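As a rough illustration of the routing idea summarized above, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the model names `strong-model` and `weak-model`, the `make_router` helper, and the keyword scorer, which is a crude stand-in for a trained router and not anything RouteLLM ships.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    model: str    # name of the model chosen for this prompt
    score: float  # scorer's estimate that the prompt needs the strong model


def make_router(score_fn: Callable[[str], float], threshold: float,
                strong: str = "strong-model", weak: str = "weak-model"):
    """Build a routing function from any prompt scorer (all names are illustrative)."""
    def route(prompt: str) -> Route:
        score = score_fn(prompt)
        # At or above the threshold, pay for the strong model; otherwise use the cheap one.
        return Route(strong if score >= threshold else weak, score)
    return route


# Hypothetical stand-in scorer: a crude keyword heuristic, not a trained router.
REASONING_HINTS = ("prove", "derive", "step by step", "debug")


def keyword_score(prompt: str) -> float:
    text = prompt.lower()
    hits = sum(hint in text for hint in REASONING_HINTS)
    return min(1.0, hits / 2)


route = make_router(keyword_score, threshold=0.5)
print(route("Prove step by step that the sum of two even numbers is even").model)  # strong-model
print(route("Say hello in French").model)                                          # weak-model
```

Raising `threshold` sends more traffic to the cheap model and lowers cost, at the risk of weaker answers; swapping `keyword_score` for a learned scorer is where the router techniques covered in the video would come in.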

Full Transcript

The most efficient way to build an LLM-based application in 2024 is to use multiple LLMs, and there are multiple ways to do it. You have mixture of agents, and sometimes people just stack multiple LLMs, but one of the ways is to use a router. A router, as the name suggests, is nothing but a system that routes your query to the most suitable LLM. For example, a prompt comes in, it goes to the router, and the router decides which LLM should handle this particular prompt. It sends it to either GPT-4, Gemini Flash, or Qwen2. In this example Qwen2 is the cheapest, GPT-4 is the most expensive, and Gemini Flash is somewhere in the middle. So the router decides, as intelligently as it can, what this prompt is and where it should go, so that it minimizes the cost while still serving the prompt appropriately.

This is one of the most popular approaches, and it is something people have been doing privately; it's not new. I've been consulting for some companies where a router-based approach is already in use. They have built their own in-house classifier, sometimes even a simple regex-based intent classifier, that can tell you what kind of prompt it is. For example, if the prompt requires reasoning, maybe GPT-4 is the right model; if it does not require reasoning, you can send it to a lower-tier model.

But now there is standardization happening, and there is a new open-source initiative called RouteLLM. It comes from a very popular organization called LMSYS, which runs one of the most popular chatbot benchmarks that we have; the LLM benchmark Chatbot Arena is from LMSYS. So LMSYS has created and open-sourced a new cost-effective LLM routing framework, which they call RouteLLM. Now, when I say framework, you might think this is like LlamaIndex or LangChain. No, this is not LlamaIndex and this is not LangChain; rather, these are completely new models that they've created. They've trained four routers using a mix of Chatbot Arena data and data augmentation. Data augmentation is nothing but taking the existing data, adding your own data, and modifying or transforming the data. With that they created four routers.

I hope at this point you all understand what a router is. If you are not very clear, just to clarify: you have a query coming from the user. Typically, any LLM-based AI application would take this query, send it to GPT-4 with an OpenAI call, get the response back, and show it to the user. But that is not the most efficient way in terms of cost, and also in terms of fallback mechanisms. So you install a router in the middle. The router decides where the query should go: to a model like GPT-4, or to a less expensive model that is probably suitable for this task, like Mixtral 8x7B. The query goes to that model, the response comes back, and you give it to the user. The user doesn't care where it went, unless you completely screwed up the response; otherwise the router has done its job.

Now, one of the reasons people do not use a router is that a router can also increase cost, depending on how you implement it. In this particular setup the router does not increase the cost massively, because the router is itself a very small LLM. That is also why a lot of people do not use LLMs as routers in the first place. Imagine you have 100,000 tokens and you have to send them to GPT-4. You're going to process 100,000 tokens there, but the same 100,000 tokens also go through the router, so ultimately you're going to be billed for 200,000 tokens. Sometimes people do not want to do that, and they just go with GPT-4. That's why it's key to decide which particular router or setup you want to use.

According to RouteLLM, your cost can go down by almost 2 times, and that is their main pitch. They are saying your cost can reduce by 50%, which means that instead of spending, let's say, $100,000, you can spend $50,000, assuming all those calls were going to GPT-4. Again, this may not be completely relevant for you if you only use a 7-billion-parameter model, or maybe only a Qwen model, but the approach could be really helpful for deciding how you want to stack these LLMs in your LLM application.

So let's begin with what they have done. They have trained four routers. We know the data they used: the Chatbot Arena data plus data augmentation. Then there are the types of routers, as you can see here. The baseline is random: you send the prompt completely randomly. It's like you have given a button to a monkey, and the monkey decides which LLM each prompt goes to. Completely random, unless the monkey is, let's say, strapped with Neuralink; maybe Musk would call that an intelligent monkey. Completely random is this central line.

Then they have four kinds of routers. The first one is a similarity-weighted (SW) ranking router that performs a weighted Elo calculation based on similarity. Elo is something we have already seen; it's a ranking mechanism. It is how tennis players are ranked, how chess players are ranked, and how Chatbot Arena ranks models. So it's a weighted Elo calculation: the router looks at the similarity of the prompt, calculates the Elo scores, and based on the Elo score the top-ranked model gets the prompt. This is probably the simplest approach you can take: create a ranking mechanism. In this case they have created a similarity-weighted ranking router.

The second one is quite an interesting approach. Anybody watching who has a background in recommendation engines or recommendation systems might already connect with this: a matrix factorization model that learns a scoring function for how well a model can answer a prompt. This is almost the foundation of how Netflix would recommend a particular show or movie to you. You have certain preferences, a Netflix show has certain attributes, and it recommends based on these. If you want to learn more about matrix factorization, I will link this Google documentation or tutorial in the YouTube description. You can see that there are four different kinds of people and five different kinds of movies, and based on the preferences of existing people you recommend a movie to a new person. That is almost what happens in a recommendation engine. Matrix factorization is like a very simple embedding technique: you create a representation. So what they have done is create a router that is basically a matrix factorization model that learns a scoring function for how well a model can answer a prompt, and based on the score you route it.

The third one is a deep learning based approach: a BERT classifier that predicts which model can provide a better response. You can use a BERT classifier as they have done here, but if you want the simplest approach, you could build an XGBoost model that tells you which model the prompt should go to. The final one is a causal LLM classifier that also predicts which model can provide a better response.

Now if you look at this list, one end will be the costliest and the other the cheapest, with the deep learning based model and matrix factorization somewhere in the middle. The last one actually uses an LLM as the router, the BERT one uses a deep learning based model, matrix factorization does not even use a deep learning model, only the fundamentals of one, and the first simply uses a ranking mechanism. So these are the four methods they have used to build a router.

When you look at the percentage of calls that went to GPT-4, before augmentation the BERT-based router favored Mixtral most, while the ranking method favored GPT-4 most. You can see the top one is GPT-4 and this one is Mixtral, so it's a scale between Mixtral and GPT-4, and you can see that the number of calls to GPT-4 has reduced. Like I said, if your baseline is GPT-4, then this is an excellent approach, but if you are not already using GPT-4, if you're using a cheaper model, maybe this is not good enough for you.

There are also comparisons between RouteLLM and commercial offerings. For example, you've got the percentage of calls to GPT-4 for RouteLLM, and you can see the baseline here runs from GPT-4 to a Llama 2 model. There is a tool called Martian; Martian sits here, and their claim is that RouteLLM is better than Martian. You have Unify AI, which is another router. It is also an open-source library, but I think they also have a cloud offering. Unify AI is somewhere here, while the causal LLM and matrix factorization routers are somewhere here.

When you put all these things together, one important thing to understand is that you are building a router not just to save cost; you also don't want to compromise on the accuracy or the scores it provides. That is very important, because you are trying to significantly reduce the cost without compromising the quality. They report cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K compared to using only GPT-4. Again, this is not using GPT-4 Turbo as a baseline; this is using GPT-4. But overall I would say this is a very interesting approach.

You've got the cost in log scale on the x-axis and the model performance on the y-axis, and you've got models all over the place. The ideal router is somewhere here: the cost is much lower than even GPT-4o, somewhere closer to Gemini 1.5 Flash and Claude 3 Haiku (Claude 3.5 would come somewhere here), and even better than GPT-3.5 Turbo. But in terms of model performance it's far above the Mistral Mixtral model and the GPT-3.5 Turbo model. Surprisingly, they didn't compare it with Qwen or DeepSeek Coder, which is something I would love to do.

They've also released a detailed paper explaining all their techniques, and they've kindly shared the models themselves for us to use. They've partnered with Anyscale for this, and the models are available open source on Hugging Face. If you want to use them, you can start right away. I might put together a hands-on tutorial on how to use this LLM router within your production application.

Until then, I think this is an excellent application. Some people have started standardizing, and as you see here, there are already paid offerings that help people optimize their cost. Like I said, it's not only about cost: you need a fallback mechanism if you want to build a robust software application on top of LLMs. This is an excellent way to reduce the cost while not compromising on quality and not relying on one single monopoly LLM. So this is an excellent solution. Thank you, LMSYS, for providing this. See you in another video. Happy prompting!
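The similarity-weighted ranking idea described in the transcript can be sketched roughly as follows. This is a toy illustration and not RouteLLM's implementation: the bag-of-words `embed` function, the `BATTLES` sample data, and the model names `strong` and `weak` are all made up for the example. Each past comparison updates the Elo ratings with a step scaled by how similar its prompt is to the new one, and the top-rated model gets the prompt, as the video describes.

```python
import math
from collections import Counter


def embed(text):
    # Toy embedding: bag-of-words counts (an assumption; a real router would use a learned embedding).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def weighted_elo(prompt, battles, k=32.0, base=1000.0):
    """Elo ratings where each past battle counts in proportion to its similarity to `prompt`."""
    query = embed(prompt)
    ratings = {}
    for past_prompt, model_a, model_b, winner in battles:
        weight = cosine(query, embed(past_prompt))
        ra = ratings.setdefault(model_a, base)
        rb = ratings.setdefault(model_b, base)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = 1.0 if winner == model_a else 0.0
        delta = k * weight * (actual_a - expected_a)  # standard Elo step, scaled by similarity
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


# Hypothetical preference data: (prompt, model_a, model_b, winner).
BATTLES = [
    ("prove the theorem about prime numbers", "strong", "weak", "strong"),
    ("prove this inequality holds", "strong", "weak", "strong"),
    ("translate hello to french", "strong", "weak", "weak"),
    ("translate good morning to spanish", "strong", "weak", "weak"),
]


def pick_model(prompt, battles=BATTLES):
    ratings = weighted_elo(prompt, battles)
    return max(ratings, key=ratings.get)


print(pick_model("prove the lemma"))            # strong
print(pick_model("translate hello to german"))  # weak
```

In this toy data the cheap model "wins" translation prompts, which stands in for cases where its answer was judged good enough. A prompt with no overlap with any past battle leaves all ratings equal, so a real system would need an explicit default for that case.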

Original Description

LLM routing offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing. To tackle this, we present RouteLLM, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance. We also publicly release all our code and datasets, including a new open-source framework for serving and evaluating LLM routers.

🔗 Links 🔗
RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing https://lmsys.org/blog/2024-07-01-routellm/

❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - https://ko-fi.com/1littlecoder

🧭 Follow me on 🧭
Twitter - https://twitter.com/1littlecoder
Linkedin - https://www.linkedin.com/in/amrrs/

This video teaches how to process queries efficiently using LLM routing, minimizing cost while maintaining quality, and gives an overview of the techniques and tools involved, including similarity-weighted (SW) ranking, matrix factorization, and deep learning-based approaches. By the end of this video, viewers will understand how to build LLM routers, implement cost-effective LLM routing, and design LLM architectures. The video also highlights the importance of prompt-based routi

Key Takeaways

Build an LLM router using a small LLM or a custom classifier
Implement a similarity-weighted (SW) ranking router
Use a matrix factorization model to learn a scoring function for how well a model can answer a prompt
Train a deep learning-based model to predict which model can provide a better response
Use a causal LLM classifier to predict which model can provide a better response
Build an XGBoost model to select the best model
Compare different LLM routers and commercial offerings
Implement a fallback mechanism for a robust software application
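To make the "scoring function" takeaway a little more concrete, here is a small sketch of a bilinear score of the kind a matrix factorization router learns: each model gets a vector, and its score for a prompt is the dot product with the prompt's vector. This is a simplification under stated assumptions, not RouteLLM's model: the prompt side uses three hand-written features from a hypothetical `featurize` function instead of learned embeddings, the preference pairs in `PAIRS` are invented, and only the model vectors are trained, with a pairwise logistic loss.

```python
import math
import random

MATH_WORDS = ("prove", "solve", "integral")  # hypothetical feature vocabulary


def featurize(prompt):
    # Hand-written prompt features (an assumption): bias, has-math-word, relative length.
    words = prompt.lower().split()
    has_math = float(any(w in MATH_WORDS for w in words))
    return [1.0, has_math, min(len(words) / 20.0, 1.0)]


def score(model_vec, features):
    # Bilinear score: dot product of the model vector and the prompt features.
    return sum(m * f for m, f in zip(model_vec, features))


def train(pairs, models, epochs=200, lr=0.5, seed=0):
    """Learn one vector per model from (prompt, winner, loser) preference pairs."""
    rng = random.Random(seed)
    vecs = {m: [rng.uniform(-0.01, 0.01) for _ in range(3)] for m in models}
    for _ in range(epochs):
        for prompt, winner, loser in pairs:
            x = featurize(prompt)
            margin = score(vecs[winner], x) - score(vecs[loser], x)
            grad = 1.0 - 1.0 / (1.0 + math.exp(-margin))  # gradient of log-sigmoid(margin)
            for i, xi in enumerate(x):
                vecs[winner][i] += lr * grad * xi
                vecs[loser][i] -= lr * grad * xi
    return vecs


# Invented preference data: which model's answer was preferred for each prompt.
PAIRS = [
    ("prove this theorem", "strong", "weak"),
    ("solve the integral", "strong", "weak"),
    ("say hi", "weak", "strong"),
    ("what is the capital of france", "weak", "strong"),
]
VECS = train(PAIRS, models=("strong", "weak"))


def best_model(prompt, vecs=VECS):
    x = featurize(prompt)
    return max(vecs, key=lambda m: score(vecs[m], x))


print(best_model("prove the lemma"))
print(best_model("say hello"))
```

The same preference pairs could just as well feed a classifier such as BERT or XGBoost; the difference in this sketch is only that the learned object is a per-model vector scored against the prompt.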

💡 LLM routing can reduce cost by 50% compared to using a single large model, while maintaining quality, by directing queries to the most suitable LLM based on prompt characteristics and cost considerations.
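The cost claim can be checked with simple arithmetic. The sketch below uses made-up prices (30 and 0.5 dollars per million tokens) and a hypothetical `blended_cost` helper; it also reproduces the transcript's warning that running every token through an expensive router doubles the bill.

```python
def blended_cost(total_tokens, strong_fraction, strong_price, weak_price, router_price=0.0):
    """Dollar cost of a workload; all prices are per 1M tokens (hypothetical numbers)."""
    per_million = total_tokens / 1_000_000
    model_cost = per_million * (strong_fraction * strong_price
                                + (1 - strong_fraction) * weak_price)
    router_cost = per_million * router_price  # every token also passes through the router
    return model_cost + router_cost


TOKENS = 10_000_000
STRONG, WEAK = 30.0, 0.5  # assumed prices, not real quotes

baseline = blended_cost(TOKENS, 1.0, STRONG, WEAK)                       # everything to the strong model
routed = blended_cost(TOKENS, 0.5, STRONG, WEAK)                         # half the traffic goes to the weak model
double = blended_cost(TOKENS, 1.0, STRONG, WEAK, router_price=STRONG)    # router as costly as the strong model

print(baseline, routed, double)               # 300.0 152.5 600.0
print(f"savings: {1 - routed / baseline:.0%}")  # savings: 49%
```

With these assumed prices, sending half the traffic to the cheap model cuts the bill roughly in half, while a router priced like the strong model doubles it, which is why a small, cheap router matters.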
