LLM Routers Explained!!!

1littlecoder · Beginner ·🧠 Large Language Models ·1y ago

Key Takeaways

The video explains LLM routing: directing each query to the most suitable LLM so that cost is minimized while quality is maintained. It covers RouteLLM, an open-source framework from LMSYS that routes between models such as GPT-4 and Mixtral, and its four router types: similarity-weighted (SW) ranking, matrix factorization, a BERT classifier, and a causal LLM classifier.
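A minimal sketch of the routing idea, assuming a two-model setup and a score threshold. The `Router` class, the `keyword_score` function, and the model name strings are hypothetical stand-ins for illustration; a trained router such as the ones RouteLLM provides would replace the toy scoring function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Routes a prompt to a strong or a weak model based on a score threshold."""
    score_fn: Callable[[str], float]   # hypothetical: returns 0..1, higher = needs the strong model
    threshold: float = 0.5
    strong_model: str = "gpt-4"        # illustrative model names only
    weak_model: str = "mixtral-8x7b"

    def route(self, prompt: str) -> str:
        """Return the name of the model that should handle this prompt."""
        if self.score_fn(prompt) >= self.threshold:
            return self.strong_model
        return self.weak_model


def keyword_score(prompt: str) -> float:
    """Toy stand-in for a trained router: flags prompts that look like reasoning tasks."""
    reasoning_hints = ("prove", "derive", "step by step", "why", "debug")
    text = prompt.lower()
    hits = sum(hint in text for hint in reasoning_hints)
    return min(1.0, hits / 2)


router = Router(score_fn=keyword_score)
print(router.route("What is the capital of France?"))
print(router.route("Prove step by step that sqrt(2) is irrational."))
```

The application only ever calls `route`, so the scoring function can later be swapped for a learned classifier without touching the rest of the code.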

Full Transcript

The most efficient way to build an LLM-based application in 2024 is to use multiple LLMs, and there are multiple ways to do it. You have mixture of agents, and sometimes people just stack multiple LLMs, but one of the ways to do it is to use a router. A router, as the name suggests, is nothing but a system that can route your query to the most ideal LLM. So, for example, a prompt comes in, it goes to the router, and the router decides which LLM should handle this particular prompt. It sends it to either GPT-4 or Gemini Flash or Qwen2. For example, Qwen2 is the cheapest here, GPT-4 is the most expensive, and Gemini Flash is somewhere in the middle. So the router decides, in the most intelligent way, what this prompt is and where it should go, in such a way that it minimizes the cost while still serving the prompt in the most appropriate way.

This is one of the most popular approaches, and it is something that people have been doing privately. It's not something new. I've been consulting for some companies where a router-based approach is something they have been doing. Say they have built their own classifier, sometimes even a simple regex-based intent classifier, that can tell you what the prompt is. For example, if this prompt requires reasoning, maybe GPT-4 is the right model; if it does not require reasoning, you can send it to a lower-tier model.

But now there is a standardization happening, and there is a new open-source initiative called RouteLLM. It is coming from a very popular organization called LMSYS, which is behind one of the most popular chatbot benchmarks that we have got: the LLM benchmark, the Chatbot Arena, is from LMSYS. So LMSYS has created and open-sourced a new cost-effective LLM routing framework, and that is what they call RouteLLM. Now, when I say framework, you might think this is LlamaIndex or
LangChain. No, this is not LlamaIndex, this is not LangChain. Rather, this is a completely new model that they've created. They've trained four routers using a mix of Chatbot Arena data and data augmentation. Data augmentation is nothing but taking the existing data, adding your own data, modifying it, and transforming it. With that, they created four sorts of routers.

I hope at this point you all understand what a router is. If you are not very clear, just to clear it up: you have a query coming from the user. What would any typical LLM-based AI application do? Take this query, send it to GPT-4 with an OpenAI call, get the response back, and show it to the user. But that is not the most efficient way in terms of cost, and also in terms of fallback mechanisms. So you install a router in the middle. The router decides where this should go: to a model like GPT-4, or to a less expensive model that is probably suitable for this task, like Mixtral 8x7B. It goes to that particular model, gets the response back, and gives it to the user. The user doesn't care where this went unless you completely screwed up the response; otherwise, the router has done its job.

Now, one of the reasons people do not use a router is that a router can also increase cost, depending on the way you implement it. In this particular setup the router does not increase the cost massively, because the router is also a very small LLM. That is also why a lot of people do not use LLMs as routers in the first place. Imagine you have 100,000 tokens and you have to send them to GPT-4, let's say. You're going to spend 100,000 tokens on the model, but you're also going to send the same 100,000 tokens to the router, so ultimately you're going to be billed for 200,000 tokens. Sometimes people do not want to do that, and they just go with GPT-4. That's why it's key for you to decide which particular router or setup you want to use. But
according to RouteLLM, your cost can go down by almost 2x, and that is their main pitch. They're saying that your cost can reduce by 50%, which means that instead of spending, let's say, $100,000, you can spend $50,000, assuming that all these calls are ones you were going to send to GPT-4. Again, this may not be completely relevant for you if you are somebody who only uses a 7-billion-parameter model, or maybe you're using only a Qwen model. But the approach could be really helpful for you to decide how you want to stack these LLMs for your LLM application.

So let's begin with what they have done. They have trained four routers with the data. We know what data they used: the Chatbot Arena data plus data augmentation. Then there are the four types of routers that they trained, as you can see here. One is randomness: completely randomly, you send it. It's like you have given a button to a monkey, and the monkey is going to decide which LLM something should be sent to. That's it, completely random, unless the monkey is, let's say, strapped with Neuralink; maybe Musk would say that's an intelligent monkey. Completely random is this central line.

Then they have got four kinds of routers. The first one is a similarity-weighted (SW) ranking router that performs a weighted Elo calculation based on similarity. Elo is something that we have already seen. It's a ranking mechanism: it is how tennis players are ranked, it is how chess players are ranked, and it is how Chatbot Arena ranks the models. So it's a weighted Elo calculation: the router looks at the similarity of the prompt, calculates the Elo score, and based on the Elo score the top-ranked model gets the prompt. This is probably the simplest approach that you can take: create a ranking mechanism. In this case they have created a similarity-weighted ranking router.

The second one is quite an interesting approach for
anybody watching who has a background in recommendation engines or recommendation systems and might already connect with this: a matrix factorization model that learns a scoring function for how well a model can answer a prompt. This is almost the foundation of how Netflix would recommend a particular show or movie to you. You have certain preferences, a Netflix show has certain attributes, and it recommends to you based on this. If you want to learn more about matrix factorization, I will link this Google documentation or tutorial in the YouTube description. You can see that there are four different kinds of people and five different kinds of movies, and based on the preferences of existing people you recommend a movie to a new person. That is almost what happens in a recommendation engine. Matrix factorization is a technique that is like a very simple embedding technique: you create a representation. So what they have done is create a router that is basically a matrix factorization model that learns a scoring function for how well a model can answer a prompt, and based on the score you route it.

The third one is another approach, a deep-learning-based approach: a BERT classifier that predicts which model can provide a better response. You can use a BERT classifier, as they have done here, but if you want the simplest approach, you could build an XGBoost model that can probably tell you which router, sorry, which model it should go with.

The final one is a causal LLM classifier that also predicts which model can provide a better response. If you look at this ranking, this will be the costliest model, this will be the cheapest model, and somewhere in the middle you have the deep-learning-based model and matrix factorization. So this one is actually using an LLM as a router. As for the other techniques, this one is using a deep-learning-based model like a
BERT model, this one is not even using a deep-learning-based model but the fundamentals of one, and this one is simply using a ranking mechanism. These are the four methods that they have used to build a router.

When you look at the percentage of calls that went to GPT-4 for the BERT-based models, you can see that before augmentation the BERT-based model favored Mixtral the most, while the ranking method favored GPT-4 the most. The top one is GPT-4 and this one is Mixtral, so it's a scale between Mixtral and GPT-4, and you can see that the number of calls to GPT-4 has reduced. Like I said, if your baseline is GPT-4, then this is an excellent approach, but if you are not using GPT-4 already, if you're using a cheaper model, maybe this is not good enough for you.

There are also comparisons between RouteLLM and other commercial offerings. For example, you've got the RouteLLM percentage of calls to GPT-4, and you can see the baseline here is GPT-4 to a Llama 2 model. There is a tool called Martian; Martian sits here, and their claim is that RouteLLM is better than Martian. You have also got Unify AI, which is another router. It is also an open-source library, but I think they've got a cloud offering as well. Unify AI is somewhere here, and the causal LLM and the matrix factorization routers are somewhere here.

When you put all these things together, one important thing to understand is that you are building a router not just to save cost; you also don't want to compromise on the accuracy or the scores that it is going to provide. That is very important, because you are trying to significantly reduce the cost without compromising the quality. They report cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K as compared to using only GPT-4. Again, this is not using GPT-4 Turbo as a baseline; this is using GPT-4. But overall, I would say this is a very interesting approach. You have got the cost in log scale on the x-axis, you've got the model
performance on the y-axis, and you've got models all over the place. The ideal router is somewhere here: the cost is much lower than even GPT-4o, and it is somewhere closer to Gemini 1.5 Flash, somewhere closer to Claude 3 Haiku (Claude 3.5 would come somewhere here). On cost it's even better than GPT-3.5 Turbo, and in terms of model performance it's far above the Mistral Mixtral model and the GPT-3.5 Turbo model. Surprisingly, they didn't compare it with Qwen or DeepSeek Coder, but that is something that I would love to do.

They've also released a detailed paper explaining all their techniques, and they've kindly shared the model itself for us to use. They've partnered with Anyscale for this, and the model is also available open source on Hugging Face. If you want to use it, you can start right away. I might put together a hands-on tutorial about how to use this LLM router within your production application. Until then, I think this is an excellent application. Some people have started standardizing, and like you see here, there are already paid offerings, tools that help people optimize their cost. And like I said, it's not only about cost: you need a fallback mechanism if you want to build a robust software application on top of LLMs. This is an excellent way to reduce the cost while not compromising on quality, and while not relying on one single monopoly LLM. So this is an excellent solution. Thank you, LMSYS, for providing this. See you in another video. Happy prompting!
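The transcript's 100,000-token example can be worked through in a few lines. This is only a sketch of the arithmetic: the function names and the per-token prices are hypothetical, chosen to show why a full-size LLM used as the router roughly doubles the input bill while a small router adds little.

```python
def billed_input_tokens(prompt_tokens: int, router_is_llm: bool) -> int:
    """Input tokens billed for one request.

    If the router is itself an LLM call, it reads the full prompt once,
    and the chosen model then reads the same prompt again.
    """
    router_tokens = prompt_tokens if router_is_llm else 0
    return prompt_tokens + router_tokens


def request_cost(prompt_tokens: int, model_price: float, router_price: float = 0.0) -> float:
    """Input cost in dollars; prices are per 1,000 input tokens."""
    return prompt_tokens / 1000 * (model_price + router_price)


PROMPT_TOKENS = 100_000
STRONG_PRICE = 0.03          # hypothetical $/1K input tokens for the strong model
SMALL_ROUTER_PRICE = 0.0005  # hypothetical $/1K input tokens for a small router

# The same prompt is read twice when the router is an LLM call.
print(billed_input_tokens(PROMPT_TOKENS, router_is_llm=True))  # 200000

# Strong model used as its own router versus a small dedicated router.
print(request_cost(PROMPT_TOKENS, STRONG_PRICE, STRONG_PRICE))
print(request_cost(PROMPT_TOKENS, STRONG_PRICE, SMALL_ROUTER_PRICE))
```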

Original Description

LLM routing offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing. To tackle this, we present RouteLLM, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance. We also publicly release all our code and datasets, including a new open-source framework for serving and evaluating LLM routers.

🔗 Links 🔗
RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing
https://lmsys.org/blog/2024-07-01-routellm/

❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - https://ko-fi.com/1littlecoder

🧭 Follow me on 🧭
Twitter - https://twitter.com/1littlecoder
Linkedin - https://www.linkedin.com/in/amrrs/

This video teaches how to process queries efficiently using LLM routing, minimizing cost while maintaining quality, and gives an overview of the techniques and tools involved, including similarity-weighted (SW) ranking, matrix factorization, and deep-learning-based approaches. By the end of this video, viewers will understand how to build LLM routers, implement cost-effective LLM routing, and design LLM architectures. The video also highlights the importance of prompt-based routi

Key Takeaways
  1. Build an LLM router using a small LLM or a custom classifier
  2. Implement a similarity-weighted (SW) ranking router
  3. Use a matrix factorization model to learn a scoring function for how well a model can answer a prompt
  4. Train a deep-learning-based model, such as a BERT classifier, to predict which model can provide a better response
  5. Use a causal LLM classifier to predict which model can provide a better response
  6. Build an XGBoost model as a simpler way to select the best model
  7. Compare different LLM routers and commercial offerings
  8. Implement a fallback mechanism for a robust software application
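For the ranking-based router in takeaway 2, the underlying Elo mechanics can be sketched as below. The expected-score and update formulas are the standard Elo ones; the `weight` parameter is an assumption added here to show one simple way a per-comparison similarity weight could enter, and it is not taken from RouteLLM's actual implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0, weight: float = 1.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    weight is a hypothetical similarity weight: comparisons on prompts that
    resemble the incoming prompt would count for more.
    """
    delta = k * weight * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; the first wins one comparison.
print(elo_update(1000.0, 1000.0, score_a=1.0))              # (1016.0, 984.0)
# The same result on a less similar prompt moves the ratings half as far.
print(elo_update(1000.0, 1000.0, score_a=1.0, weight=0.5))  # (1008.0, 992.0)
```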
💡 LLM routing can reduce cost by about 50% compared to sending every query to a single large model, while maintaining quality, by directing queries to the most suitable LLM based on prompt characteristics and cost considerations.
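How much is saved depends on what share of queries still goes to the expensive model. A sketch of one way to control that share, assuming the router outputs a score per prompt and a threshold decides which prompts go to the strong model: the `calibrate_threshold` function and the sample scores are hypothetical, for illustration only.

```python
def calibrate_threshold(scores: list[float], strong_fraction: float) -> float:
    """Return a threshold so that roughly `strong_fraction` of `scores` are >= it.

    Prompts scoring at or above the threshold would go to the strong model.
    Ties at the threshold can push the actual share slightly higher.
    """
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # nothing is routed to the strong model
    return ranked[k - 1]


# Hypothetical router scores collected on a sample of past prompts.
sample_scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 0.05]

threshold = calibrate_threshold(sample_scores, strong_fraction=0.5)
to_strong = sum(score >= threshold for score in sample_scores)
print(threshold, to_strong)  # 0.5 5
```

Calibrating on a sample of real traffic, rather than picking a fixed number, keeps the strong-model share predictable when the score distribution shifts between applications.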
