Hosting Models at Scale

Outerbounds · Beginner ·📐 ML Fundamentals ·2y ago

Key Takeaways

The video discusses Metaflow hosting, which Netflix uses for internal use cases. It builds on Metaflow, Flask, and Titus to reduce the gap between data scientists and infra teams. The talk also covers OpenFaaS as the serverless framework, the GraphQL backend, and async hosting for long-running jobs.

Full Transcript

Thank you. So we have one more talk. You may have heard, through some of the previous talks, speakers referring to Metaflow hosting, which is not something in open source, but it's similar to what BentoML would do, which has been described as well. So Shashank, from the Netflix Metaflow team as well, will describe what Metaflow hosting is and how we use it.

Thank you. Hi, so I'm Shashank, and we have talked a lot about Metaflow hosting in previous talks. Aliki's talk mentioned how they leverage Metaflow hosting in order to do predictions for their DSC projects, and then we also had the Amber team, that is the media ML team, talking about how they leverage async hosting for async endpoints for computing Amber features. We are just going to talk about the infrastructure behind Metaflow hosting in this talk.

So the first question is: why do we need Metaflow hosting? When you think about machine learning at Netflix, the first thought that comes to your mind is recommendations, or predicting or finding out which movie a user would want to watch. There is indeed a lot of research and machine learning being used to do the same, but this lies directly on the consumer side of inference. Metaflow hosting, on the other hand, is used for internal use cases. It's based on a Python hosting model, that is the Flask framework, so it's definitely not meant for consumer-scale models. The places where Metaflow hosting is useful are internal UI tools, like identifying the quality of the network or performing machine translation for subtitles, stuff like that. This is where Metaflow hosting comes into play.

The other reason why we need Metaflow hosting is that there's a lot of difference between what data scientists expect the code to be and what infra teams need to provide them. There's this nice cartoon, or comic, about every cloud architecture in Big Tech. You have these cool databases, and then the mismanaged services, or the unmanageable services, and there are also your good old data lakes, which sometimes leak and lead into a data swamp. But the point is that, as a data scientist, you don't want to interact with all of this. All you care about is training a machine learning model and writing some simple code, and then when you actually deploy a model you want to do some request tracing and be able to debug your model easily. On the other hand, the infra teams care about some aspects mentioned here, such as load balancing and scaling. An infra team might also care about how we bake the Docker images, how we configure the gateway so that the endpoints are routed correctly, and what the underlying infra is for the infra team within Netflix, such as whether we are using Kubernetes or EC2, S3, and so on. So Metaflow hosting tries to reduce this gap between the infra teams and data scientists by providing a simple mechanism for data scientists to define services, which they can then use to deploy machine learning models or any arbitrary REST endpoints.

So what is Metaflow hosting? Before we go into that, I'm assuming that a lot of folks are familiar with Metaflow, but maybe some aren't. At a high level, a Metaflow flow is just a simple class which represents a DAG. Each function, or step as we call it, represents a node in this DAG, and you can use this DAG to execute code in containers, or you could also execute it locally. Metaflow, internally as well as in open source, provides a lot of features. You can schedule your flows across orchestrators: in OSS you can schedule flows in Argo, and on the other hand we have some internal orchestrators like Maestro within Netflix. For each of these steps you can also specify the resources with @titus. Titus is just an orchestration platform we have within Netflix, which is based on Amazon EC2 as well. The other important thing to note is that each step in your Metaflow flow can generate these things known as artifacts, and these artifacts can then be used in your deployed models.

The way a deployed model can make use of an artifact is shown here. This is the entirety of the code that a data scientist needs to write in order to deploy the service across multiple instances and have load balancing as well as proper request tracing and observability metrics. You have the example web service defined, which inherits from a particular class, and you get to specify the resources needed as well as the autoscaling params, like minimum or maximum instances. Then you can use this function called init app to initialize your application with whatever relevant models you need, or you could also specify if you need some particular package. Finally, we have this @endpoint decorator, where you specify the name of your endpoint, followed by accessing the artifact. This artifact could be a machine learning model that you just trained via a Metaflow flow, like the one shown in the previous slide. In this function you have this request dictionary, wherein you can get the JSON body for your HTTP request and then run your model on it to get a response. Deploying the web service is just as easy as running this command, wherein you specify the flow that you want your service to be associated with and the definition of your service.

Metaflow hosting provides a lot of features. This is what I mean by simple definition and simple deployment. There's also the notion of dependency management, where you can specify the environment name, and that basically installs the relevant pip packages or conda packages needed for your service. We support autoscaling, which is scaling according to the number of requests. There's versioning, traceability, and support for async requests.

So this is what a deployed model looks like. You have the model deployed across multiple instances at the top. When you make a single request to an endpoint, you can trace the request via Elasticsearch indexes, and we also support tracing via Edgar or Zipkin. You have logging via Radar, wherein you can see how your service is performing with respect to the status codes. Finally, you have the good old metrics, which you can use to find out the status of your service, whether it's up or not. This is actually used by us to do the autoscaling as well: if we see a lot of requests coming in, then we have a job or a service which can autoscale your service.

Moving on to implementation details. There are multiple ways of implementing a Metaflow hosting application internally, but there are pros and cons for each of them. We considered DNS redirection as one of the options, but we wanted to look at all of these options based on these four criteria. Consistency simply means that when you deploy a new version of your service, the redirection should be able to properly redirect to that latest version of the service. This is an issue with DNS redirection because it's based on a caching mechanism: if you just deployed a new service, the cache would not be updated immediately, and if the user makes a request to the latest version of the endpoint, the request might still be routed to an older version. The other criterion we are interested in is low maintenance, and that's particularly important for us because we are a small team and we do not want to maintain the infrastructure for a lot of services. So while a proxy server and OpenFaaS both support a lot of these features, OpenFaaS was the best for us because it's low maintenance: there's another team within Netflix maintaining it for us. Then there are some other features, like logging, tracing, and load balancing, which are supported by both of these.

For those unfamiliar with OpenFaaS, it is an open source framework for serverless functions, the kind of functions you get with AWS Lambda, but open source. It can be based on any backend, so you could have Kubernetes as a backend, or in our case we use Titus as a backend. Titus is also another orchestration platform within Netflix. So yeah, we chose OpenFaaS for our Metaflow hosting implementation.

The Metaflow hosting implementation consists of two main aspects: one is the control plane and one is the Metaflow client, or the container. The control plane itself is written in Go, it's highly performant, and it is responsible for creating the instances for a new service. The control plane is also the service which routes a query from the user's machine to a particular microservice or endpoint. The control plane also does other fancy stuff, like generating a Swagger for the user, and it's also responsible for maintaining the state of the world. By maintaining the state of the world, we mean that it needs to know which versions exist right now, which versions can be deployed or undeployed, and which version corresponds to the latest production version.

On the other hand, we have the Metaflow client and containers, which is the code that the data scientists actually interact with. The Metaflow client essentially downloads and packages the user's code within a Titus instance, and then it fetches the relevant artifacts, like I mentioned earlier. We then do some user-directed initialization and dependency management, that is, download the relevant packages, and finally we serve the HTTP endpoints using Flask and of-watchdog. The of-watchdog is also based on Go, in order for it to be a simple server that is highly performant.

Going over this quickly: we have the OpenFaaS gateway, which we fondly call Barkley internally, and based on this gateway we create another service called the control plane, which I mentioned in the previous slide. This control plane interacts with an RDS instance, which is a database that stores the state for all our hosting functions.

Let's go over the deployment path. When a user deploys a new service, we first go to the OpenFaaS gateway, which then makes a call to the control plane. The control plane then fetches the relevant information from the database to find out whether the service has already been deployed and we're deploying a new version, or whether we want to create a completely new service. After that, the control plane returns back to the OpenFaaS gateway, which interacts with our orchestration platform to actually start the services needed. This is what finally deploys the hosted microservice, and you can have multiple instances of the same.

Now going over the query path. If Reed Hastings wants to query a particular endpoint within our hosted application, the query first goes to the gateway, which again goes to the control plane, which uses the DB to route it to the appropriate endpoint. It goes back to the gateway again, which then actually makes the call to the microservice or the hosted front end, and finally we get the response back. So that's Metaflow hosting at a high level.

We continue developing Metaflow hosting internally. One of the features that was added recently was async hosting, wherein you can have long-running jobs. In general, Metaflow hosting supports a request for a period of 20 minutes, but with async hosting we support jobs that can run for as long as 12 hours. This is particularly useful for the media ML team, because they run inference on movies, which are pretty large or long and take a long time to execute. The reason for the 12-hour time limit is the SQS queue, which is what we used in our implementation. This is an example of how you can use polling to get the response from an async request, but you can also have a simple callback function to do whatever you want.

One of the latest features that we integrated into the Metaflow hosting pipeline is support for a GraphQL backend. Why do you want support for a GraphQL backend? At a big company like Netflix, you can have multiple microservices owned by different teams. As a client UI engineer who defines or creates front ends, you don't want to make calls to 10 different microservices in order to get the response for making a UI page. Having a federated GraphQL edge allows you to make one single query, and then this federated gateway routes the queries to the appropriate microservices, gets back the responses from each one of them, collates them, and then returns a unified response back to the client UI engineer. So in order to support a federated gateway, we recently added support for defining a GraphQL backend in Metaflow, and it's pretty seamless for a user to add this. There is literally no change in their service code. The service code remains the same, except that within their endpoint they can define the name for the GraphQL endpoint as well as the input type and output type. GraphQL is a typed language, unlike Python, so you need to define the types like here, and then we do all the associated stuff, like deploying it to the federated edge. And that's it. Any questions? Thank you.
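
The control plane's "state of the world" and its routing role can be illustrated with a small in-memory sketch. This is not the Netflix implementation (the talk says the real control plane is written in Go and keeps its state in an RDS database); the class names, method names, service name, and addresses below are all hypothetical. It only shows the idea that routing reads the current state at query time, so a newly deployed version is visible immediately, which is the consistency property the talk contrasts with cached DNS redirection.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ServiceState:
    """Hypothetical per-service record; the talk keeps this state in a database."""
    versions: Dict[int, str] = field(default_factory=dict)  # version -> backend address
    latest: Optional[int] = None


class ControlPlane:
    """Toy in-memory control plane (all names here are illustrative)."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceState] = {}

    def deploy(self, service: str, address: str) -> int:
        """Register a new version and mark it as the latest one."""
        state = self._services.setdefault(service, ServiceState())
        version = max(state.versions, default=0) + 1
        state.versions[version] = address
        state.latest = version
        return version

    def undeploy(self, service: str, version: int) -> None:
        """Remove a version; fall back to the highest remaining one."""
        state = self._services[service]
        del state.versions[version]
        if state.latest == version:
            state.latest = max(state.versions, default=None)

    def route(self, service: str, version: Optional[int] = None) -> str:
        """Resolve a query to a backend address, defaulting to the latest version."""
        state = self._services[service]
        target = state.latest if version is None else version
        if target is None or target not in state.versions:
            raise LookupError(f"no deployed version {target!r} for {service!r}")
        return state.versions[target]


# Made-up service name and addresses, for illustration only.
plane = ControlPlane()
v1 = plane.deploy("subtitle-translate", "10.0.0.1:8080")
v2 = plane.deploy("subtitle-translate", "10.0.0.2:8080")
latest_address = plane.route("subtitle-translate")
pinned_address = plane.route("subtitle-translate", version=v1)
```

A query without a version goes to the most recent deployment, while a pinned version still resolves to its own backend.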

Original Description

This spring at Netflix HQ in Los Gatos, we hosted an ML and AI mixer that brought together talks, food, drinks, and engaging discussions on the latest in machine learning, infrastructure, LLMs, and foundation models. This talk was by Shashank Srikanth, Netflix.

This video explains how Netflix hosts models for internal use cases with Metaflow hosting, a Python hosting layer built on the Flask framework. It covers the use of OpenFaaS as a serverless framework, the GraphQL backend, and async hosting for long-running jobs. By the end of this video, viewers will understand how a Metaflow hosting service is defined, deployed, and queried.
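
The fan-out-and-collate behaviour of a federated gateway, as described in the talk, can be sketched in plain Python. This is not GraphQL and not the Netflix gateway; `FederatedGateway`, the field names, and the stub resolvers are hypothetical stand-ins for separately owned microservices. It only illustrates that a client sends one query and gets back a single merged response.

```python
from typing import Any, Callable, Dict, List

# A resolver stands in for one microservice that owns one field.
Resolver = Callable[[Dict[str, Any]], Any]


class FederatedGateway:
    """Toy gateway: routes each requested field to its owner and merges results."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Resolver] = {}

    def register(self, field_name: str, resolver: Resolver) -> None:
        """Declare which service answers a given field."""
        self._resolvers[field_name] = resolver

    def query(self, fields: List[str], variables: Dict[str, Any]) -> Dict[str, Any]:
        """Call every owning service once and return one unified response."""
        missing = [name for name in fields if name not in self._resolvers]
        if missing:
            raise KeyError(f"no service registered for fields: {missing}")
        return {name: self._resolvers[name](variables) for name in fields}


# Stub services with made-up fields and values, for illustration only.
gateway = FederatedGateway()
gateway.register("title", lambda v: f"Title #{v['movie_id']}")
gateway.register("subtitleQuality", lambda v: {"movieId": v["movie_id"], "score": 0.9})
page = gateway.query(["title", "subtitleQuality"], {"movie_id": 42})
```

A real GraphQL schema would additionally declare the input and output types that the talk says each endpoint must define.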

Key Takeaways

Specify the resources needed and the autoscaling parameters (minimum and maximum instances) in the service definition
Use the init app function to initialize an application with the relevant models
Decorate an endpoint to specify the name of the endpoint and access artifacts
Deploy a web service by running a command that specifies the flow and the service definition
Create instances for a new service
Route a query from the user's machine to a particular microservice or endpoint
Generate a Swagger for the user
Maintain the state of the world by knowing which versions exist, which versions can be deployed or undeployed, and which version corresponds to the latest production version
Download and package the user's code within a Titus instance
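
The first few takeaways describe the shape of a service definition. Since Metaflow hosting is internal to Netflix and its API is not public, the following is only a self-contained imitation of that shape: `WebService`, `endpoint`, `init_app`, `handle`, and the resource and autoscaling keys are hypothetical names chosen for this sketch, and a plain function stands in for a model artifact produced by a flow.

```python
from typing import Any, Callable, Dict


def endpoint(name: str) -> Callable:
    """Hypothetical decorator: tag a method as an HTTP endpoint with this name."""
    def decorator(func: Callable) -> Callable:
        func._endpoint_name = name
        return func
    return decorator


class WebService:
    """Toy base class that collects decorated endpoints and dispatches to them."""
    resources: Dict[str, Any] = {}
    autoscaling: Dict[str, int] = {}

    def __init__(self, artifacts: Dict[str, Any]) -> None:
        self.artifacts = artifacts
        self.routes: Dict[str, Callable] = {}
        for attr in dir(self):
            member = getattr(self, attr)
            name = getattr(member, "_endpoint_name", None)
            if name is not None:
                self.routes[name] = member
        self.init_app()

    def init_app(self) -> None:
        """Hook for loading models or packages before serving."""

    def handle(self, name: str, request: Dict[str, Any]) -> Dict[str, Any]:
        """Dispatch a request dictionary to the named endpoint."""
        return self.routes[name](request)


class ExampleWebService(WebService):
    # Assumed keys; the real parameter names are not given in the talk.
    resources = {"cpu": 2, "memory_mb": 4096}
    autoscaling = {"min_instances": 1, "max_instances": 4}

    def init_app(self) -> None:
        # In the talk the artifact is a model trained by a flow; here it is a function.
        self.model = self.artifacts["model"]

    @endpoint("predict")
    def predict(self, request: Dict[str, Any]) -> Dict[str, Any]:
        features = request["json"]["features"]
        return {"prediction": self.model(features)}


service = ExampleWebService(artifacts={"model": lambda xs: sum(xs) / len(xs)})
response = service.handle("predict", {"json": {"features": [1.0, 2.0, 3.0]}})
```

In the real system the endpoints are served over HTTP with Flask; this sketch calls the handler directly so it runs without any server.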

💡 Metaflow hosting reduces the gap between data scientists and infra teams by providing a simple mechanism for data scientists to define services, and async hosting allows jobs to run for up to 12 hours, making it useful for long-running jobs such as inference on large movies.
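
The two ways of consuming an async request mentioned in the talk, polling for the result or supplying a callback, can be sketched with the standard library. This is an illustration under stated assumptions, not the Netflix implementation (which the talk says is built on an SQS queue): `AsyncEndpoint`, `poll`, and the frame-counting handler are hypothetical, and a thread pool stands in for the long-running backend.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


class AsyncEndpoint:
    """Toy async endpoint: submit a job, then poll for it or receive a callback."""

    def __init__(self, handler: Callable[[Any], Any], max_workers: int = 2) -> None:
        self._handler = handler
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, payload: Any, callback: Optional[Callable[[Any], None]] = None) -> str:
        """Start the job and return an id the caller can poll with."""
        job_id = uuid.uuid4().hex
        future = self._pool.submit(self._handler, payload)
        if callback is not None:
            # Runs when the job finishes; a failing handler would raise here.
            future.add_done_callback(lambda done: callback(done.result()))
        self._jobs[job_id] = future
        return job_id

    def status(self, job_id: str) -> str:
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id: str) -> Any:
        return self._jobs[job_id].result()

    def close(self) -> None:
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)


def poll(endpoint: AsyncEndpoint, job_id: str,
         interval_s: float = 0.01, timeout_s: float = 5.0) -> Any:
    """Check the job status at a fixed interval until it is done or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if endpoint.status(job_id) == "done":
            return endpoint.result(job_id)
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s} seconds")


callback_results = []
media_endpoint = AsyncEndpoint(handler=lambda payload: {"frames": len(payload["frames"])})
job_id = media_endpoint.submit({"frames": [0, 1, 2]}, callback=callback_results.append)
polled = poll(media_endpoint, job_id)
media_endpoint.close()  # joins the worker, so the callback has run by this point
```

Polling keeps the client in control of when it checks, while the callback pushes the result as soon as the job completes; the talk presents both as options.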
