Hosting Models at Scale

Outerbounds · Beginner ·📐 ML Fundamentals ·2y ago

Key Takeaways

The video discusses Metaflow hosting, which Netflix uses for internal use cases. It builds on Metaflow, Flask, and Titus to reduce the gap between data scientists and infra teams. It also covers OpenFaaS as the serverless framework, the GraphQL backend, and async hosting for long-running jobs.

Full Transcript

Thank you. So we have one more talk. You may have heard, through some of the previous talks, speakers referring to Metaflow hosting, which is not something in open source, but it's similar to what BentoML would do, which has been described as well. So Shashank, from the Netflix Metaflow team as well, will describe what Metaflow hosting is and how we use it. Thank you.

Hi, I'm Shashank. We have talked a lot about Metaflow hosting in previous talks. Aliki's talk mentioned how they leverage Metaflow hosting to do predictions for their DSC projects, and we also had the Amber team, that is, the media ML team, talking about how they leverage async hosting for async endpoints for computing Amber features. In this talk we are just going to talk about the infrastructure behind Metaflow hosting.

So the first question is: why do we need Metaflow hosting? When you think about machine learning at Netflix, the first thought that comes to mind is recommendations, or predicting which movie a user would want to watch. There is indeed a lot of research and machine learning being used for that, but it lies directly on the consumer side of inference. Metaflow hosting, on the other hand, is used for internal use cases. It's based on a Python hosting model, the Flask framework, so it's definitely not meant for consumer-scale models. The places where Metaflow hosting is useful are internal UI tools, like identifying the quality of the network or performing machine translation for subtitles, stuff like that. This is where Metaflow hosting comes into play.

The other reason why we need Metaflow hosting is that there's a big difference between what data scientists expect the code to be and what infra teams need to provide them. There's this nice cartoon, or comic, about every cloud architecture in big tech: you have these cool databases, and then the mismanaged services, or the unmanageable services, and
there's also your good old data lakes, which sometimes leak and turn into a data swamp. But the point is that, as a data scientist, you don't want to interact with all of this. All you care about is training a machine learning model and writing some simple code, and then, when you actually deploy a model, you want to do some request tracing and be able to debug your model easily.

On the other hand, the infra teams care about the aspects mentioned here, such as load balancing and scaling. An infra team might also care about how we bake the Docker images, how we configure the gateway so that the endpoints are routed correctly, and what the underlying infra is within Netflix: are we using Kubernetes, or are we using EC2, S3, and so on. So Metaflow hosting tries to reduce this gap between the infra teams and data scientists by providing a simple mechanism for data scientists to define services, which they can then use to deploy machine learning models or any arbitrary REST endpoints.

So what is Metaflow hosting? Before we go into that, I'm assuming that a lot of folks are familiar with Metaflow, but maybe some aren't. At a high level, a Metaflow flow is just a simple class which represents a DAG. Each function, or step, as we call it, represents a node in this DAG, and you can use this DAG to execute code in containers, or you could also execute it locally. Metaflow, internally as well as in open source, provides a lot of features. You can schedule your flows across orchestrators: in OSS you can schedule flows on Argo, and we have some internal orchestrators like Maestro within Netflix. For each of these steps you can also specify the resources with @titus. Titus is just the orchestration platform we have within Netflix, which is based on Amazon EC2 as well.

The other important thing to note is that each step in your Metaflow flow can generate these things known as artifacts, and these artifacts can then be used in
your deployed models. The way a deployed model can make use of an artifact is shown here. This is the entirety of the code that a data scientist needs to write in order to deploy the service across multiple instances, with load balancing as well as proper request tracing and observability metrics. You have the example web service defined, which inherits from a particular class, and you get to specify the resources needed as well as the autoscaling params, like minimum or maximum instances. Then you can use this function called init app to initialize your application with whatever relevant models you need, or you could also specify if you need some particular package. Finally, we have this @endpoint decorator, where you specify the name of your endpoint, followed by accessing the artifact. This artifact could be a machine learning model that you just trained via a Metaflow flow, like the one shown in the previous slide. In this function you have this request dictionary, from which you can get the JSON body of your HTTP request and then run your model on it to get a response. Deploying the web service is just as easy as running this command, in which you specify the flow that you want your service to be associated with and the definition of your service.

Metaflow hosting provides a lot of features. This is what I mean by simple definition and simple deployment. There's also the notion of dependency management, where you can specify the environment name, and that basically installs the relevant pip packages or conda packages needed for your service. We support autoscaling, which is scaling according to the number of requests. There's versioning, traceability, and support for async requests.

So this is what a deployed model looks like. You have the model deployed across multiple instances at the top. When you make a single request to an endpoint, you can trace the request via Elasticsearch indexes, and we also support tracing via Edgar or Zipkin. You have logging via Radar, in which you can see how your
service is performing with respect to the status codes. And finally you have the good old metrics, which you can use to find out the status of your service, whether it's up or not. This is actually used by us to do the autoscaling as well: if we see a lot of requests coming in, then we have a job, or a service, which can autoscale your service.

Moving on to implementation details: there are multiple ways of implementing a Metaflow hosting application internally, but there are pros and cons for each of them. We considered DNS redirection as one of the options, but we wanted to look at all of these options based on these four criteria. Consistency simply means that when you deploy a new version of your service, the redirection should properly redirect to that latest version of the service. This is an issue with DNS redirection, because it's based on a caching mechanism: if you have just deployed a new service, the cache would not be updated immediately, and if the user makes a request to the latest version of the endpoint, the request might still be routed to an older version.

The other criterion we are interested in is low maintenance, and that's particularly important for us because we are a small team and we do not want to maintain the infrastructure for a lot of services. So while a proxy server and OpenFaaS both support a lot of these features, OpenFaaS was the best for us because it's low maintenance: there's another team within Netflix maintaining it for us. Then there are some other features, like logging, tracing, and load balancing, which are supported by both of these.

For those unfamiliar with OpenFaaS, it is an open source framework for serverless functions, the kind of functions you get with AWS Lambda, but open source. It can be based on any backend, so you could have Kubernetes as a backend, or, in our case, we use Titus as a backend. Titus is also another orchestration platform within Netflix. So yeah, we chose OpenFaaS for our Metaflow hosting
implementation.

The Metaflow hosting implementation consists of two main aspects: one is the control plane, and one is the Metaflow client, or the container. The control plane itself is written in Go. It's highly performant, and it is responsible for creating the instances for a new service. The control plane is also the service which routes a query from the user's machine to a particular microservice or endpoint. The control plane also does other fancy stuff, like generating a Swagger for the user, and it's also responsible for maintaining the state of the world. By maintaining the state of the world, we mean that it needs to know which versions exist right now, which versions can be deployed or undeployed, and which version corresponds to the latest production version.

On the other hand, we have the Metaflow client and containers, which is the code that the data scientists actually interact with. The Metaflow client essentially downloads and packages the user's code within a Titus instance, and then it fetches the relevant artifacts, like I mentioned earlier. We then do some user-directed initialization and dependency management, that is, downloading the relevant packages, and finally we serve the HTTP endpoints using Flask and of-watchdog. of-watchdog is also based on Go, in order for it to be a simple server that is highly performant.

Going over this quickly: we have the OpenFaaS gateway, which we fondly call Barkley internally, and based on this gateway we create another service called the control plane, which I mentioned in the previous slide. This control plane interacts with an RDS instance, which is a database that stores the state for all our hosting functions.

Let's go over the deployment path. When a user deploys a new service, we first go to the OpenFaaS gateway, which then makes a call to the control plane. The control plane then fetches the relevant information from the database to find out whether the service has already been deployed and we're deploying a new version, or
whether we want to create a completely new service. After that, the control plane returns back to the OpenFaaS gateway, which interacts with our orchestration platform to actually start the services needed. So this is what finally deploys the hosted microservice, and you can have multiple instances of the same.

Now, going over the query path: if Reed Hastings wants to query a particular endpoint within our hosted application, the query first goes to the gateway, which again goes to the control plane, which uses the DB to route it to the appropriate endpoint. It goes back to the gateway again, which then actually makes the call to the microservice, or the hosted front end, and finally we get the response back.

So that's Metaflow hosting at a high level. We continue developing Metaflow hosting internally. One of the features that was added recently was async hosting, in which you can have long-running jobs. In general, Metaflow hosting supports a request for a period of 20 minutes, but with async hosting we support jobs that can run for as long as 12 hours. This is particularly useful for the media ML team, because they run inference on movies, which are pretty large, or long, and take a long time to execute. The reason for the 12-hour time limit is the SQS queue, which is what we used in our implementation. This is an example of how you can use polling to get the response from an async request, but you can also have a simple callback function to do whatever you want.

One of the latest features that we integrated into the Metaflow hosting pipeline is support for a GraphQL backend. Why do you want support for a GraphQL backend? At a big company like Netflix, you can have multiple microservices owned by different teams. As a client UI engineer who defines or creates front ends, you don't want to make calls to 10 different microservices in order to get the response for making a UI page. Having a
federated GraphQL edge allows you to make one single query. This federated gateway routes the queries to the appropriate microservices, gets back the responses from each one of them, collates them, and then returns a unified response back to the client UI engineer.

So in order to support a federated gateway, we recently added support for defining a GraphQL backend in Metaflow, and it's pretty seamless for a user to add this. There is literally no change in their service code: the service code remains the same, except that within their endpoint they can define the name for the GraphQL endpoint as well as the input type and output type. GraphQL is a typed language, unlike Python, so you need to define the types, like here, and then we do all the associated stuff, like deploying it to the federated edge.

And that's it. Any questions? Thank you.
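
The service definition described in the transcript (a service class, an init hook that loads an artifact, and a named endpoint that receives a request dictionary) can be sketched roughly as follows. Metaflow hosting is internal to Netflix and its API is not public, so this is a framework-free toy and not the real interface: `ExampleWebService`, `init_app`, `endpoint`, `handle`, `ENDPOINTS`, and the resource values are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical registry of endpoint names; the real hosting layer is not public.
ENDPOINTS = {}


def endpoint(name):
    """Toy decorator: register a method under an endpoint name."""
    def register(func):
        ENDPOINTS[name] = func
        return func
    return register


class ExampleWebService:
    # Illustrative resource and autoscaling hints, mirroring what the talk lists.
    resources = {"cpu": 2, "memory_mb": 4096}
    autoscaling = {"min_instances": 1, "max_instances": 5}

    def init_app(self, artifacts):
        # In the talk, the artifact is a model produced by a flow step.
        self.model = artifacts["model"]

    @endpoint("predict")
    def predict(self, request):
        # `request` is a dict carrying the parsed JSON body of the HTTP call.
        features = request["json"]["features"]
        return {"prediction": self.model(features)}


def handle(service, name, body):
    """Dispatch a raw JSON body to the named endpoint and return JSON text."""
    request = {"json": json.loads(body)}
    return json.dumps(ENDPOINTS[name](service, request))


service = ExampleWebService()
# A plain function stands in for a trained model artifact here.
service.init_app({"model": lambda xs: sum(xs) / len(xs)})
response = handle(service, "predict", '{"features": [1, 2, 3]}')
```

In the system the talk describes, the HTTP layer is Flask behind of-watchdog; the `handle` function above only stands in for that layer so the shape of the user-facing code is visible.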

Original Description

This spring at Netflix HQ in Los Gatos, we hosted an ML and AI mixer that brought together talks, food, drinks, and engaging discussions on the latest in machine learning, infrastructure, LLMs, and foundation models. This talk was by Shashank Srikanth, Netflix.


This video explains how Netflix hosts models for internal use cases with Metaflow hosting, a Python-based hosting layer built on the Flask framework. It covers OpenFaaS as the serverless framework, the GraphQL backend, and async hosting for long-running jobs. By the end of this video, viewers should understand how a Metaflow hosting service is defined, deployed, and routed.

Key Takeaways
  1. Specify resources needed and autoscaling parameters in the service definition
  2. Use the init app function to initialize an application with relevant models
  3. Decorate an endpoint to specify the name of the endpoint and access artifacts
  4. Deploy a web service by running a command to specify the flow and service definition
  5. Create instances for a new service
  6. Route a query from the user's machine to a particular microservice or endpoint
  7. Generate a Swagger definition for the user
  8. Maintain state of the world by knowing which versions exist, which versions can be deployed or undeployed, and which version corresponds to the latest production version
  9. Download and package the user's code within a Titus instance
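
The control-plane duties in the list above (tracking which versions exist, which one is latest, and routing a query accordingly) can be illustrated with a small in-memory sketch. The real control plane is written in Go and backed by an RDS database according to the talk; the class names, method names, service name, and addresses below are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ServiceState:
    # Maps version label -> instance address (hypothetical values).
    versions: Dict[str, str] = field(default_factory=dict)
    latest: Optional[str] = None


class ControlPlane:
    """Toy 'state of the world': which versions exist and which is latest."""

    def __init__(self):
        self.services: Dict[str, ServiceState] = {}

    def deploy(self, service, version, address):
        state = self.services.setdefault(service, ServiceState())
        state.versions[version] = address
        state.latest = version  # the newest deploy becomes the default target
        return state

    def undeploy(self, service, version):
        state = self.services[service]
        del state.versions[version]
        if state.latest == version:
            # Fall back to the most recently added remaining version, if any.
            state.latest = next(reversed(state.versions), None)

    def route(self, service, version=None):
        # Look up the target directly in current state, so a new deploy is
        # visible on the very next query (no cached redirect to go stale).
        state = self.services[service]
        return state.versions[version or state.latest]


cp = ControlPlane()
cp.deploy("subtitles", "v1", "10.0.0.1:8080")
cp.deploy("subtitles", "v2", "10.0.0.2:8080")
latest_target = cp.route("subtitles")
pinned_target = cp.route("subtitles", "v1")
```

Resolving the target from current state on every query is one way to meet the consistency criterion the talk raises against DNS redirection, where a cached record can keep pointing at an older version.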
💡 Metaflow hosting reduces the gap between data scientists and infra teams by providing a simple mechanism for data scientists to define services, and async hosting allows jobs to run for up to 12 hours, making it useful for long-running jobs such as inference on large movies.
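
The async pattern mentioned above (submit a long-running job, then either poll for the result or register a callback) can be sketched as follows. The talk says the real implementation uses an SQS queue; this example swaps in an in-process thread pool purely to show the client-side flow, and `AsyncEndpoint`, `submit`, `status`, `poll`, and `slow_inference` are hypothetical names.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncEndpoint:
    """Toy async endpoint: submit returns a job id, status reports progress."""

    def __init__(self, func, max_workers=2):
        self._func = func
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, payload, callback=None):
        job_id = str(uuid.uuid4())
        future = self._pool.submit(self._func, payload)
        if callback is not None:
            # Alternative to polling: invoke the callback once the job is done.
            future.add_done_callback(lambda f: callback(job_id, f.result()))
        self._jobs[job_id] = future
        return job_id

    def status(self, job_id):
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        return {"state": "done", "result": future.result()}


def poll(endpoint, job_id, interval=0.01, timeout=5.0):
    """Check the job status repeatedly until it finishes or the timeout hits."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = endpoint.status(job_id)
        if status["state"] == "done":
            return status["result"]
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish in {timeout}s")


def slow_inference(payload):
    time.sleep(0.05)  # stands in for a long-running inference job
    return {"frames": payload["frames"], "label": "ok"}


endpoint = AsyncEndpoint(slow_inference)
job_id = endpoint.submit({"frames": 240})
result = poll(endpoint, job_id)
```

Returning a job id immediately keeps the HTTP request short, which is what lets the work itself outlive the 20-minute request window described in the talk.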
