Hosting Models at Scale
Key Takeaways
The video discusses Metaflow hosting for internal use cases at Netflix, which builds on Metaflow, Flask, and Titus to narrow the gap between data scientists and infra teams. It also covers the use of OpenFaaS as a serverless framework, support for a GraphQL backend, and async hosting for long-running jobs.
Full Transcript
Thank you. So we have one more talk. You may have heard speakers in some of the previous talks refer to Metaflow hosting, which is not something in open source, but it's similar to what BentoML would do, which has been described as well. So Shashank, who is also from the Netflix Metaflow team, will describe what Metaflow hosting is and how we use it.
Thank you. Hi, I'm Shashank. We have talked a lot about Metaflow hosting in previous talks: Aliki's talk mentioned how they leverage Metaflow hosting to do predictions for their DSC projects, and we also had the Amber team, that is, the media ML team, talking about how they leverage async hosting for async endpoints for computing Amber features. In this talk we are just going to talk about the infrastructure behind Metaflow hosting.
So the first question is: why do we need Metaflow hosting? When you think about machine learning at Netflix, the first thought that comes to mind is recommendations, or predicting which movie a user would want to watch. There is indeed a lot of research and machine learning being used to do that, but it lies directly on the consumer side of inference. Metaflow hosting, on the other hand, is used for internal use cases. It's based on a Python hosting model, the Flask framework, so it's definitely not meant for consumer-scale models. The places where Metaflow hosting is useful are internal UI tools, like identifying the quality of a network or performing machine translation for subtitles, stuff like that.
The other reason we need Metaflow hosting is that there's a big difference between what data scientists expect their code to be and what infra teams need to provide them. There's this nice comic about every cloud architecture in big tech: you have these cool databases, and then the mismanaged services, or the unmanageable services, and there are also your good old data lakes, which sometimes leak and turn into a data swamp. But the point is that as a data scientist you don't want to interact with all of this. All you care about is training a machine learning model and writing some simple code, and then when you actually deploy a model you want to do some request tracing and be able to debug your model easily. The infra teams, on the other hand, care about the aspects mentioned here, such as load balancing and scaling. An infra team might also care about how we bake the Docker images, how we configure the gateway so that the endpoints are routed correctly, and what the underlying infra is within Netflix: are we using Kubernetes, or are we using EC2, S3, and so on. So Metaflow hosting tries to reduce this gap between the infra teams and data scientists by providing a simple mechanism for data scientists to define services, which they can then use to deploy machine learning models or any arbitrary REST endpoints.
So what is Metaflow hosting? Before we go into that, I'm assuming that a lot of folks are familiar with Metaflow, but maybe some aren't. At a high level, a Metaflow flow is just a simple class which represents a DAG. Each function, or step as we call it, represents a node in this DAG, and you can use this DAG to execute code in containers, or you could also execute it locally. Metaflow, internally as well as in open source, provides a lot of features. You can schedule your flows across orchestrators: in OSS you can schedule flows on Argo, and we have some internal orchestrators like Maestro within Netflix. For each of these steps you can also specify the resources; @titus is just the orchestration platform we have within Netflix, which is based on Amazon EC2 as well. The other important thing to note is that each step in your Metaflow flow can generate these things known as artifacts, and these artifacts can then be used in your deployed models.
The way a deployed model can make use of an artifact is shown here. This is the entirety of the code that a data scientist needs to write in order to deploy the service across multiple instances, with load balancing as well as proper request tracing and observability metrics. You have the example web service defined, which inherits from a particular class, and you can specify the resources needed as well as the autoscaling params, like minimum or maximum instances. Then you can use this function called init app to initialize your application with whatever relevant models you need, or you could also specify if you need some particular package. Finally, we have this @endpoint decorator, where you specify the name of your endpoint, followed by accessing the artifact. This artifact could be a machine learning model that you just trained via a Metaflow flow, as shown in the previous slide. In this function you have this request dictionary from which you can get the JSON body of your HTTP request, and then run your model on it to get a response. Deploying the web service is just as easy as running this command, in which you specify the flow that you want your service to be associated with and the definition of your service.
Metaflow hosting provides a lot of features. That is what I mean by simple definition and simple deployment. There's also the notion of dependency management, where you can specify the environment name, and that basically installs the relevant pip packages or conda packages needed for your service. We support autoscaling, which is scaling according to the number of requests. There's versioning, traceability, and support for async requests.
So this is what a deployed model looks like. You have the model deployed across multiple instances at the top. When you make a single request to an endpoint, you can trace the request via Elasticsearch indexes, and we also support tracing via Edgar or Zipkin. You have logging via Radar, where you can see how your service is performing with respect to the status codes. And finally you have the good old metrics, which you can use to find out the status of your service, whether it's up or not. This is actually used by us to do the autoscaling as well: if we see a lot of requests coming in, then we have a job or a service which can autoscale your service.
Moving on to implementation details: there are multiple ways of implementing a Metaflow hosting application internally, but there are pros and cons for each of them. We considered DNS redirection as one of the options, but we wanted to look at all of these options based on these four criteria. Consistency simply means that when you deploy a new version of your service, the redirection should be able to properly redirect to that latest version of the service. This is an issue with DNS redirection because it's based on a caching mechanism: if you just deployed a new service, the cache would not be updated immediately, and if the user makes a request to the latest version of the endpoint, the request might still be routed to an older version. The other criterion we are interested in is low maintenance, and that's particularly important for us because we are a small team and we do not want to maintain the infrastructure for a lot of services. So while a proxy server and OpenFaaS both support a lot of these features, OpenFaaS was the best for us because it's low maintenance: there's another team within Netflix maintaining it for us. Then there are some other features like logging, tracing, and load balancing, which are supported by both of these. For those unfamiliar with OpenFaaS, it's an open-source framework for serverless functions, similar to AWS Lambda but open source. It can be based on any backend, so you could have Kubernetes as a backend, or in our case we use Titus as the backend; Titus is another orchestration platform within Netflix. So yeah, we chose OpenFaaS for our Metaflow hosting implementation.
The Metaflow hosting implementation consists of two main aspects: one is the control plane and the other is the Metaflow client, or the container. The control plane itself is written in Go. It's highly performant, and it is responsible for creating the instances for a new service. The control plane is also the service which routes a query from the user's machine to a particular microservice or endpoint. It also does other fancy stuff like generating a Swagger spec for the user, and it's responsible for maintaining the state of the world. By maintaining the state of the world we mean that it needs to know which versions exist right now, which versions can be deployed or undeployed, and which version corresponds to the latest production version.
On the other hand, we have the Metaflow client and containers, which is the code that the data scientists actually interact with. The Metaflow client essentially downloads and packages the user's code within a Titus instance, and then it fetches the relevant artifacts, as I mentioned earlier. We then do some user-directed initialization and dependency management, that is, downloading the relevant packages, and finally we serve the HTTP endpoints using Flask and of-watchdog. of-watchdog is also written in Go, so that it is a simple server that is highly performant.
Going over this quickly: we have the OpenFaaS gateway, which we fondly call Barkley internally, and based on this gateway we create another service called the control plane, which I mentioned in the previous slide. This control plane interacts with an RDS instance, which is a database that stores the state for all our hosting functions.
Let's go over the deployment path. When a user deploys a new service, we first go to the OpenFaaS gateway, which then makes a call to the control plane. The control plane then fetches the relevant information from the database to find out whether the service has already been deployed and we're deploying a new version of it, or whether we want to create a completely new service. After that, the control plane returns to the OpenFaaS gateway, which interacts with our orchestration platform to actually start the services needed. This is what finally deploys the hosted microservice, and you can have multiple instances of it.
Now, going over the query path: if Reed Hastings wants to query a particular endpoint within our hosted application, the query first goes to the gateway, which again goes to the control plane, which uses the DB to route it to the appropriate endpoint. It goes back to the gateway again, which then actually makes the call to the microservice, or the hosted front end, and finally we get the response back. So that's Metaflow hosting at a high level.
We continue developing Metaflow hosting internally. One of the features that was added recently was async hosting, with which you can have long-running jobs. In general, Metaflow hosting supports a request for a period of 20 minutes, but with async hosting we support jobs that can run for as long as 12 hours. This is particularly useful for the media ML team because they run inference on movies, which are pretty large or long and take a long time to execute. The reason for the 12-hour time limit is the SQS queue, which is what we used in our implementation. This is an example of how you can use polling to get the response from an async request, but you can also have a simple callback function to do whatever you want.
One of the latest features that we integrated into the Metaflow hosting pipeline is support for a GraphQL backend. Why do you want support for a GraphQL backend? At a big company like Netflix you can have multiple microservices owned by different teams. As a client UI engineer who creates front ends, you don't want to make calls to 10 different microservices in order to get the response for building one UI page. Having a federated GraphQL edge allows you to make one single query; the federated gateway then routes the queries to the appropriate microservices, gets back the responses from each one of them, collates them, and returns a unified response to the client UI engineer. So in order to support a federated gateway, we recently added support for defining a GraphQL backend in Metaflow, and it's pretty seamless for a user to add this. There is literally no change in their service code: the service code remains the same, except that within their endpoint they can define the name for the GraphQL endpoint as well as the input type and output type. GraphQL is a typed language, unlike Python, so you need to define the types like here, and then we do all the associated stuff like deploying it to the federated edge.
And that's it. Any questions? Thank you.
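The talk mentions fetching the result of an async request either by polling or through a callback, but the slide code is not part of the transcript. The sketch below only illustrates that general pattern in plain Python. Every name in it (`AsyncJobStore`, `poll_for_result`, `slow_inference`) is hypothetical and not Metaflow hosting's actual API, and a background thread stands in for the queue-backed job execution described in the talk.

```python
import threading
import time
import uuid


class AsyncJobStore:
    """Hypothetical in-memory stand-in for an async hosting backend."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def submit(self, func, payload, callback=None):
        """Run func(payload) in a background thread and return a job id."""
        job_id = str(uuid.uuid4())

        def run():
            result = func(payload)
            # Store the result first so that pollers can see it.
            with self._lock:
                self._results[job_id] = result
            # Then notify the optional callback with the same result.
            if callback is not None:
                callback(job_id, result)

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def get(self, job_id):
        """Return (done, result); result is None until the job finishes."""
        with self._lock:
            if job_id in self._results:
                return True, self._results[job_id]
        return False, None


def poll_for_result(store, job_id, interval_s=0.01, timeout_s=5.0):
    """Poll the store until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done, result = store.get(job_id)
        if done:
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")


def slow_inference(payload):
    """Assumed placeholder for a long-running model call."""
    time.sleep(0.05)
    return {"frames": payload["frames"], "score": 0.5}


if __name__ == "__main__":
    store = AsyncJobStore()
    job_id = store.submit(slow_inference, {"frames": 3})
    print(poll_for_result(store, job_id))
```

Polling keeps the client simple at the cost of repeated status checks, while a callback avoids the checks but requires the client to stay reachable for the duration of the job.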
Original Description
This spring at Netflix HQ in Los Gatos, we hosted an ML and AI mixer that brought together talks, food, drinks, and engaging discussions on the latest in machine learning, infrastructure, LLMs, and foundation models.
This talk was by Shashank Srikanth, Netflix.