Building a GenAI Ready ML Platform with Metaflow at Autodesk
Key Takeaways
Autodesk's Machine Learning Platform utilizes Metaflow as its primary foundation for managed training infrastructure, integrating with various tools such as AWS Batch, SageMaker Studio, and Docker for scalable and reproducible workflows. The platform leverages Metaflow's capabilities for experiment tracking, security hardening, and customizable interfaces to streamline the machine learning pipeline.
Full Transcript
So, as I mentioned, our first speaker for today is Riley. Riley joins us from Vancouver, Canada. I mentioned that we're seeing a more diverse, geographically distributed presence, which is great. He's currently a senior software engineer on the machine learning platform at Autodesk. He's worked at a few other companies before, also in the MLOps space: he was a senior machine learning engineer at Coro, and a data scientist and MLOps engineer at sis go. Currently he's part of the Autodesk machine learning platform team, which uses Metaflow as the primary foundation for its managed training infrastructure. So we'd love to learn more about Metaflow use cases in the Autodesk machine learning platform, and anything else that you would like to share. Thank you so much, Riley, for joining us, and over to you.

Oh cool, thanks. So, as was mentioned, I'm part of the founding core of engineers building an ML platform at Autodesk from scratch. It's a unified platform; AMP stands for Autodesk Machine Learning Platform. We evaluated a lot of different orchestration tools but landed on Metaflow, as we liked that it could be used for a variety of different applications, like data, compute, orchestration, and versioning, so you get a lot of value. We were looking for an orchestration tool that would allow our users to construct machine learning pipelines, chaining together the different tasks in the end-to-end ML workflow, all the way from data preparation to model training and evaluation.

At Autodesk we're kind of boxed into using AWS as well, and that's another reason we chose Metaflow: it makes use of a lot of different AWS managed services, and it provides a nice wrapper for running AWS Batch to scale out our workflows, so that users can leverage this function to simplify the process of building and training
models.

We were also really elated to discover that reproducible experiments are the bread and butter of Metaflow, especially since we can version pretty much everything, including flow runs, data snapshots, and artifacts, and it keeps track of all the flows and experiments. This was important for us in terms of data lineage. A nice benefit as well is that Metaflow can be leveraged as an experiment tracker, and we're using it for that use case. So again, there is tremendous value there, but I think the biggest factor was the human-centric UX. One of our biggest worries was building tooling that wouldn't be adopted or used by our users. There's a bit of tech debt in our organization, and some users already have their own bespoke ML platforms, so for them to adopt our approach it would need to have really strong UX, so that they feel more motivated to transition to our platform than to stay on their own bespoke ML platforms.

This is a high-level overview of how we've integrated Metaflow into our managed IDE. We're using SageMaker Studio for our IDE; we call it AMP Studio, and it acts as the productivity suite for developing and training models. To access Studio, users interact with a UI, which in turn spins up their own personal Studio instances. Once they are logged into the Studio notebook, they have access to various tools, and Metaflow is one of them. Users can leverage it to create and orchestrate ML pipelines, run reproducible experiments, and perform training at scale, and then they can monitor their workflows through the UI. Additionally, users can bring in their own data, which both Studio and Metaflow will have access to import.

We're really, really early in our training infrastructure. We recently just rolled out to 50 users, so we quickly had to
put together just a pretty basic UI for now, but it does the job. Basically it authenticates and authorizes the data scientists and allows them to spin up their own personal Studio instances. Each team is assigned to a different project domain, and the login is controlled via SSO; upon authenticating, they're directed to the Studio launcher page. We made available a custom security-hardened image that has everything the user needs to run Metaflow jobs from the Studio notebooks. When they create a notebook, they're prompted with a menu where they select the Metaflow kernel image and a startup script. This kernel has Metaflow and Mamba baked in, and it sets METAFLOW_HOME so that it points to the Metaflow configuration file. Then all the user needs to do is use %%bash and run their Metaflow flow from within the notebook cell. And then here...

Riley, is it okay to ask questions now, or should we wait until the end?

Sure.

A couple of questions, then. One is about the UI that you mentioned, the barebones UI just to authenticate a user: is that something that you had to build, or is it possible to build that in SageMaker Studio?

We built it separately from SageMaker Studio. It's connected to an endpoint, the endpoint creates a presigned URL, and the presigned URL launches the Studio instance.

I see, okay. And secondly, SageMaker Studio would be running in an AWS account, so when the user eventually gets to, let's say, the notebooks, those notebooks would be running in the same AWS account in which SageMaker Studio was. So that's where the backend instances would run; that's where, I'm guessing, AWS Batch instances get created later, and things like that, right? So it's one account, and they don't have to worry about, if it's a different account, how you log in
separately into that account and do the same thing on that side. Is that right?

Almost right. It would be each team, actually. Our organization handles multi-tenancy a little differently than is conventional. It's kind of unorthodox, but every team has their own AWS account, so we provision Studio and Metaflow into each of the different stakeholder teams' accounts. From there, it provisions access to Studio from within their specific targeted production account, and that account will also have Metaflow.

I see. So it's like an account per team. Got it, okay, sounds good.

Cool. So this is the Studio interface, which has a launcher button that launches the Metaflow UI directly. Studio is quite customizable, so we can create a simple lifecycle configuration in Studio that configures the Jupyter server proxy and nginx so that users can view the Metaflow UI from the Studio notebook. Also, when the user runs a flow, there's a link provided to the flow run in the UI, so they can navigate to it directly. That's the UI.

This is our current Metaflow infrastructure using AWS managed services. Firstly, users execute all Metaflow runs from SageMaker Studio, so Studio serves as the point of entry for access to the Metaflow metadata service endpoint. We took care of security hardening every component, and a lot of the components leverage inner-sourced, security-hardened Terraform modules. The Docker images used for the UI, metadata service, Batch default image, and Studio kernel are based on in-house images provided by our security team, and there's a patching pipeline running on a regular cadence that refreshes and patches these images to ensure no security vulnerabilities are detected by Orca. For orchestration on a schedule we are leveraging Step Functions, so any failed executions on Step Functions
are sent to a Slack channel for alerting. Finally, we lean on AWS Batch as the compute layer for scaling flows, and again the AMIs used for the EC2 instances provisioned through Batch are all security-hardened AMIs. This infrastructure was vetted by our security team; we had to go through a formal approval from security through an architectural security review. That was the bulk of the work, just making sure every component was security hardened.

This is a really simple thing, but it actually got a lot of rave reviews from our users: we compiled a lot of different examples of using Metaflow in a Metaflow demos repo that our users can peruse to explore the capabilities of Metaflow. We've curated examples of fine-tuning an LLM using DeepSpeed, Fully Sharded Data Parallel, and Distributed Data Parallel within Metaflow. We have other examples too, like auto-tuning, experiment tracking with our managed TensorBoard instance, and parameter-efficient fine-tuning using Hugging Face's PEFT library. So chances are users will discover a use case that somewhat resembles what they want to do, and then they'll have a guiding template that they can follow to make it easier when they build their own custom flows. And then we...

Awesome.

Oh, thank you.

I just said the example repository looks awesome, because it is one of the more common things where people are like, "How do I get started with a specific use case in mind?" If there is an existing example, either you can just run it or you can use it as a template and then figure things out. This is pretty cool.

Yeah, exactly. We've also created a cookiecutter template that users can use to easily create their own custom projects structured with Metaflow. It enables consistent and strict naming conventions of flows,
projects, and tags. The user initializes a repository with some project values, and it produces a templated repository with templated flow code. These are the values that the user enters to initialize their Metaflow-compatible project, and it keeps their flows standardized. But the cookiecutter template does more than provide structure for a data science project; it also plays a huge role in how users run their flows in production using Step Functions. This is still relatively immature and we'll likely see further iterations in the future, but we at least have a basic working GitOps pattern for production workflows.

Here's how it works. Users do model evaluation, choose the Metaflow run that produces the best configuration, and tag their designated run with the prod-ready tag. The user then commits their code from Studio, including the Metaflow flow, pushes to the repo, and creates a PR in our enterprise GitHub. The push event is detected in the GitHub repository, and the PR triggers testing for the branch in Jenkins: it runs unit tests for the individual modules with pytest and optionally runs the flow itself. Jenkins invokes a Lambda function via Spinnaker, and the Lambda retrieves the Metaflow config from Secrets Manager and runs the flow on Step Functions.

Once the PR testing and review is successful, you merge to main. Merging to main triggers the Jenkins pipeline, which performs the unit testing for flow validation. Again, Jenkins invokes the Lambda via Spinnaker, and the Lambda pulls in the Metaflow config from Secrets Manager. The Lambda then uses the Metaflow Client API to ping the metadata endpoint and checks that the corresponding flow has the appropriate tag. The Lambda uses the Metaflow prod namespace, the global namespace. It pulls in the code artifact
from the flow run, also uses the Git hash to download the Git repo, and compares the two to check that the code is the same. Lastly, the Lambda compiles the flow and pushes it to the Step Functions orchestrator, and after it invokes the step function it adds the Git hash as a tag on the designated flow run.

Sorry, one quick question. In this case, when you say a Metaflow flow is deployed in production, is it deployed as an AWS Batch job?

It is deployed as a step function, or compiled as a step function.

Okay, so like python flow.py --production step-functions create.

Yeah.

Okay.

And this is the cookiecutter that initializes the project so that the user can interact with the CI/CD pipeline and push their flow into production. Note the unit tests, note the Jenkinsfile, and there's a run-step-function shell script. The user modularizes their code as internal libraries in the src folder, imports them in the flow file directly, and writes unit tests against those internal modules in the test folder. The user can then either push to a branch, which triggers an experimental deployment on Step Functions, or create a PR and merge to the main branch, which triggers a production deployment. We have a Lambda in place to monitor failed Step Functions executions, so if a production flow fails it triggers a Slack channel alert. So that is our orchestrator process. It's very simple, and I'm sure it'll be refined as we get more feedback.

Now on to distributed training. We've been making heavy use of Metaflow's parallel decorator, which runs AWS Batch multi-node parallel jobs to run distributed training. We found it quite straightforward to initiate a gang-scheduled cluster: you simply need to specify the number of nodes in the num_parallel argument and then wrap the subsequent step in a parallel
decorator.

We have battle-tested AWS Batch multi-node distributed training using a variety of distributed computing frameworks, such as Hugging Face Accelerate, PyTorch Lightning, DeepSpeed, and also TensorFlow distributed, and we have curated examples of each of these in our Metaflow demos repo. Our users are currently playing around with some of those demos. We monitor the GPU and CPU utilization of our distributed training jobs using the Metaflow GUI; the custom GPU profile decorator is helpful with this. The GUI shows the execution of the distributed training job for the worker nodes and the head node. We also provide out-of-the-box CloudWatch dashboards that show the GPU and CPU utilization if the user opts not to use the GPU profile decorator, and we are in the midst of experimenting with the TensorBoard profiler to also display these system metrics on our managed TensorBoard instance, which we have plugged into Studio similar to the Metaflow UI. From the monitoring GUI, users can click into each node for logging, and they can view the progress bar by inspecting the control node logs.

We created several Batch queues, each with a different instance family. Users can select the Batch queue in combination with their requested number of GPUs, CPUs, and other resources, upon which Metaflow will select the best instance type for the training job. We also support queues for Spot instances.

One of our data science research teams is really heavily invested in using Ray to train their models. At the time this was a bit of a blocker, because we needed to figure out a way to integrate Ray with our managed training infrastructure powered by Metaflow, especially since we had already decided to go with Metaflow. So we had to find a way to ease their transition into adopting Metaflow. One thing I learned was
that Ray is simply a framework, and Ray doesn't necessarily compete with Metaflow; they are very much complementary to each other. Actually, the more apt comparison would be Ray's VM launcher versus AWS Batch. So what I was curious to see was whether we could run a Ray cluster using AWS Batch multi-node parallel jobs, and since Ray is quite extensible, we discovered that it was in fact possible to use Metaflow to orchestrate the creation of the Ray cluster. Batch handles the infrastructure, setting up the drivers and workers, while Ray optimizes hardware usage. We used existing abstractions from the parallel decorator and teamed up with Outerbounds to introduce a Ray parallel decorator. This Ray parallel decorator can decorate a step and set up the necessary hardware using AWS Batch multi-node, and users just need to insert their Ray code in the step. During execution, Metaflow starts the transient Ray cluster, runs the Ray application, and then shuts down the cluster upon task completion.

We have battle-tested the Ray parallel decorator and used it to fine-tune a six-billion-parameter model, as well as for a variety of other applications and use cases, using DeepSpeed and Ray Train. The fine-tuning job completed in about 50 minutes using just 16 A10 GPU nodes, and the Ray logs are displayed in the Metaflow UI, so users can monitor their Ray application seamlessly.

There are some caveats with this integration, though. It doesn't support heterogeneous clusters, mainly because AWS Batch multi-node doesn't support them: currently all node groups in a multi-node parallel job have to use the same instance type. It also doesn't yet support specifying several different instance types and availability zones in the Ray cluster autoscaling config, such that when Spot instances become unavailable, smaller instances are created.

These are some of our
benchmarking results that we collated from running a distributed fine-tuning job on Metaflow using AWS Batch multi-node. We tested on two to four nodes, with each node having four A10 GPUs. All the tests used PyTorch Lightning in conjunction with DeepSpeed. Activation checkpointing was turned on to increase the throughput, and we offloaded the optimizer to CPU to decrease the GPU memory footprint. The model was a T5 3-billion-parameter Transformer model. The results show that four nodes, with 16 GPUs in total, was more efficient and cost-effective. And this is our Ray Train plus DeepSpeed benchmarking on a different model; we used the six-billion-parameter GPT-J model, so that's a separate test.

We're continuing to enhance our distributed training infrastructure with AWS Batch, and that includes setting up high-performance computing, which AWS has enabled for Batch. There are two facets of HPC. The first is Elastic Fabric Adapter, a networking feature that's attached to EC2 instances to improve inter-node communication. When working with larger models and enormous amounts of data, inter-node communication becomes a huge bottleneck, which makes sense, because the nodes have to communicate with each other, transmit their data to each other, and sync model updates. EFA will play a huge role in mitigating that bottleneck, so now we have a way forward to scale up to hundreds or thousands of nodes with low-latency inter-node communication. Some of the work we did here was to security harden the Deep Learning AMI that AWS provides, which comes with the EFA driver installed, and then plug this AMI into Batch. The Deep Learning AMI isn't compatible with Batch out of the box, so we had to install the ECS agent, and we use chefline Chef to do the installation, and
chefline Chef's kind of similar to Packer, if you're familiar with that. With an A100 GPU you can attach up to four EFA network devices, so given that, users will be able to leverage EFA by running their training jobs with NCCL. We've also open-sourced the integration with EFA in Metaflow, so that anyone can use the Metaflow batch decorator and specify the number of EFA devices that they want to attach.

The next thing we've implemented is a high-performance file system leveraging FSx for Lustre, which provides low-latency access to data, with throughput levels of hundreds of gigabytes per second. FSx for Lustre will be really useful when working with our terabytes of training data: instead of reading it in directly from S3 and saving it to memory or disk space, we can use FSx for Lustre to sync with S3, and the user can pull in the data from the file system directly. It's also a parallel file system, so it handles simultaneous access from the multiple nodes of the HPC cluster. We integrated Metaflow with FSx for Lustre by using Metaflow's host volume mounting feature, so that in the batch decorator you can specify host volumes and then target the mount path.

All of this work, HPC plus the Metaflow-Ray integration, is detailed in our internal tech blog. And yeah, I think that's pretty much it.
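
Editor's sketch: the production promotion check described in the GitOps part of the talk (the Lambda confirms that the chosen run carries the production tag, then compares the code stored with the run against the Git checkout) could look roughly like the following. This is a minimal illustration using only the Python standard library, not Autodesk's implementation. `RunRecord`, `digest`, `can_promote`, and the tag spelling `prod_ready` are all hypothetical names; a real Lambda would read the tags and code package through the Metaflow Client API and fetch the repository at the recorded Git hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for what the Lambda reads about a flow run."""
    flow_name: str
    run_id: str
    tags: frozenset   # tags attached to the run
    code_files: dict  # relative path -> file bytes stored with the run


def digest(files: dict) -> str:
    """Hash a {path: bytes} mapping so that insertion order does not matter."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


def can_promote(run: RunRecord, repo_files: dict,
                required_tag: str = "prod_ready") -> tuple:
    """Return (ok, reason) for deploying this run's flow to production.

    The tag name is an assumption; the talk only says the run is tagged
    as ready for production.
    """
    if required_tag not in run.tags:
        return False, f"run {run.run_id} is missing tag '{required_tag}'"
    if digest(run.code_files) != digest(repo_files):
        return False, "code stored with the run differs from the Git checkout"
    return True, "ok"
```

In this sketch a deployment step would call `can_promote(run, repo_files)` and only proceed to compile the flow for Step Functions when the first element of the result is true; the reason string is what would be logged or sent to an alert channel otherwise.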
Original Description
This talk discusses the Autodesk Machine Learning Platform built with Metaflow, which serves as the cornerstone of the platform's managed training infrastructure. It explores the initial integration stages, highlighting the seamless connection between Metaflow and SageMaker Studio for training job initiation and Metaflow UI access. It dives deep into the mechanisms behind enabling distributed training in Metaflow, the strategic incorporation of GitOps for efficient workflow orchestration, the use of other managed AWS services like the FSx file system and EFA, and various other enhancements that strengthen the training framework.
Discover more such stories at slack.outerbounds.co