Fuelling Decisions: How DTN Powers Gas Pricing and Data Science Collaboration

Outerbounds · Intermediate ·🛠️ AI Tools & Apps ·2y ago

Key Takeaways

DTN uses Metaflow with Kubernetes to build a collaborative JupyterHub data science platform, with automated pipelines that deploy flows from GitLab to Argo Workflows and with Kubecost tracking the cost of those flows.

Full Transcript

Tyler is a data science lead at DTN, a company with a very interesting tagline: it "provides operational intelligence for confident decisions." I love it. And of course he's been a very ardent follower and user of Metaflow for several years now, I believe. So welcome, Tyler, and over to you.

Thank you, Shri. As mentioned, I'm a data science platform lead at DTN. It's actually been less than a year, Shri. I know it's felt like years, but I'm pretty new. Let me know if you can't see this, but I think it's working. The title of my talk is "Increasing Velocity with Metaflow." My goal today is to make this a demonstration of the workflow of our average data scientist at DTN, and of how they go from iterating to actually deploying a production flow with Metaflow.

Here's a bit of the agenda for today. I'll start with some of the types of things that DTN does, and then we'll get into the different aspects of our architecture and how the data scientists use them.

DTN primarily focuses on three main areas: weather, fuel, and agriculture. On the data science side, these are some examples of projects we've built, products that customers are actively using. This one is called Storm Impact Analytics. Essentially, we ingest weather forecasts and it predicts potential impacts to infrastructure, be that power, roads, etc. When our customers get a prediction, they're better able to allocate resources, to send more people to fix power lines before the storm hits, that kind of thing.

On the agriculture side, one of the main things we do is crop yield modeling. We ingest satellite imagery data and previous harvest data in order to generate models of estimated yields for different crops for the current year. This is an example of a graph that comes out of one of those types of models. And then finally, fuel
demand modeling. This is on our fuel side of things. DTN is involved with a lot of the point-of-sale systems for the last leg of fuel transport, which is essentially the tankers to the gas stations. With that data we're able to predict demand over time. This is an example of demand year over year during COVID, and how it's a little bit lower. So those are some examples of what DTN does on the data science side.

On to how we do it. Our main interface with Metaflow is through a project called JupyterHub, and I'll show that in a second. Essentially, JupyterHub is a preconfigured computational environment. All we have to do to get someone set up in our environment is enable their email in our SSO. When they log in, they have access to conda environments, a shared file system, and a decently large amount of compute, and they're off to the races.

I'll show a little demo of what that looks like. This is an example of JupyterHub, and this is actually a production environment. I just created a few example files to show how it works. I think most of us are familiar with Jupyter notebooks: you can execute Python in real time. Each user has the ability to use about 130 gigs of RAM. We had to split that up because this is running on a single overallocated EC2 instance, so if users were unlimited they would crash the server, which was never fun.

The nice thing about this is that Metaflow is completely set up via environment variables. You can see here we have a shared environment, which is immutable to the user, so when people iterate on JupyterHub itself they'll be using this DS environment. This is useful because occasionally we have to manually modify Metaflow for certain things that occur in our environment. One of those, for example, is the Kubernetes autoscaler: it was deciding to move pods from one node to another to try
and eliminate nodes that were underutilized, and naturally that would break your job and make data scientists upset. So it's nice to be able to push out updates to Metaflow so that users don't have to worry about doing it themselves, and we have consistency across all of our users.

The other thing is that Metaflow itself is preconfigured. Here are just a few of the Metaflow environment variables. This isn't all of them, because some of them are sensitive, with tokens to Argo and that kind of thing. But you can see here that we set the container image, and this is because pulling the public Docker image with large jobs was blowing out the limits, so we pull from Amazon ECR because we run on Amazon. We also have Kubernetes automatically configured, so as soon as the user logs in, it authenticates to our Kubernetes cluster through IAM roles. It's really nice in that it's kind of a click-button solution to get your data scientists up and running.

Hey, quick question here, Tyler. Is Jupyter itself also running on Kubernetes, or is that a separate setup, with the Kubernetes environment for Metaflow being a separate setup?

It could be; currently it's not. Right now it's just running on a single large EC2 instance. Part of that is that it was just the existing architecture, but it also enables us to use a single large instance and overallocate it really easily, whereas on Kubernetes the default is for each user to have their own pod.

Got it. So one large EC2 instance hosts JupyterHub. Users log in, I guess there is a URL for logging in, so you have a UI and it's authenticated. When they log in, all users share the same EC2 instance, so all their Jupyter notebooks are running on the same EC2 instance, which has preconfigured Metaflow. And then when they run their flows, the flows actually run on a Kubernetes cluster that is elsewhere, on AWS but
elsewhere, maybe?

Correct, it's on an EKS cluster on AWS.

And that's obviously if they use the Kubernetes decorator; they can also run flows locally, of course.

Yeah.

Nice. And if anyone has any other questions, feel free to ask them as we go as well.

All right. As I mentioned, I wanted to show a sample pipeline of how a project starts and then actually gets scheduled on Argo. We have a cookiecutter repo that data scientists use to create all of the template files for a project. Within that, you can schedule multiple flows in the GitLab CI file, and whenever your flow is pushed to the main branch in GitLab, it automatically deploys itself to Argo.

To start showing how that works: most of our data scientists' iteration is done in Jupyter notebooks, and the next step, once you have a working concept, is to create a flow file. We've all seen Metaflow files. With our setup, flows are put in the source directory, because by default Metaflow includes all of the Python files in the current directory and its subdirectories in the code package. Here I created an example of adding a really simple function to your source. You can test this locally, and then you can also test it on Kubernetes when you want to scale up. This is an example of adding the Kubernetes decorator, and here we can see an example of it running. It's worth noting that we also have this set up with the UI, which makes debugging really easy: you can just click the link and go look at the UI and see all the standard error and standard output for your flows.

Once you have your flow set up the way you want and you want to deploy it to prod, you go to the GitLab CI file. In the base of the repo there's a file that is the CI, and this is what GitLab reads to see which Python files are listed to push through the CI. Let me move this so I can actually access
that. We made it so that this pulls from a template repository, so again, if we want to change how the template works for all data scientists, we just change the template in GitLab and all of the projects inherit from it. This is what the template looks like; hopefully this is big enough. Essentially it runs that argo-workflows create command. We've added a few additional things: it tags the deployment with the Git SHA, and we also have a Slack webhook on Argo so that if things fail it notifies us in a Slack channel. This is all triggered to run whenever there's a change to either the Python file for your flow or any of the subdirectories within your source files. The goal is that the state of your flow is always represented in the main branch, so that when another person comes by to fix your flow, if it's broken and the main developer is on vacation, we know the state of the flow and where all the code is.

And I'm guessing that GitLab is configured to have the kubeconfig of the Kubernetes cluster? So when you do python whatever_flow.py argo-workflows create, it's already running in an environment that has access to the Kubernetes cluster; it has access to the kubeconfig so that it can connect to that cluster, and so on, right?

Correct, yeah. And that's where this CI project user comes into play. We have it configured with an IAM role that has sufficient privileges to deploy things to Kubernetes.

Nice.

To show an example of this running: we have a few different linting steps here with Black, Ruff, and other linters, we have a testing phase, and then we have this deployment phase where it goes and deploys your job to Argo.

We've also integrated with Kubecost, and it was kind of happenstance that Kubecost had a mechanism for this. For context, Kubecost is an open source project that enables you to filter based on different pods or tags, and then it pings the AWS cost API to estimate the cost of your running pods. We installed Kubecost, and there's this concept called allocations: you're able to map allocations to tags and Kubernetes objects. Because the project tag you set when creating a flow maps to a particular tag within Kubernetes, we just added that tag, and now Kubecost is able to track all of our flows based on our projects. As an example, this is someone who had a project called first 2023, and you can see here that it tracks the CPU cost, the RAM, the persistent volumes, etc. It's kind of cool in that it will also show you the increase and decrease of cost over time, as well as your efficiency.

That was actually the sum of my demo. It went a little faster than I anticipated, but if there are any questions I can dive deeper into any part of this. Thanks for watching.

Questions, anyone? Feel free to either chime in or send them on Zoom chat. I should also keep an eye on ask-metaflow; if you have questions you can ask there on Slack. I have lots of questions, but if there
are people who have more questions, feel free to ask them first.

I have a quick question about Kubecost. Does it also track S3 costs, or is it mostly the Kubernetes cost?

By default, no, because S3 isn't directly associated with a Kubernetes object.

Yeah, okay.

With S3, I mean, one of the reasons we did Kubecost is that there's not really a native way in AWS Cost Explorer to pull out the granularity of pods within Kubernetes, but you should be able to do that for S3.

A quick question, okay. Maybe I missed it, but in the deployment process, how are you managing it if you delete a workflow, for example, or you want to rename it? What's the cleanup process, or how do you manage state with what's deployed to Argo Workflows?

Yeah, so there's a UI for Argo, and that's where we go. If you go to the Argo UI, you can log in and see the state of all of your cron jobs, and that's where you can go to edit things.

Just a quick follow-up: if you want to delete the deployment, you would just go to the UI and delete it there?

You can do that. There's also an argo-workflows delete command in the Metaflow CLI. We don't have that integrated with our GitLab because it's not super common that people fully delete things, but if you wanted to, you could delete it with your flow file.

Okay. It's not related, but it might be of interest: we've also installed Kubecost, and AWS actually has an integration with it now that means you can keep more than 15 days' worth of data.

Is that the one that integrates with the S3 bucket that tracks your costs?

I'm not sure. I don't think so; I think you can add that on. But it came out last year. Originally you couldn't do it, but it came out late last year and now you can. I'm happy to send you a link offline.

Yeah, the link to the docs would be awesome.

Hey Tyler, great presentation. I have a question on Kubecost and
the Argo deployment piece. First, are you deploying your Metaflow jobs across multiple clusters and then tracking these costs across multiple clusters, or are these within a single cluster? And the second half of the question is: how do you manage to deploy your jobs across multiple clusters and track them using Argo as well? Can you talk a bit about that?

Yeah, so all of our jobs are deployed on a single cluster, and I think that answers the second question as well, because we don't deploy to multiple clusters.

Okay.

I think Kubecost is kind of a freemium model. There's a SaaS offering that integrates with multiple clusters, but by default Kubecost doesn't have an integration with multiple clusters.

Tyler, a question about the data scientists' experience of using this model: what is their feedback on it? I'm guessing JupyterHub is something that many people are familiar with, so they're happy to start there. But everything following that, using Metaflow, using Argo Workflows on Kubernetes: what is the data scientists' take on how that setup is, what they like about it, and what they may not like as much?

It's kind of like that acceptance curve, where you have the early adopter and late adopter side of things. Some of our data scientists who enjoy learning new tools and trying new techniques were really eager to try it out and really enjoyed it. I think one of our most positive pieces of feedback came from someone who realized that an entire module he had created to track models and version them was now obsolete because of the artifact management in Metaflow. He was like, "Oh, this is so cool, I don't have to manage all of this myself anymore." So that was really cool. On the less positive side, there's obviously a learning curve to Metaflow as well as Kubernetes, and lots of random things pop up, particularly as you scale up, and so
those things, I think, are a little harder for our data scientists, because they don't really have context on why these errors are happening. As for the production flow and how we set that up in GitLab, we designed that hand in hand with our data scientists over a few different meetings: "Hey, are you okay with this? What do you want to have happen?" And that's what eventually led to our GitLab setup.

Got it. And any user today can directly run python flow.py argo-workflows create, and it'll create the Argo workflow for that flow on Kubernetes, or the same could happen through GitLab, right? So whether you create it interactively or through GitLab, is it still the same, or is there a difference in the environments, namespaces, whatever?

Yeah, so we did namespace it. If you look here, we set the namespace to production. This both puts it in a different Metaflow namespace, which is the project concept within Metaflow, and in a different namespace within Kubernetes, because we wanted to have different limits so that we wouldn't have someone testing something out blow out the capacity of the cluster so that our production jobs couldn't schedule anything.

I see. So when you... yeah, go ahead.

By default within JupyterHub, if you do an argo-workflows create, it will create it within the development namespace. You could hardcode an override if you wanted to, but people don't really. And by default, GitLab deploys to prod.

Okay. So the production namespace has, I guess, better monitoring and better guardrails around it for limited resource utilization, so that it can run a production workload, whereas dev is kind of experimentation-oriented. Is that right?

I wouldn't necessarily say it has any different kind of monitoring. It's just that within Kubernetes, namespaces are pretty isolated from each other by design, so having a different namespace enables us to isolate the compute usage of the development side of the house from the production side of the house. But it still uses the same Metaflow UI and Slack integrations and that kind of thing.
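
The dev/prod split and the CI deploy step described in the talk can be sketched roughly as follows. This is not DTN's template; it is a minimal illustration under stated assumptions. The naming helper mirrors the scheme documented for Metaflow's @project decorator (a personal user branch by default, prod when the --production option is passed), and the command builder assembles an argo-workflows create call with a --tag option. The project name, flow name, file path, and the git_sha: tag format are hypothetical, and the separate Kubernetes namespace mentioned in the talk is not modeled here.

```python
import shlex


def project_branch(user, production=False, branch=None):
    """Mimic the branch naming documented for Metaflow's @project decorator."""
    if production:
        return f"prod.{branch}" if branch else "prod"
    return f"test.{branch}" if branch else f"user.{user}"


def deployment_name(project, flow, user, production=False, branch=None):
    """Name a deployment as project.branch.FlowName (illustrative only)."""
    return f"{project}.{project_branch(user, production, branch)}.{flow}"


def deploy_command(flow_file, git_sha, production=False):
    """Build the argv a CI job might run to deploy a flow to Argo Workflows."""
    cmd = ["python", flow_file]
    if production:
        # Top-level option that is available when the flow uses @project.
        cmd.append("--production")
    # The "git_sha:" tag format is a hypothetical convention, not DTN's.
    cmd += ["argo-workflows", "create", "--tag", f"git_sha:{git_sha}"]
    return cmd


if __name__ == "__main__":
    # Interactive deploy from a notebook server: lands in a personal branch.
    print(deployment_name("fuel_demand", "ForecastFlow", "alice"))
    # CI deploy from the main branch: lands in the production branch.
    print(deployment_name("fuel_demand", "ForecastFlow", "alice", production=True))
    print(shlex.join(deploy_command("src/forecast_flow.py", "3f9c2ab", production=True)))
```

Keeping the production switch in the CI job rather than in the flow file means the same flow code deploys to a personal branch from JupyterHub and to prod from GitLab, which matches the behavior described in the Q&A.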

Original Description

Tyler Potts is a Data Science Platform Lead at DTN. DTN leverages Metaflow with Kubernetes for building a pre-configured and collaborative JupyterHub data science platform. This setup comprises automated pipelines that facilitate seamless deployment from GitLab to Argo Workflows. These pipelines ensure that workflows are source-controlled, schedulable and effortlessly redeployed. Kubecost is used to track the costs of flows that use the @project decorator. Discover more such stories at slack.outerbounds.co

DTN's data science platform leverages Metaflow with Kubernetes to facilitate collaboration and automated pipelines, enabling seamless deployment and cost tracking. This setup allows data scientists to focus on building models and driving business decisions. By using Kubecost, DTN can track the costs of flows and optimize resource utilization.

Key Takeaways
  1. Set up Metaflow with Kubernetes
  2. Configure JupyterHub for collaboration
  3. Automate pipelines using Argo Workflows
  4. Deploy workflows from GitLab
  5. Track costs with Kubecost
  6. Isolate development and production compute in separate namespaces
💡 Automating pipelines and tracking costs can significantly improve the efficiency and effectiveness of data science collaboration and workflow deployment.
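
As a rough illustration of the cost-tracking takeaway, the sketch below builds a query URL for Kubecost's Allocation API, aggregated by a Kubernetes label. It assumes a Kubecost instance reachable at a local port-forwarded address and assumes flows carry a label named project; both the address and the label key are hypothetical, and the talk does not show which label or tag DTN actually maps. The code only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode


def allocation_url(base_url, window="7d", label="project"):
    """Build a Kubecost Allocation API URL aggregated by a Kubernetes label.

    The label key is an assumption; use whichever label your flows carry.
    """
    query = urlencode({"window": window, "aggregate": f"label:{label}"})
    return f"{base_url.rstrip('/')}/model/allocation?{query}"


if __name__ == "__main__":
    # Hypothetical address, e.g. a local port-forward to the Kubecost service.
    print(allocation_url("http://localhost:9090"))
```

Aggregating by a per-project label is what lets cost be reported per project rather than per pod, which is the granularity the talk says was hard to get from AWS Cost Explorer alone.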
