Data Meets Machine Learning - Marc Millstone
Key Takeaways
The video discusses the integration of data and machine learning using tools like Metaflow and dbt, with a focus on supply chain management and fraud detection in the trucking industry.
Full Transcript
[Applause] Hello everybody. Thank you all for having me come to talk to you all tonight. Thanks to VA for putting up with me the past couple of years; we've worked together on a lot of this talk, which I'm excited to share with you all. So before I get into what Flexport does and how we do machine learning, data and AI, I just want to talk a little bit about my team and my background. I own all core platforms at Flexport: data, machine learning, AI, streaming, edge, everything. It's pretty cool, pretty big. And Metaflow is sort of a foundational part for linking the parts together. That's what we're going to talk about today.

So, a little bit about me. Today's going to be a little less technical of a talk. We're going to be talking about org structures, and how you have to think end to end about the data flow in your systems to be able to make any real insights possible. I got to this place, I guess, a little differently. My background is I'm a researcher. I did work in high performance computing, large optimization, and deep learning a long time ago. I was a really bad researcher. I didn't like writing papers. I just liked solving problems. Things were hard back then. We had to be across the whole stack, from the networking to the compiler. We managed all the compute ourselves. I remember in a fit of desperation I compiled Mesos, from Berkeley, just to try to get something that was easier than managing all these nodes.

So fast forward about 10 years. In 2016, around the same time that Metaflow was being built at Netflix, I was at the Allen Institute for Artificial Intelligence. We were doing, and still are doing, cutting-edge research in deep learning, and at the time we were building what became one of the first large language models. It's called ELMo. We were looking around and saying, how do we help train these systems? Our researchers need to do reproducible research.
They need to do it easily, and GPUs are really complicated. And so we built one of the first true training platforms. All we did is training. If you squint your eyes a lot, it looks almost exactly like Metaflow. The only difference is that you had a Python-first UI, whereas we focused a little bit more on containers. It turns out that when we looked to spin this out as a company back then, no one thought machine learning training was important. I guess, look around today, right?

And finally, and we'll talk about this in a minute, when I first joined Convoy a few years ago, which was acquired by Flexport, I learned that I was owning a team that had a unique structure. I owned both the data and machine learning platform teams. As I've talked to all of our partners, from VA to other companies, this is a little different. That meant that we saw some of our problems a little differently. We saw similarities where other people did not. And so let's get into it.

So, Flexport. Flexport owns an end-to-end supply chain, all-in-one platform. We get things from the factory floor all the way to your customer's door. And it's pretty awesome. Convoy is within Flexport and it's a marketplace for trucking. We're world class at how we take and work with smaller trucking companies, two or three trucks each, and then in aggregate make them as powerful as the largest. In today's talk I'll focus on the Convoy side, which I'm just a little more familiar with.

We have multiple ML and AI systems across our platform. We have deep learning based cost prediction models. We have relevant scheduling, all the things you'd expect in our space. Because we work with these smaller carriers, fraud, risk, and compliance is critical. We've invested a lot into this space, and I'll give some demos in a minute. And then finally, the classic analytics space. However, in our analytics space, we're getting more and more pushes to be faster. We can't move at the speed of data pipelines.
I won't be talking about that today, but I'm happy to talk about it later as well.

So, fraud is one of the biggest things in trucking. People want to go steal stuff and it's really hard, right? We have to prevent it. This is a weird chart because there are two axes, on the left and right, but these are sort of the industry events, fraud, stolen trucks per quarter, and the red line is Convoy. We've invested a lot into this fraud space because we have to. These solutions mix first-party data, third-party data, machine learning models, deep learning models and network analysis. As you see back there, there was a large fraud ring many years ago, and we were actually able to get out ahead of it because we started seeing the behavioral and network effects within our carrier base. Then we were able to work with our partners and other logistics companies to go solve it.

So now we get to a little more abstract stuff. What I have here is basically a list of teams that have to work together to truly understand end-to-end machine learning. These teams don't involve my teams. These aren't the infrastructure people. These are the builder personas. So my question for people here is, how many orgs do these teams sit in in your organization? How many different groups? At ours, I own three of the four infrastructure teams that support this, which I think is pretty unique. But they're all over the place, right? Some are in finance, some are in an integrations group, some are in the BI team, some in the science team. They all have different backgrounds, different histories, and traditionally haven't really worked together. And so they'll use different words for the same thing and have different, incompatible solutions for the same type of problem, but they don't really talk about it.

So, of all the fake laws, Conway's law is one of the ones I'll go with today: your systems match the org structure they're built in, right?
So, if the data team and the machine learning team have never talked, their solutions are going to be different, right? How do they work together? When you see this, what you'll then see is that there are sort of mutually incompatible technologies. And I just threw Kafka in the middle because it's always in the middle of every data platform infrastructure space. So let's just put it there. Somebody has to own it.

So looking on the left and looking on the right, what you'll see is that on the data side, they have Airflow and Dagster and Snowflake and Spark and dbt. They are really good at doing data orchestration. They're pretty easy to use. On the right you'll see Metaflow, Kubeflow, starflow. There's the whole real-time stuff with Feast and Tecton, etc. And they're pretty easy to use if you're a scientist. When you think about it, though, the data team can use the right side, but for the right side it's really hard to use the left side. It's hard to iterate quickly if you're using one of the classic data orchestrators.

So we use Metaflow to bridge these domains, and we're going to talk about how we do this. On the left you have Airflow, Dagster, Snowflake, Spark. On the right you have Metaflow and other things. We're going to go with Metaflow. We want to have a system where builders work in the technology they want to use. So our data team works in dbt. Our science team works in Metaflow, and they all work together.

The way we do this is first we have to get dbt to work in Metaflow. It's a little bit of work to get that done. I'll show how we do it. We have one flow spec for every simple DAG we have that has a branch, and we have deployment-time configuration. We worked really closely with the team to help make this work. We've integrated metrics into the thing. We have some dbt-specific cards because we want to see what works and what doesn't, and we want YAML-based configuration.
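To make the idea of running a dbt project as an orchestrated DAG a little more concrete, here is a minimal sketch, not the code from the talk. It assumes the usual layout of dbt's `manifest.json`, where each entry under `nodes` lists its parents in `depends_on.nodes`, and it derives an order in which models could be scheduled as steps. The function name `dbt_run_order` and the model names are made up for illustration.

```python
from graphlib import TopologicalSorter


def dbt_run_order(manifest: dict) -> list[str]:
    """Order dbt models so every model comes after the models it depends on."""
    models = {
        node_id: node
        for node_id, node in manifest["nodes"].items()
        if node.get("resource_type") == "model"
    }
    # Keep only model-to-model edges; anything that is not a model is ignored.
    graph = {
        node_id: [dep for dep in node["depends_on"]["nodes"] if dep in models]
        for node_id, node in models.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Hand-written stand-in for target/manifest.json; the model names are made up.
manifest = {
    "nodes": {
        "model.fraud.carrier_risk": {
            "resource_type": "model",
            "depends_on": {
                "nodes": ["model.fraud.stg_carriers", "model.fraud.stg_loads"]
            },
        },
        "model.fraud.stg_carriers": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.fraud.raw.carriers"]},
        },
        "model.fraud.stg_loads": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.fraud.raw.loads"]},
        },
    }
}

order = dbt_run_order(manifest)
print(order)  # the two staging models come first, carrier_risk last
```

An orchestrator could then map each entry in `order` to one step, or run independent models in parallel branches; how Flexport actually does this is not shown in the transcript.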
Let's look at an example here. This code is useful. It's not going to compile on your computer though. I wanted it to fit on a screen. The first thing we want is a decorator so that we can actually say, hey, for this DAG we want it to run on a schedule, or on an event, or when another DAG finishes. So we built some custom decorators for this. Then it actually turns out it's pretty simple. This is the whole flow for a generic dbt compiler, or DAG runner. There's some code here for the dbt DAG flow. How that works is you run dbt, you get the manifest, and you then parse that to turn it into a data pipeline. So we can effectively take every project and turn it into a DAG. Here are the flow steps that get created sort of automatically, and here's how they run.

Our builders are analysts. We want them to think in terms of text that's easy to understand. This is an example of the three DAGs needed to run to support the fraud cases I had. There's the DAG name, here's the owner, how it's triggered, and then you tell us which models you want to run. There's a selection function, so there's a little dbt macro that you can write that says, hey, this is where it goes. We just pull that all out for you.

Here's a dbt card we built. If we have UI people who want to partner with us, we'd love to make this prettier and more useful. But look, it runs all the stuff. It tells you the success. You have the runtime. You see all the different tests it runs. You make sure they pass. If they don't pass, they'll send an alert off, with metrics integrated with Datadog and then PagerDuty. And our builders don't really even know where it runs. They just have to go look at the UI if they care.

Events are critical to both these flows and to the unification of data and machine learning. I think we've all been in systems where you have what I call dueling cron jobs.
The data runs at 6 a.m., and we know that on average it takes 48 minutes, so ours runs at 7 a.m. Then the model trains. But then daylight saving time happens and one system updates and the other doesn't, or maybe something fails, and you're training on stale data. If you've been there, you know, right? Oh, yeah. Exactly.

So, this is how we use events. Every flow emits events. The dbt and ML models react to them. And then the next step is we trigger into the service platform team. When the machine learning model finishes training, we send another event. We then pick it up, have a Lambda that calls out to CircleCI, does some things, and then we're deploying the model, either as the replacement model or possibly as a variant. You don't always want to deploy a new model. You want to run multiple.

So, examples of the code again, how this works. We first go and get the data. This is an example where we go get some 3P data on a schedule. When this finishes, we kick off all the DAGs. When all the data finishes, we then kick off the training model. The train model says, "Hey, is this a new model or is this a variant?" and launches that to the deploy step. And then we have a step that we own, which is the actual deploy-to-production step that manages all the interactions with the deploy CI/CD pipeline.

So for the last little bit of the talk: I've been talking a lot about the data warehouse, but I want to talk about moving outside of it, because there's a world bigger than the data warehouse here, right? Data ingestion. There are lots of tools out there. They go and get data from a source: Stripe payment balances, PagerDuty, Jira. You want to do analytics or models on them. You want to go get data from a third-party API you need to use to build your model.
In our case, maybe FreightWaves data we use for pricing information, or CAB, which is the Central Analysis Bureau, I think, for carrier risk, crashes, carrier information. Most of the time these are owned by a completely different team, and again, when that data breaks or is slow, there's a problem. Our data pipelines are now running with the wrong data and our machine learning models are running with the wrong data, and that's not very cool.

It turns out we can actually pretty easily shift this into Metaflow. The first example I have is Greenscreens, another wonderful partner of ours. They do real-time truck pricing data, so we can use that to start training a lot of our models. Again, this is simplified here, but in this step we go get a bunch of shipments we care about from the past. We then go out to the API and figure out what the prices were in the past. We write it back into Snowflake. Now our model can kick off and start running. It's a fairly simple behavior.

And what we've been playing with for the past month is how we can make it very easy to use PyAirbyte for all the third-party systems where they have open source connectors. If you don't know it, Airbyte is a really awesome open source product. Alternatives are, I think, Stitch and Fivetran and things like that. They're very low-code ways to say, give me data from Jira and then write it to this other place. They've been wonderful, and the way they built their system is that all their connectors are just code. You don't have to run any infrastructure for them. This is perfect for Metaflow. So we built an Airbyte manager. Our builders, with a little bit of code, can say, hey, Airbyte (actually they're not calling Airbyte, it's just using the code itself), I want to go get data from Jira. Here's my configuration. Here's the team I want. Here's the stream I want. Then we can just go execute it for you.
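As a rough sketch of what such a configuration-driven ingestion job could look like, the example below lets a builder declare a connector, its configuration, and the streams they want, and a small runner executes it. This is an illustration under stated assumptions, not Flexport's manager: `IngestJob`, `run_ingest`, `fake_read`, and the table naming scheme are all hypothetical, and the reader is a stand-in for the connector code (in practice that role would be played by an Airbyte connector run through PyAirbyte, with the writer targeting Snowflake).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class IngestJob:
    """What a builder declares: connector, its settings, and the streams wanted."""
    connector: str
    config: dict
    streams: list[str] = field(default_factory=list)


def run_ingest(
    job: IngestJob,
    read_stream: Callable[[str, dict, str], Iterable[dict]],
    write_rows: Callable[[str, list[dict]], None],
) -> dict[str, int]:
    """Read each requested stream and pass its rows to the warehouse writer."""
    counts = {}
    for stream in job.streams:
        rows = list(read_stream(job.connector, job.config, stream))
        # Hypothetical naming scheme: one table per connector and stream.
        table = f"{job.connector.replace('-', '_')}__{stream}"
        write_rows(table, rows)
        counts[stream] = len(rows)
    return counts


# Fakes so the sketch runs without network access or a warehouse.
def fake_read(connector: str, config: dict, stream: str) -> Iterable[dict]:
    return [{"id": 1, "team": config["team"]}, {"id": 2, "team": config["team"]}]


warehouse: dict[str, list[dict]] = {}

job = IngestJob(connector="source-jira", config={"team": "fraud"}, streams=["issues"])
counts = run_ingest(job, fake_read, warehouse.__setitem__)
print(counts)
```

Keeping the reader and writer as plain callables means the same job declaration can be exercised in a test with fakes and run for real inside an orchestrated step, where a failure surfaces in the same system as the downstream pipelines.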
Again, the data is getting to the same place in Snowflake as it was before. But now, because it's getting there in the same system, we can be reactive. We can understand when things break. And again, if you work on a data team, think of the amount of time you spend saying, why is my data not right? Why is this model not training well? Being able to trace the world all the way upstream to find it is crucial.

Which then leads me to the last thing. Now, Metaflow is in the story here because we just use it to manage some repetitive jobs and tasks and things like that. But I think it's important to understand the end-to-end data to machine learning story.

So, logistics is a really complex business. Our data is not very big, right? It moves pretty quickly, but even for trucks, on the fastest data, every five minutes is probably good enough, with a few exceptions. But in trucking, I remember the second week I was at Convoy, it was confusing. We had six different copies or versions of fuel price. I was like, why do we have five different versions of fuel price? And it turns out that they're all exactly correct in one very specific domain and completely useless in some other domain. And we've had like two COEs when people mixed up the wrong one, because when they're in the data warehouse, they go use the one they think they're supposed to use. They don't know where it comes from, and so they forget.

So we've been working on taking this lineage, and there are a lot of tools for this, and extending it outside of the data warehouse. These are the DAGs that power the compliance and fraud risk work. What we're doing is we're using OpenMetadata. Pretty simple open source tool. It has a really great API. So what can we do with this information? At the very end here I zoomed in, and these are the final three tables. These are the ones that we use to either build models or power analytics.
And then there's a whole mess of stuff there, which I'm going to zoom into in a minute. What I want to show is that on the far left we've extended outside of the data warehouse. You can see here now that we're getting this data from a third-party source. This picture is out of date, I haven't labeled this yet, but we can say, hey, this comes from the CAB thing. It comes from this Metaflow data integration. This is where you go get it. This is how you know where it comes from.

And the other big thing we just sort of have to accept in this world is that most data is stolen from production systems' databases, hopefully by letting the engineering teams know that we're stealing their data and breaking the first rule of encapsulation, right? CDC is a thing. We'll never get rid of it. So we extend the lineage to the Kafka stream. In our case, we're using Redpanda. Then we go from the Redpanda stream to the Postgres database. So now our builders, when they're saying, why is my data not working, can click a link over here and chase it all the way upstream to all the different sources, to the production teams they have to go talk to to see what changed. This sounds small. It sounds simple. But it actually solves one of the hardest problems we have to face, from my point of view, which is: who owns this data? Where did it come from? What does it mean? Finding that out just takes time.

So streaming data is also foundational to our system. Like I say, we have to move at the speed of the business, not the data pipelines of the world. A year ago, a carrier pulls up. We give them a score on their trip. It's probably a really good score, but they don't find out till 12 hours later. And then sometimes it's a bad score, but we know it's a bad score already, right? Because we can see the production systems. They don't know it's a bad score yet. And then they can't book their next load. And it's like, what's going on with that?
Using things like Redpanda, RisingWave, and Iceberg, we're moving a lot of that logic outside of the data warehouse. This is a completely different talk we can talk about tonight or next year. But here again, Metaflow is in the middle, because we're using it to solve problems of data model drift. When you think of streaming data, you still want the batch compute as well. There are times you want to change something and you want to do a big back-compute across everything. So you want to do it in Snowflake; you still need it. So with Metaflow, we can say, take this dbt model and apply it to Snowflake, then also make sure that the version that's applied in RisingWave is the same. It is, again, just foundational to how we make the teams work together. And with that said, one minute left. Thank you all very much. Questions?
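The talk does not say how the "same version in both engines" check is done. One simple way to think about it, sketched here purely as an assumption, is to fingerprint the SQL of each model as applied on the batch side and on the streaming side and flag any mismatch. The function names and the sample models are hypothetical, and real SQL dialect differences between engines would need more than whitespace normalization.

```python
import hashlib


def sql_fingerprint(sql: str) -> str:
    """Hash a SQL string after collapsing runs of whitespace."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(batch: dict[str, str], streaming: dict[str, str]) -> list[str]:
    """Return model names that are missing on one side or whose SQL differs."""
    drifted = []
    for name in sorted(set(batch) | set(streaming)):
        if name not in batch or name not in streaming:
            drifted.append(name)
        elif sql_fingerprint(batch[name]) != sql_fingerprint(streaming[name]):
            drifted.append(name)
    return drifted


# Made-up model definitions as applied in the batch and streaming engines.
batch_models = {
    "trip_score": "select carrier_id,\n  avg(score) as score\nfrom trips group by 1",
    "fuel_price": "select * from fuel",
}
streaming_models = {
    "trip_score": "select carrier_id, max(score) as score from trips group by 1",
    "fuel_price": "select  *  from fuel",
}

drift = find_drift(batch_models, streaming_models)
print(drift)  # ['trip_score']
```

A check like this could run as a step after each deployment, so a model changed in one engine but not the other is caught before the batch and streaming results diverge.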