Telematics with Metaflow: How Nirvana Insurance built a large-scale Risk Estimation platform
Key Takeaways
The video demonstrates the application of Metaflow for processing telematics data in the insurance sector, specifically for building a large-scale risk estimation platform at Nirvana Insurance, utilizing machine learning models and CI/CD systems.
Full Transcript
uh we are going to hear today from sarak Agarwal from Nirvana Insurance Sak joins us from bangaluru India he's a software engineer for uh working at Nirvana insurance for a year and a half now before joining uh Nirvana sarak did two internships with them uh he also and then interned at Samsung Electronics in bangaluru as well as in South Korea uh he Sak has a btech degree from IIT Delhi um and for those of you not from India I is India's topmost engineering institution uh people like sarak sinin from have been there and some of us have aspired to be there but we couldn't make it uh and uh so sarak welcome to metaflow officers and over to you uh thanks Shri for the introduction so I'll just uh maybe quickly share my screen before I start yes please uh let me know if you can see it yes uh great great so uh just before I start actually I would like to take a small minute um to really acknowledge and appreciate all the effort outerbound team is putting into this project and you know engaging with the community Through Your communication channels uh and just here I like you guys have always been there in this SL whenever like I or anyone else has posted queries feedbacks uh even so just really want to call that out uh and I know you now have a cloud offering as well but still the force on the open source and the self boosted version has been running strong so really appreciate you for that it's really impressive and exciting so thanks for all of that so so with that um I would like to start with the presentation so we can start with maybe a quick introduction on like what is the problem we are solving here at Nana and how does metaflow figure into the picture here so to start here um basically an introduction about who we are so we are an insurance company and we insure these commercial autofield businesses in the United States uh so in the in the insurance so uh we use telematics data to better understand the risk of the insured now the telematics data basically means sensor data coming from the vehicle such as GPS accelerometer safety events those sort of data now in the insurance industry uh the bottom line is determined by something called loss ratio uh so loss ratio is basically uh claims paid over the premiums received so it's a it's a measure of your bottom line and the only way to optimize this bottom line is to more accurately predict risk that a potential insured uh poses to you so the more accurately you are able to segment that risk the more accurately you are able to uh derive this ratio lower and in turn this means your bottom line is better so that is where the data science comes into the picture because to estimate the risk um you need to train some models on it um statistically you need to this is a statistical exercise estimating the risk figuring out like how much uh would be be paying in claims for this insur potentially as compared to the premiums we receive from them of course it's a statistical exercise so that is where the data science figures in now technically speaking um each customer in our segment has like these 50 to 100 trucks in their Fleet and uh for like an accurate risk prediction we need at least two years of driving history uh two years of driving history means like two years of sensor and all of that now this figures out to be um roughly like tens of GBS in volume because tetric is sampled at a very high frequency at 10 to 30 per minute so that's roughly 2 to 5 Seconds um once every 2 to 5 Seconds so it's actually pretty huge volume data per Fleet um and we then need to process this data to generate featur such as aggregate miles driven per state or per geographical region uh region could be like uh our defined and then maybe radius of operation like how far does the operation of this Fleet expand maybe things like how frequently do they take sharp turns harsh breaks how frequently they speed so uh like you get the idea basically there could be a large number of features that we may be interested in and all of these features need to be computed after processing this data which turns out to be expensive uh because of the volume uh finally we have the ml model which converts these features into a risk PO for classifying these plit so classification in terms of uh classification in terms of like loss ratio basically so so the key challenges we are really solving for here is uh basically Automation and scalability so automation because we need like very fast turnaround times from the application State uh you can hear me right yes we can I I just was about to type a message like is it if you have questions should I should I like speak about it right away or do you want to take questions at the end feel please feel free to interrupt me at anything yeah I that's perfect so one quick quick question about he won't stop once he starts no I'm kidding come on like don't scare him no so one question about uh for the for calculating the risk score like you brought about all these different features that you want to base the model on and those features are based on data the telematics data if a new company is joining you isn't there like a cold start problem where I'm a new company I want to join you but then I may not have the telematics data with me just yet so how will I give you right right right so so conveniently United States has this legal mandate that requires uh trucking companies at least to compulsorily have telematic devices attached to them that poar problem solves itself so but then that data is so if I'm a if I'm a a trucking company and I have a fleet of trucks with me I have this device attached to all my trucks and is the data recorded and available with me or is it in some like I mean I don't know S3 bucket like good question good question so there are thirdparty companies which we call tsps so telemetric service providers uh so some of the common names are for example samsara go motive so these companies basically sell these devices these sensor devices that these trucking companies buy and as a part of their offering they also offer a a basically a s offering for third parties like Nirvana to pull that data on on behalf of the customer so the customer gives us an O consent or some other form of consent that we can then show to their tsp that hey we have received permission to pull your data and that's how we pull data wow okay I was not aware of this this is pretty cool yeah yeah so so in terms of like challenges basically Automation and scalability as I was saying is key for us automation because uh the turnaround times need to be optimized so from the day the customer gives us the consent to pull their data to the day we are able to generate a quote for them a prelim a preliminary quote at least and scalability is simply because the data volume per customer is huge so unless we account for that we are going to run into very large uh turnaround times both of these are sort of interrelated now because of the nature of the work these pip turned out to be pretty complex and multistep uh because there are a lot of features and some features are like uh some features require some intermediate features to be generated first and so on so because so this makes orchestration inly difficult uh if we are not using any frame so we cannot just like write a python notebook and get away with it we need some sort of like a well- defined orchestration framework uh that allows us to declare these requirement um and we also uh for scalability we also need to parallelize the computation as much as possible fortunately we see like multiple opportunities here so you could paralyze the data processing by vehicle Maybe by time Windows those sort of things so uh so so that is where an orchestration framework like meta flow fits into the picture um but there are a lot of orchestration Frameworks um so why we chose metaflow in particular uh so we were looking for basically something which should have a zero or low dependency on engineering or data scientist should not have to Circle back U with the engineer obtain like permissions or depend on engineering for anything either deployments management writing so we wanted to make it as self Serv as possible um writing pipeline should be simple we don't want to spend too much time on boarding data scientist on like our own bspoke framework so we wanted it to be simple uh do on board from engineering point it should have low maintenance overhead but we so this typically implies we choose something like a managed uh deployment uh a cloud offering but we sort of preferred self hosted because it allows more flexibility in terms of pricing and what you can do with it so so metaflow sort of take that box that it's easy to deploy and also self hosted and finally we were looking for something serverless so we only pay for the compu time we use um in in metaflow the only only fixed cost I think is the database the RDS that we run for and everything else is completely elastic so we like that so so so this was basically our tenets for choosing a framework and meta and like among the Frameworks at least we studied metaflow sort of was the only one that really stood out in this regard so we decided to give it a try and it has now been like one and a half to two years and it has been working great so far uh so so in quick sorry one one one question here about so was this mostly driven these requirements kind of sort of driven mostly from you sort of the organization and the engineering point of view or was it more like you know data scientists kind of sort of gave this as one of the things they were looking for where they said you know what you know we like python we want to be in the world of python and that's it like we don't want to be dealing with too many other things it should be simple it should be straightforward I guess self hosted or not maybe they may or may not care I don't know um but but like what right yeah so a good question so it was sort of a mixture of both so for example the first requirement was a strong ask from data scientist because they've seen in their past companies uh these things go wrong so they knew this time they to ask for the straight away um things like self fored in serverless it was more of a engineering and companywide decision that we want to stick with that but but we were flexible on that fortunately we found metaflow which allowed us that luxury but yeah so that was how many data scientists are involved in this maybe today I understand the decision when it was taken they might have been fewer but how many data scientists are involved today that actually use kind of metaflow here so even today we only have three data scientists in the company and the data platform team is actually also just two people so it's a small team really sure sure and that's why that's why like the low overhead and serverless these things are important to us because we don't have the human resources to invest heavily at this point of time yeah Mak so this is like what our Cloud stack sort of looks like so we are using metaflow over so all these components are by the way glued together by metaflow and we are using our deployment is over AWS so earlier we were using fargate as the compute uh cluster but then we over time we switch this year we switched to ec2 as a compete cluster the reason was that U so our Docker image is quite heavy it's like three or four GBS so farget takes like two two and a half minutes to just start to just download the image in start which is expensive and so so that was the primary reason to switch to ec2 so we can so we can basically cach images in the instance so so at least the subsequent steps are fast even if the first step still takes two minutes to load and also E2 allows us to leverage spot instances um farget also allows SP instances but2 also has reserved instances uh for better pricing and if you can nail down the reservation ratio uh sorry utilization ratio by that I mean if your instan is four course how many coures you really utilizing for the computer then it turns out ac2 is also cheaper than fargate but you need like a really high utilization for that I think over 60 or 70% we just touching that right now nice finally I think if you need gpus then you would need ec2 but right now we are not really into that directory but we just feature grouping at this point so um what this diagram does not deal is a few missing pieces so this was about compute like how theoretically compute would work but if you look at that mlops as a process uh in an organization from start to end then there are quite a few missing pieces that we need to take care of so I would like to spend the rest of the presentation talking about these and how we solve internally for some of them and how some of them are still unsolved so the UI dasboard was F first thing we needed some sort of a a landing page or a web page where data scientists would monitor their runs um fortunately met FL already provides a UI service for that which we are using and it has been working I think it works great so far we don't really have um requirements beyond what the service already solves for then we also needed some sort of an API based um triggering or basically a cross language API for meta So Meta API is right now python only and for example the Engineering Services are in goang so they needed some way to trigger pipelines at least monitor their status cancel them and those sort of things um deployments was the other aspect so when you have multiple flows deployments got get slightly dangled and especially if you're looking for continuous deployments then you need some sort of Automation and reliability there rather than just using the CLI a Telemetry so metaflow by default does provide the UI dashboard does provide some monitoring like looking at the time it takes for each step to run but like more stuff like looking at the memory requirements of the step how much memory the step takes or just analyzing the time taken by step over some other dimension such as input size or the deployment version those sort of things is something we want to look at as well uh these are more the five six and seven these are more of a wish list from metaflow and we are like working on these internally as well so maybe some sort of a caching support So once a flow has run and it has computed some artifacts if you rerun the flow why do we need to recompute those artifacts again assuming we only use meta flows artifact store and not some third party store how can we do it today the subflows is also an important for modelization I will talk about it why we need it and finally garbage cleaning old artifacts and data so our metlow bucket is already hundreds of terabytes and we don't want it to run more than that so maybe some sort of garbage cleaning yeah so I'll I'll talk about it one by one so first the API so nothing out of the box I think in the core so we decided to write our own grpc service in Python that exposes this API as grpc for the teams to use it so this part is simple enough right you just write a grpc rer no but that's if if it's okay I want to touch base on this so what does this grpc rapper do because if I if there is a data scientist data scientist says okay the instructions are like meta flow as a flow spec you create a new class inherit the or extend the flow or implement the flow spec object the interface and then there you have your start what as many steps in between and that's that's your flow so where does grpc come into picture in this yeah so gfpc needs to be aware of of all the flows that are currently present in the system so so yeah yeah maybe I should have touched upon that so our GPC during boot it basically scans all the step functions that are available so it knows about them and it is sorry no no go ahead go ahead complete thought so the next thing is grp this grpc should also be aware about the input scheme of the flow the input that a flow needs to run right I think that's what you were talking about how does it know about that I'm I'm still curious like what does the so is the is there a service and I believe you mentioned something about written in goang or something that triggers a flow is that what it is doing so this is a python Service uh okay any any service in Goan can call make a grpc call to this service and this service internally resolves it into a metaflow API call the metaflow python API call got it got it got okay okay so this is I mean if I like drill it down even further this is a server process that is running correct presenting an API other services send grpc uh payloads or whatever to this service this service will process those payloads and convert it into metaflow uh whatever inputs and metaflow parameters and start a metaflow flow like trigger the actual flow run the flow runs and I guess the result is returned back all the way or something like that is that correct correct correct correct so starting so metap does not provide a programmatic to start up flow so we directly start this use the AWS API to start the step F yes so yeah already I made a note of this so let's I'll let you complete all of this but in one uh one option or one thing to discuss also would be uh if you use kubernetes uh or if you explored using kubernetes because metaflow offers the ability to trigger a flow based on an event uh and the event could be a web hook based HTTP event so you make HTTP request to a particular endpoint and that can trigger a flow uh and that but that's today based on kubernetes and you know some other components uh involved in this but we can talk about that later for now this is this requirement is is very uh and sort of Fairly common that I want to be able to invoke a flow programmatically whenever I want yeah yeah sure s go ahead I to extend to extend that question that Shri just asked um do you have use cases where uh you want to run your Flows at a Cadence or is it always that there is some Upstream uh event triggering that happens that you then want the flows to trigger on right right we we don't we don't need sched flows yeah we don't need schedule flows yeah only the Upstream events that trigger flows got it got it okay thank you yeah and even if we need schedule flows the weight works is that we have a schedu in Goan so that basically generates an upstream event to it so it's it's automatically solved for the schedule fls makes sense thank you yeah so so so I will touch more upon this uh but this is really not simple enough as this because uh as soon as you have an API you get contracts so now you have a responsibility to some other team to enforce it right your data scientist can no longer like make arbitrary changes to flows input schema introducing like new required inputs or changing type of some input because that is going to break some other service so as soon as you have API you have this contract so you need some way to enforce these contracts in in in a formal way in a more responsible way um so what this means essentially Is That Flow inputs and the grpc message U that our API ler is receiving they should be translatable and this translation should always be possible it should not be possible that a flow input is changed and now grpc message can no longer be translated to that flow input so so our solution for this is that um so so by the way before I touch upon this one way could be that you ask data scientist to also write Proto of messages along with the flow so now they have more responsibility they ensure that they right prot files the flow files and keep those in sync um but we wanted to keep the the overhead low so what we decided was that we will ask data scientist to just write a one thing which is yl specification for flows inut so this is like this image shows how it would look like so you would describe the inputs to the flow the source code of the flow and what uh whether the parameter is required or not what's the type that you expect from it uh we also introduce some custom types so here at the bottom you will see a type of S3 path so this is a custom type that is that is internal so it basically it just means S3 pointer essentially um so they write the CML specification and then we write a utility to generate Proto files as well as python stops both from this specification alone so that way both of these automatically remain in sync and we can have basically checks and CI to verify that this constraint always upholds so you cannot uh basically edit the Proto filers or your python file manually without breaking the C yeah and as long as sorry has it happened that someone made a change to the flow but forgot to change the yl file so from from the starting we sort of added the CI check so by design it has never happened yeah it was not yeah yeah yeah but it also I mean it's the same as like you know it's there are two sources of Truth uh and keeping them in sync requires some sort of like you know process level uh you know safeguards or guard rails in place to make sure that oh you I see this input has whatever this flow has these inputs and how does it compare so makes sense but good that you have yaml because yaml is certainly simpler to do that having to kind of sort of write Proto Buffs and then run the compiler and all that stuff yeah yeah so that's why we went with EML and we can our utility can generate Proto from it correct um and one more important thing that the U does it it expressly forbids making breaking changes to the schema so you cannot add a new required parameter after the initial deployment you cannot um you cannot change the type arbitrarily um those sort of things deleting a parameter is okay because we don't consider it as a breaking change um so um so but there are going to be breaking changes right you cannot really disallow breaking changes Al together so that's where we introduce versioning in our system so flows can have versions so the first version may have some set of parameters the next version may have a different set of parameters so what versioning allows us is to do gradual deployments so you can deploy both the versions at the same time and then slowly switch your Upstream code from version zero to version one as the as as that code changes so that allows basically zero downtime deployments even with breaking ches um so so I am now going to touch upon versioning and deployment so this whole discussion is now going to lead into how we deploy flows here because it's interrelated so we needed to allow for multiple versions to permit gradual roll out and for that we just use this project and Branch name feature of metlow to so so one version could look like ds. v.low name in know version could like ds. v1f flow name so and in Step function basically the step function name would look like that um from the deployment setup uh that result WR a couple of requirements so so this reliability is more of a good to have so deploy if and only if necessary so if you have multiple flows you don't want to manually look at which flows need to be deployed you just want to say deploy and should automatically figure out which flows need redeployment based on whether the source code has changed or not uh versioning uh so you should allow multiple versions to coexist at the same time um it should be safe so you should not allow deployment if there's a breaking change in the input schema and it should be like by Design so even by accident it should not be possible to deploy a flow in such a manner that it permits breaking changes and it should be uh correct in the sense that there is some notion of expected State and there is some notion of uh the present state of the system and if there is some uh diff between those you should correct it so if I expect some flow to be deployed but it's not deployed then this tool should take care of it um they should it should change the current state of the system to match the expected state of the system so so so we wrote this tool and we call it Tera flow because uh it's sort of inspired by this open source tool called terraform so terraform I I'm sure some of you might know about it it's like for managing Cloud infrastructure infrastructure as a code um so how it works is basically that we maintain the spec file the same spec file that I showed earlier the flows. ml file and it describes the spected state of the system like all the flows and all their versions along with their input schema that are expected and uh it maintains a state file in S3 which describes the last applied state of the system so um so similar to how terraform does it so once you deploy it will note down which all things it deployed and it will persistent in necess for for bookkeeping now the diff between the state file and the spec file represents unapplied set of changes so if you make a to spec your local spec file which is checked into the repository you are basically making some declarations or some changes to the expected state of the system so this diff between the expected state of the system and the current state of system as proed by the state file this represents the unapplied changes that you need to account for so first we first we verify that these unaccounted changes should be legal so illegal changes we right away block the pr your PR is not going to be merged if it contains illegal changes like adding a new required input parameter to a FL um those sort of things and uh once like you are ready the deployment is simply calculating this step applying it to AWS through the metaflow CLI so metaflow step functions deploy CLI and then again writing back to S3 your applied State optionally we can also detect drifts here so what can happen is the S3 State file and the step function State can drift but we are not doing that right now uh assuming like no one changes that functions manually but one question here so you mentioned about rejecting diff that contain illegal changes so illegal changes would be one where you added let's say the new required parameter so you added the new required parameter but your yaml doesn't have that that's an illegal change right you edit it to yl um so already deployed yeah no illegal changes you cannot add a new required input parameter because that is a breaking change to the schema so if you add a new required input parameter and deploy uh some existing customer is going to break because there's they are not uh they're not really sending that required input parameter currently but but then how will you make that change you also need to make a change in the yaml so that the subsequent Proto change also happens and so that subsequent so the way we do it so the way we do it is you introduce a new version of your flow in the EML EML allows multiple version right right so you a new version of the flow yeah and then so for some time two versions of the flow would be deployed and then you gradually switch your consumers to the new version once that is completed you can delete the old version if you want got it got it yeah Mak sense yeah so so that's uh how we do it uh so this is like how Tera flow looks like so so in the first line it says this flow will be deleted we don't uh allow redeployment deleted flows and in the second line it says this has been modified and in particular the code Source has changed then maybe some new optional parameter is added so adding an optional parameter is completely fine and in some other case uh I made a required parameter to optional that is also fine it's not breaking change and then lastly some new flow is going to be deployed so these are like all the changes that can happen and in in one go it will apply all these changes to uh the AWS and also update the S3 file um yeah so basically summarizing on versioning so no breaking changes guaranteed by Design which means API contracts are always up ah H um and if you really wanted to do some breaking change you we do a gradual roll out like add a new version deploy it switch consumers and then delete the older version uh so so that's really about the deployments and the versioning part I am now moving to Telemetry but if anyone has questions cool go ahead so as I was saying UI dashboard does show like time taken and it's actually a very very nice timeline view which I which we like um but we need more metrics so like memory Network usage we we need to see where the flow is getting bottl neck is it the network bandwidth especially if you're running on ec2 instances right the same instance could be running multiple steps at the same time and now you don't know if the bottleneck is the CPU for the instance Network B from that instance and we also need to like observe the distribution metric along maybe like some user defined variables it could be the deployment version the size of the input data set we wanted to study those so what we did was we basically created our own user space decorator which reps a default atate step decorator with some tary rappers so we were we are using open tary for it but the but the problem really is that it needs some onboarding of data scientists on using open tement so we don't really see much usage of this feature beyond the default metrics system level metrics that we provide by why do you need data scientists to because they have to install open Telemetry on their laptops and stuff is that oh they have to actually export metrics correct correct yeah of course yeah yeah so that exporting is not really um yeah they don't really like that uh at least the interface that we have come up with that's not really that's not really ideal maybe for them so yeah so the default metrix like memory and all we get and we tag it with things like flow names tap name deployment version those sort of things but we need more than that right now yeah so caching um and subflow is the other thing I wanted to touch upon so I know that metaflow already has this resume function which is gold but we like to reserve it for debugging sessions because of like it's inent designer properties for example the flow input immutability when you are assuming you have to assume the same flow input um and it's really more intended for failed flows you you can like resume a failed flow after fixing some changes or those sort of things in in in maybe like other organization use cases differ but in our use cases where there are like very large feature generation pipelines which are processing huge amounts of data we and and we we for the same customer we want to run pipeline again after one month so we have two years of data plus now one additional month of data what we want to do is only uh only process that additional one month of data and use the cash results for the last two years of data so that's sort of where our requirements comes from so we observe that a slight change in flow input in this case one additional month of data does not not require a need not require a full run right most of the steps could benefit from reuse of data from the previous run if possible so so we have this uh unimplemented proposal to solve this problem in completely in user space so without like making any native changes to metlow so what we were thinking is we could introduce this post table such as like FL name run ID task ID and this cash key so so what the steps the metaflow step would then look like is it could first quickly compute the cash from the input uh maybe it's like the Sha of inputs or maybe it's like the current deployment version those sort of things and then check this postest table for the cash key and if if if basically the cash key is found just copy the artifacts from so the table will already tell you which task computed that artifact so just copy the artifacts from the task using the metaflow python API and that's it uh you just yield from the step if not found then you do the business as usual and then write that stable so that was our sort of small idea for adding caching so the basically the way to the the requirement essentially is if the source code did not change if the data did not change then reuse cach results else correct actually go ahead with the computation yeah right yeah right interesting yeah but I mean I mean I don't know you tell me like in isn't um a bunch of data science work kind of sort of begins with like you know randomization initialize like a neural network with random numbers and then change so something or the other might change if you were to run the step every time but is it okay or it depends on your use case I guess maybe for training uh maybe for training starts with a random se but like generation pipelines in particular it's pretty deterministic got it got it okay okay yeah Fair yeah and training tends to be so even in training basically there there are like two stages so there's a pre-processing stage and then there is the actual training so what we have seen happening with our data scientist is that they wrote some flow which is which is like a very large flow it takes six to eight hours to finish to completely train but uh just during the training they found some null value or zero value and it crashed the flow so now they have to restart the flow it it would have been ideal if they so right now they use a resume functionality to do it um but if if it was also cached then that could that could have also solved the problem so the preing steps could have just C complete and now the training can just re got it yeah so um in in principle I think this may solve it we have not really implemented to say um with confidence but because we are doing it in user space and not in the metlow code directly it makes certain things difficult so for one like copying artifact involves a download right so you can avoid the theupload because I'm aware that metaflow uses content addressable storage so theupload it won't do again but the download it will do to copy the uh copy the article so if metaflow were to natively supported um metaflow can just avoid the download as well it just needs to man man insert a pointer into the task metadata file to point to the same mhm so for large artifacts this may be significant and and I guess if you are looking into caching you are talking about large artifacts have you explored using metlow extensions uh not so metf extensions is a is a mechanism for extending metaflow functionality now I'm just trying to think about whether metaflow extensions would work in this case or not um I'm not exactly sure so even I need to go and and of look at that but one option to explore like since you mentioned that okay making changes to metaflow is one thing but it will come with its own kind of like you know yeah yeah it it it can be complicated because you need to make sure everything that exists today works correctly and so on and you know we don't break anything exactly metaflow extensions is the mechanism to extend metaflow uh and add more functionality and I'm I think this might be something I'll also take make this as a note to kind of explore whether this can be something where extensions can help and I can I know that's a good call out actually I have not also explored the Met extensions that much I think I can also have a look what is the functionality of there and how much control does it allow us yeah thank option yeah the other option could just be like imagine meta having like an at cach decorator where it can figure out the logic that you mentioned you can also always write your own decorator which is basically a function that you import have in your own library and ask people to import the library and start using that it takes people to kind of adopt that and so on which is still extensible but but yeah I mean it's you still kind of used especially the this problem of like you know pointer to the artifact that probably you still have to write the code for it yourself yeah yeah so native support would avoid that but that's okay that's not like a really really big problem right now uh the other problem is um so it's sort of a consistency issue so because the right to post so metaflow rights artifacts laily after the task after the last line of the task completes right but the last line of the task is going to be the right to post table which is actually supposed to happen after the right writing artifacts succeeded so that has sort of switched over and uh if like due to some reason the right to artifact S3 fails then we may be looking at inconsistent state so so that's like a slight slight cor case that I think maybe unavoidable U because you're doing it in user space and like the integration with UI dashboard that this step was catch just giving a hint to data scientist that this was cached and maybe also hyperlinking to original it's more of a good to have then requirement one quick way for you to do this today by the way is to use tags so metaflow has tags and then metaflow has a way to access the currently running uh run and task using this current variable if you have seen it right so you can use current and then you can tag that particular task as you know cach and then maybe the name of the step you can put whatever tag is basically a free form key value pair and on the metaflow UI on the right hand side it shows all tags so you can see the tags for a particular flow so you could just add that could be one way to solve for this today itself like of course there could be other improvements but so we we use tags but we use them at a flow level are the tags also available at a task level uh no I I was thinking you can still tag it at the flow level but use the name of the step or the task ID and say that okay cach task ID equals name of the task or something like that I see I see still flow level but you can say given a flow you can say that okay this flow out of this in this flow there were let's say 20 tasks but I see like names of four tasks as cached so I know that these tasks were cached so on and something like that I see yeah yeah that's certainly one way to do it yeah but large number of so if there is like a fan out step then that may be slightly problematic and we have fan out steps on like fan out on vehicles or things got it got it faf yeah so this is I have a a dumb question here um we sort of talking about adding cashes to metaflow right um you sort of gave an example some while back about you know your data scientist having flows that let's say run for seven hours and and whatnot and um if if steps pil in the middle then you want to sort of skip over them out of curiosity what is the like do you have any rough tent of failure rates for your steps that might have driven you to sort of consider caching uh run caching executions from previous runs um of those steps failure meta flow or because of external dependencies like do do you have a sense of that right so so good questions failure is not a motivating example for caching the motivating example for caching is to enable incremental processing of data so so basically let's say you want to process two years of data two years of driving history of a customer but now you want to do it once it becomes your customer you want to do this processing every month as well so so so when a new month of data comes it's easier for you to say to again say process last two years of driving history but for your pipeline to automatically figure out that I have already processed um 23 out of 24 months so I only really need to process the last one month really so that is what we need caching really for the failure was uh is failure is another example but as I said that is actually also solved from the resume functionality the data science already used currently so it's not really that much of a motivator there got it makes sense thank you yeah so incrementality is What U we are really Keen for here so related to caching the other topic is really upflows and this is the synchronous one I know we have landed a support for asynchronous one but that is for aggro workflows I think yeah so I I I'll just motivate why we need subflows um in the organization so if if so imagine like your flows if if they get quite large let's say 20 steps then having subflows is nice if only to modularize right so now modularize so one argument I've seen in select channels in discussions is that you can also do modation today using just split your code into python functions and then step called the python functions which is fine but I argue that there are actually two types of modularization that we are looking for the code time modularization sure this does it but you we are also looking to modularize deployments so so imagine like there are so I I'll just jump to the next SL maybe so imagine like different teams on different feature generation flows one team can own the feature a another can own feature B so we want both the teams to version their feuture generation flows independently separately and then the third team um can basically write should be able to write a parent flow that uses both these sub flows now in the in the current approach um sure like the code can be spread but the deploy as long as like you are your flow is one unit of deployment you are going to deploy the flow a and flow B at the same time with your parent flow and now and now if like the the team that manages flow a releases releases a new version of their flow they also need to deploy the all the consumers so which which is sort of what we want to avoid here so basically also allowing deploy time modularization is what motivates the need for subflows uh right and it it it seems very hard to do without native subflow support as flow is a single unit of deployment so decentralizing deployments is difficult this is a very good point about deployment actually uh at least I have never come across subflows so another term by the way that you might have seen in the slack channels it's called flow composition that's what people have been kind of using so flow composition has always been talked about in term in at least the ones that I've have seen in two uh in two different use cases one is uh that they genuinely want flow a to trigger flow B uh because you know whatever you did a and then you want B um and then the second one was uh had to do with like what maybe like what you were saying like code Simplicity different teams can write different flows and so on but now if you add cicd systems to this and then now you're talking about kind of sort of contracts because of automated um automated triggering of flows now you have to kind of it's It's essentially you can't you have a split brain problem if you only do one deploy one flow which has a change when the other one has not and so on and so forth so that's a good point it's a good point cool yeah and and I think there are a lot of bonus benefits uh if we do sow for example Telemetry can then be done at the sow levels um instead of like so step may be two fine grain so so in the current metaflow architecture there is flow there is Step cor and then there is step right so in in like large flows there is a logical group of steps that you want to study together so subflows allows you the flexibility so there's a flow there is a subflow and then there's a step got it I have a question here so in in this concept of subflows that you're um that you're referring the the parent flow let's say I think in your example you call it the parent flow flow a and Flow as the child flow like I don't know child flow is the right term but um do these flow a and flow B uh sort of logically always live next to the parent flow or do they can they like live in separate uh housed in separate repositories and are version controlled independently of the parent flow uh the reason I asked that question is I'm wondering if you could possibly use like you know if if flow and flow B could be versions independently of the parent flow then can you use can you think of them as independent libraries and then you could version them as cond uh libraries or piie libraries and whatnot so that's kind of the the the thing in my head but I'm not not I don't I understand if flow a and flow B are living right next to the parent flow in this case so so what good question in terms of codee they don't have to really next to each other I think you are right that they can be separate repositories maybe third party packages but in terms of deployments I think they have to live next to each other um because ultimately when you trigger parent flow you want flow a and flow B to get executed so someone has to uh deploy a flow a and flow b as well so maybe metaflow does it automatically when you say deploy parent flow maybe metaflow figures out that this flow has dependency on the these two flows and it also automatically deploys them maybe that's one solution um or like right right right yeah I I didn't thinky got it yeah if you if you want to like modularize deployment so right now the philosophy is that deployment is just one step function or like one unit so then this question does not even arise so for example the flow composition avoids this problem Al together because once you compose the flow it's a single unit of deployment you just deployed in one go MH MH yeah makes sense makes sense thank you yeah good question thanks I think that's really all the slides I had uh this time hey this is incredible man really really good like I think you guys have done great job of kind of taking metaflow and kind of looking at how you want to make it work for a system that you want to build and solves your problems and then being able to systematically take step by-step actions to make that happen so whether it is API driven uh caching observability you know step by step really good work
Original Description
In this video, the ML platform team at Nirvana explains the application of Metaflow for processing extensive telematics data within the insurance sector. Data scientists utilize Metaflow to develop machine learning models for risk estimation incorporating metrics such as average truck speed, sharp turn counts, and journey durations and distances. Other application teams can invoke these flows based on API contracts provided by the data scientists. CI/CD systems can then be put into place, flows can be versioned and all teams can continue to iterate quickly.
Discover more such stories at slack.outerbounds.co
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Playlist UU5h8Ji6Lm1RyAZopnCpDq7Q · Outerbounds · 49 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
▶
50
51
52
53
54
55
56
57
58
59
60
Metaflow GUI for monitoring machine learning workflows
Outerbounds
Metaflow Cards [no sound]
Outerbounds
Fireside chat #1: How to Produce Sustainable Business Value with Machine Learning
Outerbounds
Fireside chat #2: MadeWithML.com -- Teaching Practical Machine Learning
Outerbounds
Metaflow on Kubernetes and Argo Workflows [no sound]
Outerbounds
Fireside chat #3: Reasonable Scale Machine Learning -- You're not Google and it's totally OK
Outerbounds
Metaflow Tags: Programmatic Tagging
Outerbounds
Metaflow Tags: Basic Tagging
Outerbounds
Metaflow Tags: Tags in CI/CD
Outerbounds
Metaflow Tags: Tags and Namespaces
Outerbounds
Metaflow Tags: Tags and Continuous Training
Outerbounds
Fireside chat #4: Machine Learning and User Experience -- Building ML Products for People
Outerbounds
Fireside Chat #5: Machine Learning + Infrastructure for Humans
Outerbounds
Metaflow Sandbox Demo: Free Data Science Infrastructure In the Browser
Outerbounds
Metaflow on Azure
Outerbounds
Fireside Chat #6: Operationalizing ML -- Patterns and Pain Points from MLOps Practitioners
Outerbounds
ML engineering vs traditional software engineering: similarities and differences
Outerbounds
Why data scientists love and hate notebooks: velocity and validation
Outerbounds
What even is a 10x ML engineer?
Outerbounds
The 4 main tasks in the production ML lifecycle
Outerbounds
Is the premise of data-centric AI flawed?
Outerbounds
The 3 factors that Determine the success of ML projects
Outerbounds
Fireside Chat #7: How to Build an Enterprise Machine Learning Platform from Scratch
Outerbounds
Run Metaflow on any cloud: Google Cloud, Azure, or AWS [no sound]
Outerbounds
Metaflow on GCP
Outerbounds
Fireside Chat #8: Navigating the Full Stack of Machine Learning
Outerbounds
How to Build a Full-Stack Recommender System
Outerbounds
Modernize your Airflow deployments with Metaflow - zero-cost migration [no sound]
Outerbounds
Easy Airflow DAGs for ML and data science with Metaflow [no sound]
Outerbounds
Fireside chat #9: Language Processing: From Prototype to Production
Outerbounds
How to build end-to-end recommender systems at reasonable scale
Outerbounds
Full-Stack Machine Learning with Metaflow on CoRise
Outerbounds
Natural Language Processing meets MLOps
Outerbounds
Fireside Chat #10: Large Language Models: Beyond Proofs of Concept
Outerbounds
What even are Large Language Models?
Outerbounds
How to get started with LLMs today
Outerbounds
LLMs in production
Outerbounds
Accessing secrets securely in Metaflow [no audio]
Outerbounds
Fireside Chat #11: The Open-Source Modern Data Stack
Outerbounds
Fireside chat #12: Kubernetes for Data Scientists
Outerbounds
Behind the Screen: How Amazon Prime Video ships RecSys models 4x faster
Outerbounds
Fireside chat #13: Supply Chain Security in Machine Learning
Outerbounds
Quick Delivery, Quicker ML: DeliveryHero's Metaflow Story
Outerbounds
Crafting General Intelligence: LLM Fine-tuning with Metaflow at Adept.ai
Outerbounds
Fuelling Decisions: How DTN Powers Gas Pricing and Data Science Collaboration
Outerbounds
From Kitchen to Doorstep: Optimizing Data Science Velocity at Deliveroo
Outerbounds
Building a GenAI Ready ML Platform with Metaflow at Autodesk
Outerbounds
Media Transcoding for 10 Million users and beyond with Metaflow at Epignosis
Outerbounds
Telematics with Metaflow: How Nirvana Insurance built a large-scale Risk Estimation platform
Outerbounds
Fireside chat #14: Generative AI and Machine Learning for Film, TV, and Gaming
Outerbounds
The Past, Present, and Future of Generative AI
Outerbounds
Building Production Systems with Generative AI, Machine Learning, and Data
Outerbounds
A Custom Fine-Tuned LLM in Action (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 5)
Outerbounds
Building Live Production Systems with RAG (LLMs & RAG: An Interactive Guided Tour Part 4)
Outerbounds
Better Relevancy with RAG (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 3)
Outerbounds
Working with OSS LLMs (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 2)
Outerbounds
Hitting OpenAI and Other Vendor APIs (LLMs, RAG, and Fine-Tuning: An Interactive Guided Tour Part 1)
Outerbounds
Production Systems with Generative AI (LLMs, RAG, & Fine-Tuning: An Interactive Guided Tour Part 0)
Outerbounds
LLMs in Practice: A Guide to Recent Trends and Techniques
Outerbounds
Metaflow for distributed high-performance computing and large-scale AI training
Outerbounds
More on: ML Pipelines
View skill →
🎓
Tutor Explanation
DeepCamp AI