When Airflow Meets Kubernetes

Analytics Vidhya · Beginner ·📐 ML Fundamentals ·3y ago

Key Takeaways

The video 'When Airflow Meets Kubernetes' by Analytics Vidhya explores the lifecycle of a machine learning model and how Airflow and Kubernetes can be used to manage end-to-end operations of machine learning models.

Full Transcript

Hello and welcome, everyone, to another session in the DataHour series. We are thrilled to be here with you for a session full of action-packed learning. I am part of the data science team at Analytics Vidhya. For those who have joined us for the first time, let me give a brief introduction to the DataHour sessions. The DataHour is a series of webinars conducted by Analytics Vidhya and led by top industry experts. It is a fun way to understand the concepts of data science from leading players in the data tech domain, and as the name suggests, it's one hour dedicated to data. We are hopeful that these sessions are going to be a great source of enrichment and value addition for our community members.

Now on to our session today, which is "When Airflow Meets Kubernetes: An Introduction to MLOps". The life cycle of a machine learning model is really complex. There are a number of phases, starting from the exploration phase and going up to deploying the model in production and then maintaining it. Each life cycle phase has its own set of hardware requirements, and information from one phase needs to flow to the other phases. In this DataHour we have Anmol with us. He'll be explaining MLOps from the start, covering end-to-end operations of ML models using the power of Airflow and Kubernetes. I hope you are excited to attend this DataHour with us.

Before we kick things off and I hand over to our speaker, a quick recap of the housekeeping items. We are recording the session and will make the recording available in a few days on our YouTube channel. Please use the Q&A section for any questions you might have during the session, and we'll do our best to answer them as the DataHour progresses or towards the end. We'll also share a poll for feedback on the session towards the end, which I request you to kindly fill in.

Now on to our speaker. In this session of the DataHour we have Anmol Krishan Sachdeva with us, who is a hybrid cloud architect at Google. Let me give an introduction about him. Anmol is an international tech speaker, a distinguished guest lecturer, and a tech panelist, and has represented India at several reputed international hackathons. He is a deep learning researcher and has about eight publications in different domains. He is an active conference organizer and has previously helped organize some of the most prestigious conferences, like EuroPython, GeoPython, and the Python Machine Learning Conference. He has an MSc in Advanced Computing from the University of Bristol, United Kingdom, and currently works at Google as a hybrid cloud architect. In the past, Anmol has spoken at renowned conferences and tech forums like KubeCon, PyCon, and EuroPython, and has been invited as chief guest at various events. He likes innovating, keeping in touch with new technological trends, and mentoring people. So now over to you, Anmol. The virtual stage is all yours.

Thanks a lot for the introduction. I hope my screen is visible.

Yes, you are visible.

All right. So we are here to discuss "When Airflow Meets Kubernetes", and it's quite an interesting topic. We will address two things through this session: first we'll have an introduction to MLOps, and after that we will get into how we can leverage Airflow and the Kubernetes platform for performing MLOps. This is an introductory session, so we are not going to get into advanced details; it's a beginner-friendly session. Is the volume good now?

Yes, it's okay.

Okay, so I'll skip my introduction because I've already been introduced, so thanks. Just a disclaimer before we get started: the content and the views represented in this talk are my own and are not to be related to any of the organizations or companies I'm associated with.

The flow of the talk is as follows. First we will have a look at what MLOps essentially is and why we require MLOps: what's the
actual need of MLOps. We will try to understand orchestration frameworks and the need for orchestration. We'll get into the details of Airflow following that, and then we will have a look at Airflow on Kubernetes. That will be followed by a demo, and towards the end we will have a Q&A session.

So let's get started with understanding why MLOps is needed. Those of you in the audience may be thinking that MLOps seems to be machine learning operationalization, or that it has something to do with DevOps. You are right in that context. The thing you need to understand here, which many people skip or don't stress, is that a production solution for machine learning requires much more than having perfect ML code. You have configuration, the data collection process, the verification process, data serving, deployment, monitoring, process management, analysis, and feature extraction. All of these support building your ML model, support having a proper process for your machine learning pipelines, and support getting your models deployed in production.

Generally, when we are dealing with machine learning models, we tend to bring up some notebooks in Jupyter or maybe Colab, and data scientists are the ones who deal with building the notebooks. But then we see that many of the notebooks don't go into production. These notebooks are not scalable, not replicable, maybe not reusable, or we don't have reproducibility as one of the features. So the work is very much restricted to the local environment of the data scientist who created the notebooks, and when it has to go to production, you need someone else's help, and that
is the place where the data engineering person comes into the picture. So you have this data engineer working along with your data scientist, and then you have another entity coming into the picture, which is the machine learning engineer. Your data engineer solves your data ingestion issues. Your data scientist solves your data modeling issues: data preparation, feature engineering, data cleansing, and algorithm-specific issues. And the machine learning engineer focuses on having your machine learning models deployed, served, and maintained going forward.

So there are different stakeholders in this process, but we generally focus on data scientists or just one of these entities. If we are from a platform background, we focus on data engineering; if we are from a data science background, we focus on just data modeling. But all of these pieces have to be tied together. When they are streamlined and put together in a fashion wherein we receive continuous feedback on the process we are following, that gives us a good machine learning pipeline. And streamlining this machine learning pipeline according to standards, governance, and compliance requirements gives us something called machine learning operations. We will get into that soon, but before that let's understand a few more points which will help you.

Yeah, the screen is shared. Is there any issue with the screen?

No, the screen is visible.

Okay, good. So we will touch on a few more points before we get into the details of MLOps. The first thing to understand here is that science is geared towards research, whereas engineering is geared towards production. What this means is that when you are talking about science, it's more about exploration, it's more about doing
research, it's more about experimenting with models, or maybe building models, doing some iterations, and playing around with your data. This is the role the data scientist has to play. But that alone is not sufficient for your ML models to be served in production. That is where the engineering aspects are required: deploying the model, serving it, maintaining it, and iterating on it. And that is where the machine learning engineers come into the picture.

Then, when we are striving for operational excellence, we can't focus just on having these models built; we need them to be aligned to our business needs and we need them to be regularly updated. So assume a scenario wherein you are a data scientist or a machine learning engineer, you have been told to go about solving a problem over the next six months, and you start developing your model.

Yeah, I'll answer the questions towards the end, so please keep sharing your questions in the chat and I'll answer them in the dedicated Q&A.

So, back to the scenario we were talking about. Your business, maybe your manager, has given you a task to come up with some model, maybe a prediction model for fraudulent transactions, and you have six months to prepare it. You prepare your data, you identify the sources, you do the feature engineering and all the data prep, you experiment with some models, and you try a number of permutations and combinations of different models to see which fit well. Then you pass that to the machine learning engineers, and they do their bit of preparing these models for deployment. You go to your manager and say, "Okay, I am done with this model," and already five months have elapsed. Now your manager
tells you that the data that was provided to you has changed, the features we were working on have changed, and the requirements and the business alignment have changed. So whatever you did for five or six months has gone in vain, and you will need to start from scratch, only to realize that in the next four months the same scenario is likely to happen again: the data you were working on is now obsolete and can't be referenced, and you need something new. You may need to do some more feature engineering, or tweak and refine your features, or clean your data, or modify your data in accordance with the new business requirements that have been shared. That is what calls for operational excellence, and we will see in this session how we can achieve it.

Then there may be multiple teams. We have already identified that there are different stakeholders, like data engineers, who come in towards the beginning of the cycle and also help the data scientists with their platform needs and the machine learning engineers with platform automation, etc. These teams will not necessarily be using the same set of tools. Some teams will have an affinity towards a particular set of tools, because they have experience with a tool, or the tool is really handy, or they have done a lot of projects using it. So they have some bias towards those tools, and you may not have a unified tool set to support these requirements.

And then you are also trying to build multiple models. There are multiple teams using multiple tools, and they are going to build multiple models. All of those models are going to solve some business need, and all of those models may be required to solve
some particular use case as well, but these models are getting built in sub-teams. That is where you need some operationalization in the framework, so that you can have some standardization and some reusability. You can reuse the things which were developed earlier, or reproduce them, and you can have some checkpoints and some governance in place. Suppose someone says, "I need to check what happened with this particular model," or, "We deployed this model in production as version V1; how does it compare with the new model you are coming up with, version V2?" All of those things are going to give you a hard time if you don't have operationalization in the framework.

Hello, I would like to let you know that a few people are not able to see your screen.

Okay.

Would you please reshare your screen? And for whoever is viewing in Zoom, there is a button on the top called View; have the speaker view along with the screen ticked there. It's called the side-by-side speaker and screen view. That should solve your issue.

Okay, should I continue?

Yes, you can.

Okay, thanks. We also see that sometimes it is very hard to manage models which have a lot of data. There are large backlogs of models that are to be published, or they are queued because they need to go out in some particular sequence, and there's no standard way to audit, monitor, test, or even maintain the models from a future perspective. So there's a big question mark over the life cycle and governance constructs.

According to Google Trends, the term MLOps has got a lot of traction, and since 2004 you can see that it has grown a lot. The interest in this field is growing simply because there is a need for MLOps in the field. So what's required for AI/ML projects?
And whatever we are going to talk about holds true for AI projects too; it's not restricted to just ML projects, which is why I have mentioned ML and AI projects. ML and AI projects generally require cloud-native platforms, because there's a lot of compute to be done. There is a requirement for containerized workloads, so that you can have the same environment and the same configuration persisted across the different runs that you do in different environments across your organization. You sometimes also have a requirement for serverless technology or integration with some cloud technologies, wherein you will be hitting some cloud functions, maybe Pub/Sub, or Lambda functions in AWS, whichever you can relate to. So serverless technology is also required for getting some of the tasks delivered.

Then many a time specialized hardware is required. Some models work well with GPUs and some work well with CPUs, and while training the models or prepping for them throughout the ML pipeline, you may require a different node configuration for each of the tasks you are performing. Data prep, if it has to be automated, will have its own configuration requirements. Data cleaning, feature preparation, and feature engineering will have their own set of requirements. Machine learning model deployment, serving, etc. will have their own requirements from a node configuration standpoint. And then there is a hard requirement of integration with the data stores, with the big data platforms, and with tools like Hive, Spark, Hadoop, etc.

So all in all, we get a common theme from all of the things that have been put on the screen, and that theme is that machine learning and AI are very close to the cloud. They are
very much reliant on the cloud, and it is hard to imagine any AI/ML pipeline in today's scenario which is not utilizing some of the cloud constructs, features, or products. More or less, the majority of them will be utilizing cloud technology at some point in time.

So let's get into what MLOps is. MLOps is just the tip of the iceberg. You see MLOps placed at the top here, but what serves as the pillars of MLOps? There are three essential domains or pillars. One is DevOps, which is the bottom-most pillar, the bottom-most item in the stack that you need to focus on. If you have DevOps sorted, you will be better positioned to be successful in your MLOps journey. Why DevOps? Because you are talking about continuous integration, wherein a new feature, new code, or new configuration gets introduced, and that new configuration is going to get tested as part of continuous integration. You may have unit testing, integration testing, behavior-driven testing; all of those you can take care of. Once that testing has been done and all your test cases have passed, you are good to have your things deployed or delivered using continuous delivery and continuous deployment. So DevOps is a very strong construct and a requirement for achieving MLOps, because it handles most of your automation requirements for CI/CD.

Then comes the data automation portion, wherein you are going to have periodic collection of data. You may be running jobs, you may have stream processing for your data, you may even have serverless or event-driven triggers for your data
gathering and collection needs. You may have a big data feature store, you may have big data jobs for processing your data, you may be utilizing a data lake, etc. And the most essential part, which is very much required for MLOps from a data operations and data automation point of view, is that you need data and model versioning to be done, and you need some checkpoints to refer to. All of that comes as part of the data automation stack.

Then comes the platform automation stack. Platform automation is the pillar which is required to have your platform or infrastructure support the machine learning pipelines and enable the machine learning automation. Once you have the data flow automated and you have achieved data processing nirvana, so to speak, you have this platform wherein you will be required to support different hardware, different node types, and different virtual machines. Maybe you want to go with Kubernetes, maybe you want to do some I/O against your storage buckets for the data lake, maybe you want to have some production endpoints generated for serving your ML pipeline. So all of those things essentially...

You shared a screenshot? You are still on the first page.

No, no, I'm not on the first page now.

It's not visible for everyone.

Okay, let me start resharing. What do you see? A pyramid?

Yes.

Okay, I'm not sure what happened. Thanks for pointing this out. So what we missed was this particular trend, these bullets which I was talking about, and these technologies which are required for AI/ML. And we are currently talking about the pyramid. I hope you followed the audio though, if not the video. So then comes the MLOps,
which is essentially the process of automating your machine learning using the DevOps methodology. DevOps is a behavior, a methodology. Likewise, machine learning operationalization is also a methodology or a behavior: you are building machine learning models which follow some standard practices and which leverage the DevOps practices. And you get operational excellence, massive scaling, standardization, streamlining of data pipelines...

Hello, sorry for interrupting, but again we are seeing the blue screen.

Okay, not sure why this is happening. Give me a second, please. Is it okay now?

Yes, now we can see the pyramid.

Okay. You see the screen moving, right?

Yes, perfect.

Okay, thanks. So MLOps is going to help you with standardizing your machine learning pipelines and streamlining your data science or ML workflows, and you are going to have your governance, your project life cycle management, and operational excellence all addressed.

In short, you need four things which will help you with MLOps. The first is that you need an MLOps feedback loop structured in a manner which makes sense and which is aligned to both technical and business requirements. The feedback loop includes creating and retraining models with reusable ML pipelines. Creating a model just once is not enough. You will need to have the model recreated because the data will keep changing and your requirements will keep changing. The term we use in machine learning for this is degrading or decaying of data. Your data is decaying: it's not relevant now, or in a few days', weeks', or months' time it is not relevant,
so you will need to retrain the models. But you will still need to hold a reference to the other models and how they performed, so you will need some sort of checkpointing and versioning for your models. This is also where you can have some aspects reused while building the models. So you need to have this solved first: creation, training, and retraining of models with reusable ML pipelines. That is the first focus pillar in the feedback loop.

The second one is deploying and versioning ML models. You will have continuous delivery for your ML models, and it is very similar to the continuous delivery of software. The only difference is that software is more or less static, while machine learning is dynamic because of the nature of data.

Then you have auditing of ML models and artifacts. This is from an auditing and governance perspective: you will need to have the artifacts and the ML pipelines monitored and audited. And you will also have to look at what exactly is changing in your data, so that you can adapt and iterate on your models to fit both the business and technical needs.

From a stakeholder perspective, you have different parties as part of MLOps. The first one is the business stakeholder, who should be involved pretty early in the planning phases. You should be getting inputs on what needs to be deployed, what the business requirements are, and what features to stress. The focus should also be on how the model should be deployed, because a model can be deployed either in a real-time fashion or in an embedded fashion. More or less, these are the two broad categories of how models are deployed. So you will need a clear alignment with the business on whether they
want real-time predictions or embedded predictions which happen over a batch. All of those things help you in getting the feature engineering and preparation done. You have data scientists, who act on all of these things with respect to data preparation, model training, and model experimentation, having a playground and testing out permutations and combinations of different models. You have data engineers, who support the requirements from a platform perspective and give you the way to get data ingestion done. Then you have DevOps people, data engineers, and machine learning engineers working together on the machine learning code: deploying it, serving it, monitoring it, the CI/CD perspective, containerizing it, and evaluating it from a drift perspective and an auditing perspective. So all of these stakeholders have to work together. If you look at it, it's very much the flow we saw in the feedback loop, and it aligns from a stakeholder management point of view as well.

A high-level MLOps workflow will look like this. You will have your data ingested and transformed; this portion involves the data engineers. It is followed by the data scientists: you have the data preparation done, which is handed to the ML engineers. You will have the model built, the optimizations done, and the model served, and that is followed by logging, monitoring, and the iterative loop for audit, fine-tuning, etc.

So that is what MLOps is from a broader perspective. How we can actually realize it is the interesting part. Till now we have covered some theory; from here on we will cover some theory of Airflow, which will be followed by a demo. So what is Airflow? Airflow is an orchestrator, and why an orchestrator is needed is
because all of these phases in the MLOps architecture or workflow that we saw have to be operationalized and orchestrated. Each of these steps has something to pass on to the next step, or refers to something from the previous step. So all of these phases are connected: there needs to be an exchange of information and some coordination between these tasks, and only because of that will you be able to have end-to-end streamlining of your ML pipelines.

So Airflow is an orchestration tool. There are other tools in the market as well which are equally good, but Airflow is one of the oldest and one of the most reliable, because it is a very mature Apache project. It is something many people consider for data operations, or for orchestration of data pipelines and ML pipelines. You have other tools like Kubeflow, MLflow, Prefect, etc. which can be utilized as well. All of these tools have their own pros and cons. Airflow is an open-source tool, and it's used for developing, scheduling, and monitoring batch-oriented workflows.

Okay, is the issue persisting again?

We can see it. Yeah, now it's fine.

Okay. I didn't do anything, and I didn't go to the next bullet point, because I was discussing the first one. I'll cover this TensorFlow and Airflow point towards the end.

So it's a Python-first framework, and it supports parameterization using Jinja templates as well. You can write your pipelines in Python code, and all of those things can be managed through Python code, including the sequencing, etc. And just to stress, this is there for batch-oriented workflows only; streaming and the like are not the use cases.

Then there is a concept called DAG, which is
a directed acyclic graph. What this means is each of the steps...

Yeah, so what is a pipeline here? A pipeline here is the combination of the steps we saw earlier. These steps are your data cleaning process, data ingestion, followed by transformation, going on to model building, then optimization of the model, then serving the model, followed by all the monitoring, auditing, and CI/CD aspects. So a pipeline is a combination of all these steps linked together. You see the arrow going from data cleansing to data ingestion, etc., and it keeps going like this; then you have the model, the model is served, and then you have the monitoring done. So a pipeline is a sequence of steps which are followed to serve your ML model, starting from the exploratory phase, to the productionizing and production phase, followed by maintenance, and it goes on in a loop.

Okay, so coming to the DAG workflow. A DAG is a directed acyclic graph, and that is the foundational concept of Airflow. What purpose the DAG serves we will see in the demo. Say you have task 1, which is data cleaning, and now you want to do feature engineering, which is the second task. What task 1 produces should be passed on to task 2. This means it is directed: task 1 is to be followed by task 2, which means it has some direction, so one node has another node following it. Acyclic means you cannot have a cyclic loop, so task 2 cannot go back to task 1. If data cleaning is happening and you are passing the data to the data preparation phase, you cannot have the data preparation phase go back to data cleaning; the output of data preparation will not be fed back into the data cleaning phase. So that's acyclic, which means you don't have any cycles between
these nodes. So directed means you have edges which have a direction, acyclic means it is not cyclic, and graph means this is a group of nodes connected by edges, and those edges have directions and do not form a cycle. This is what a DAG is. We can have all the flow steps we just saw represented in Python as code, and all of those things can be called as functions using Python or the syntax of Airflow. That is where you will get a DAG formed, and we will see that in the demo.

All right. This is extensible with operators. It has many integrations, and it can be integrated with third parties too. You may want a BigQuery connector, a Cloud SQL connector, MySQL, Hive, dbt, or any other connector, even Kubernetes, GKE, S3, EMR, MapReduce, Presto, etc. All of those connectors are available, so it's extensible with third-party tools.

You have task dependency management, and you have parallel execution. If there are tasks which can be executed in parallel, maybe multiple data cleaning jobs, or data cleaning jobs followed by grouping, chaining, or clubbing the data together, you may have multiple jobs for that, and you may have multiple data sources. All of those tasks which are capable of being performed in parallel will be executed in parallel. That is the beauty of this framework.

And if, due to some issue, your task is not able to execute properly, there are retry and notification mechanisms as well, so you will get proper alerting, monitoring, logging, and reporting. Say during execution of a task you had a network connectivity issue with your data store, because of which the task failed. It will report
that it failed the task, and depending on what configuration you have provided for retries, it will be able to retry the task, right? And as many times as you specify, it will be able to retry the task, and unless the current task is successful, task 2 will not start getting executed, right? So if task 1 is unsuccessful and is in a retry loop, since task 2 is dependent on task 1 because of that directed edge that we have, we will not have task 2 execution started, right? So this is essentially wherein, if at all anything is breaking in your pipeline, you will not have the subsequent phases execute, and you will be stuck on that particular phase, which will require your attention, and once you fix it, it will automatically start the execution of the next phases.

So talking in terms of the Airflow components, there are essentially four Airflow components, or four critical infrastructure components from an Airflow perspective: one is the Airflow web server, one is the Airflow scheduler, one is the executor, and one is the metadata store, right? So what these components mean, let's understand them, and all of these are something which we will be visualizing in the demo, so you will get to see all of these components and we will make references back to all of these things which we are discussing.

So the Airflow web server is essentially a web UI, okay? You have the DAGs, or the directed acyclic graphs, which you are writing in Python available on the UI, and there is provision to go about having those DAGs run; you can schedule the run of those DAGs. So essentially treat it as a dashboard which is being offered to you, through which you can have these pipelines executed through DAGs, you can have them monitored, you can check the logs, you can have these orchestrated, you can have some notifications laid out, you can form connections with the databases, external databases, maybe by passing the password, username, etc., the connection ID, and all of these things are
also possible through the management APIs of Airflow. So Airflow has a rich API; you can utilize that for interacting with the web server components, or you can use the UI, or there is even a CLI which you can utilize, right? So the web server is essentially the management plane, you can consider, which is showing you these things, so essentially it's the core component of the management plane or master plane.

The scheduler is essentially a component which is used to schedule these workflows or pipelines, right? So suppose you have two tasks: one is data cleaning, and that is followed by data preparation. Now data cleaning is the first task that should be getting executed. In order to get executed, it should be scheduled first to some infrastructure, right? It may be a virtual machine on which it is getting scheduled, it may be a Kubernetes pod on which it is getting scheduled. So essentially the scheduler is actually taking care of having your tasks or the steps scheduled, right? So task 1 is getting scheduled, it is getting executed, and task 2 will not get executed unless task 1 has successfully executed; that is something we learned, right? And there are mechanisms through which we can see that task 2 will remain in some queue unless task 1 gets successfully executed, right? So a queuing mechanism is also something which is present, and that is part of a few executors that are part of Airflow.

So the executor is essentially the component which helps you execute these tasks, right? So first we have the web server, which is offering us the UI and management APIs to have these tasks reported and registered in Airflow, and then you have the scheduler, which is going to actually schedule these tasks when these tasks are ready to execute, so it is the scheduler which takes care of that, right? And you can even provide some cron schedule, you can provide some presets based on which you want to have the tasks executed, right? So you can say that I want my task to start execution
only at 2 a.m. in the early morning, and I don't want it to be running in the evening or afternoon time, right? So you can decide on when your task should run, what's the silence period for your business, and then you can accordingly have the tasks run, right?

Then there is something called the Airflow executor, which essentially gives you the power of having these tasks literally executed. So you will have these tasks either executed on virtual machines or Kubernetes, or these tasks executed using some third-party integrations. So the executor takes care of having these tasks properly initiated on the target system, maybe virtual machines or maybe pods, and then from there you have these integrations getting triggered. So again, execution may be a queue-based execution, or it may be just like a task is getting assigned and you have a new pod getting spun up and that pod is executing your workflow step, right?

And then you have a metadata store, or meta store, or database, which essentially does what it is named after. So it is storing the metadata of all these executions that are happening, of all the management that is going on with the cluster; the web UI, whatever updates it is getting, all of them are getting stored as metadata. Likewise, the scheduler will say that, okay, I have scheduled this job, now whether it is successful or whether it has failed is something which will be reported back to the meta store. So this is essentially a central database, a metadata database, which you can maintain, right? And there are different options for these metadata stores as well, so you can have Postgres, you can have SQL, so there are different options to support this metadata store as well.

And coming to executors, which is going to be a critical thing before we get on to what is there in Kubernetes and our demo. So there are majorly four executors, but there are many more as well. One is the sequential executor, which will ensure that
you have your jobs sequentially executed, so no parallelism or no concurrency here, and this is executed in a local environment, right? So if you are running Airflow on your laptop, you have these tasks executed sequentially, one after the other. Even though the tasks are of parallel nature, and some of the tasks could be executed in parallel, your machine will not allow it because you have selected the sequential executor, so you will have a sequential execution of the tasks in your ML pipeline, right?

Then you have local. Local is essentially a variation of sequential, but this is something which can be made parallel, right? So it runs on local, but it has the parallelism feature, right?

Then there is the Celery executor. For the people who are from a Python background, they would know that Celery is an asynchronous task queue management system, and essentially where it is getting leveraged is to have production-based ML pipelines or Airflow tasks orchestrated. So you have essentially a queuing system in between, like RabbitMQ or Redis, through which the tasks are queued, and essentially you have target worker nodes, and whichever worker node is adhering to the configuration required for the task and has the capacity to run the task will be chosen, and the task will get scheduled from the queue on that particular node, right? And once that task execution is done, whatever is the next task will get executed. If you have multiple nodes, you can scale out, you can horizontally scale these nodes, so if you have multiple nodes you can go about having multiple tasks executed in parallel. So you have the parallelism, you have the concurrency settings, and that is why it is a production-grade offering or a production-grade executor, right?

Then comes Kubernetes, which is fairly recent in this ecosystem, and that is again production grade, and it also has a variation of having Celery run on top of Kubernetes, so you can leverage the
main benefits of Celery and the benefits of Kubernetes from a scaling perspective as well, right? So essentially, from a Kubernetes executor standpoint, what we need to consider is that all the other things which we saw, sequential, local, Celery, these were running on virtual machines or maybe on machines or nodes, but Kubernetes is essentially where your task will run in pods, okay? So you will have pods created in Kubernetes on which your task will be running, and then if at all you choose Celery, you will have these additional queuing components which will be coming into the picture; again, those will be deployed as pods, okay? But you will have the same mechanism of asynchronous task queuing, etc., and distribution of tasks respected. So it will bring you the power of Kubernetes, it will bring you the power of Celery. It's an option, okay? There is an option to go with Kubernetes but not with Celery as well, right? So all of those things are essentially possible, right?

Yeah, I'll be sharing the presentation and material. And someone is asking about the cron scheduler. So essentially a cron scheduler is something which you can think of as a traditional thing to go about scheduling the tasks. Airflow or orchestration frameworks are giving you a lot of power on top of what a cron scheduler has to give, right? So a cron scheduler will just be able to have your tasks run according to some cron job or a cron job preset or any schedule, but Airflow is giving you management capabilities, data sharing capabilities, integration capabilities, a retry mechanism, a dependency mechanism, dependency tracking, and all of those things that we discussed in the introduction of Airflow, right? So it's much more advanced than cron schedulers, right? So don't mix the concepts. Yeah, the context is that it starts with cron scheduling as an idea, but then it goes on and builds on top of it, and it has really matured in that space, right?

Now coming to Kubernetes, there are
three essential ways of how you can have Airflow running on Kubernetes. One is through the Kubernetes executor, which we saw, wherein you also have an option of having Kubernetes along with Celery running, but then there are two more options: the Kubernetes pod operator, and KEDA, which is the Kubernetes Event-driven Autoscaling platform, that is one of the most recent and stable, I can say, releases or features of Airflow. So we'll go through these one by one.

So first thing is the Kubernetes pod operator. The Kubernetes pod operator essentially is different from the Kubernetes executor in a way that in the Kubernetes pod operator you will have control over the container images of each of the tasks, right? So each of the tasks, like data cleansing, can have its own container image, can have its own hardware requirements specified, node pools specified, labels, selectors, taints, tolerations, etc. specified. So all the Kubernetes constructs, all the Kubernetes API constructs from a scheduling perspective are applicable to each of the tasks, and you can have each of these tasks run in different containers, using different images, using different configurations. You may require one task to run on GPU, one task to run on CPU; all of those things are possible using the pod operator.

How it differs from the executor is that the executor just follows one standard container image, right? So it's a beefy image that you will create that will have Airflow, that will have all the dependencies that you have for Python, and all the dependencies you have to start integrations with the other platforms. Maybe, say, you are using BigQuery, so you will need to have cloud support, so you will need to have the cloud library or SDK for Python also installed as a pip requirement; maybe if you are using scikit-learn, you will have sklearn and pandas, etc. to be downloaded, and it will be a single image, and that image essentially will be getting spun up every time in the form of a pod, and you will have the task run
using the same container image essentially, but these will be different pods that you will be running, right? Whereas in the Kubernetes pod operator you are getting pod-level control, wherein for each task you can have a different definition for the pod, right? So it differs in this way. So the executor is like a standard configuration you follow, and all the tasks get executed in the pod that is there, whereas in the Kubernetes pod operator you will have individual pods created, and that is where you can have these different configurations managed.

And then comes the Kubernetes Event-driven Autoscaling feature, and this is essentially meant for Celery execution, okay? So Kubernetes Event-driven Autoscaler is a platform that helps you autoscale Kubernetes workloads and deployments based on some custom metrics, right, or events, or external metrics or events. Like if you want to autoscale your workloads based on some specific metrics, like the number of DB connections or, say, the lag in the queue, right, all of those things can be done using Kubernetes Event-driven Autoscaler, and this is where Airflow introduced this option for the Celery executor, because based on the overhead of requests, based on the tasks that are there, right, it will be able to leverage the elastic scaling capability of Kubernetes and still have Celery workers run on that, right? So this is one of the recent additions that are there, and essentially it is running as a custom resource definition. For the people who are not aware of custom resource definitions, it is an advanced construct in Kubernetes, and there is an object called ScaledObject, with a scaler as a resource, which is used in the CRD. So it's just a custom definition of one of the resources which the Airflow team has defined so that these can be run in Kubernetes in the manner other deployments work, and it can be auto
scaled using the KEDA feature, right?

So I see a question here: what is the difference of running Airflow with Docker and with Kubernetes? So Kubernetes is essentially giving you orchestration capabilities, whereas with Docker you may utilize Docker Compose, etc., but again those are like old platforms, and you will not be able to leverage the autoscaling capabilities, etc., or the configuration for hardware that you want to perform based on different workloads. So all of these are something which you can get from an orchestrator framework, and this is where Kubernetes stands out.

Then there is another question: how is the DAG fitting in this part of the process, how is it necessary? So this is something which we will see in the demo, so maybe two more minutes and we get to the demo, right?

So these are three ways of how Airflow can have the pipelines run on Kubernetes, and from a Kubernetes executor perspective, this is the high-level architecture diagram of how it looks. So you have these components which we explained: the web server, which is supporting the UI, you have the scheduler, you have the Kubernetes executor, and you have the Airflow configuration, which is essentially powering the behavior of how Airflow works. You have these DAGs, which are nothing but Python code pieces or snippets that you are pushing, and these are getting pushed to the web server, because of which they are having these configurations loaded and getting scheduled, and these are getting scheduled on the Kubernetes executor. The Kubernetes executor is then going to have the pods created based on the template, the common template, the common container image that is there, and each of these pods will be capable of running the tasks, right?

Again, there is some problem with the screen, is it? No? Okay, thanks. You see the demo page, is it? No, no demo page; yeah, it's the same Kubernetes executor. Okay, okay, visible, right? Yes, yes, okay. Yeah, so now
we come to the demo portion, right? So what we will do is we will create a Kubernetes cluster. For this particular demo I have created a Kubernetes cluster in GKE, which is Google Kubernetes Engine, which is running essentially Kubernetes as a managed service on top of Google Cloud Platform, right? So what we are going to do is, we already have a Kubernetes cluster up and running, and we will fire commands against it. If at all you have a Kubernetes cluster with you, you can also follow along. The Kubernetes cluster doesn't need to necessarily be on a cloud platform; you can even have it on minikube or your local using Docker Desktop, or if you are using kind, Kubernetes in Docker, you can use that as well. So there is no restriction as such; you just require a Kubernetes cluster with some capacity so that you can have the workflow scheduled on it, right?

So now what we are going to do is I will bring up my terminal, right, and we are going to have the Airflow components deployed using a Helm chart, okay? So what I have done is I have a Kubernetes cluster which is called airflow kubernetes 0 0, okay, I believe it is visible, and that cluster is having three nodes as of now, okay? These are the worker nodes on which we will be having our Airflow-specific components scheduled, right?

Now what we do is, first thing is we utilize Helm, okay? So Helm is a wrapper on top of the Kubernetes, I can say, API, that is giving you the functionality of maintaining your Kubernetes releases, and essentially it gives you the capability of managing these releases in a very intuitive manner, and you can perform rollbacks, you can have maintenance done, you can track the releases, you can have registries added, and essentially you can utilize existing workloads. Maybe you want to deploy Redis, so there is no need to go about having Redis created by you; essentially you can pull the Redis image and related artifacts, or chart as we call it in
Helm, from the Helm repository or registry, and then yo

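The retry and dependency behavior described in the transcript (a failed task is retried, and a downstream task does not start until its upstream task succeeds) can be sketched in plain Python. This is only an illustration of the idea, not Airflow's API: `run_pipeline`, the `retries` parameter, the state names, and the sample tasks are all hypothetical names chosen for this sketch.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream, retries=2):
    """Run tasks in dependency order, retrying failures (illustrative only).

    tasks:    dict of task name -> zero-argument callable
    upstream: dict of task name -> set of task names it depends on
    retries:  extra attempts allowed after the first failure
    Returns a dict of task name -> "success", "failed" or "upstream_failed".
    """
    graph = {name: upstream.get(name, set()) for name in tasks}
    state = {}
    for name in TopologicalSorter(graph).static_order():
        # A task is not started if any task it depends on did not succeed.
        if any(state[dep] != "success" for dep in graph[name]):
            state[name] = "upstream_failed"
            continue
        state[name] = "failed"
        for _ in range(retries + 1):
            try:
                tasks[name]()
            except Exception:
                continue  # try again until the attempts run out
            state[name] = "success"
            break
    return state


# Hypothetical tasks: cleaning fails once (say, a network blip), then succeeds.
attempts = {"clean": 0}


def clean_data():
    attempts["clean"] += 1
    if attempts["clean"] == 1:
        raise ConnectionError("data store unreachable")


def prepare_data():
    pass


def broken_clean():
    raise ConnectionError("data store unreachable")


deps = {"prepare": {"clean"}}  # "prepare" depends on "clean"

recovered = run_pipeline({"clean": clean_data, "prepare": prepare_data}, deps)
blocked = run_pipeline({"clean": broken_clean, "prepare": prepare_data}, deps)
print(recovered)
print(blocked)
```

In the first run the cleaning step succeeds on its second attempt, so preparation runs; in the second run cleaning exhausts its retries, so preparation is never started. In Airflow itself the equivalent settings live on the task and the scheduler handles the bookkeeping.
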
Original Description

The lifecycle of a Machine Learning model is really complex. There are a number of phases starting from the exploration phase, which go up to deploying the model in production and then further maintaining it. Each lifecycle phase has its own set of hardware requirements and even information from one phase needs to flow to the other phases. In this DataHour Anmol will be explaining MLOps from the start covering end-to-end operationalization of ML models using the power of Airflow and Kubernetes.

The video teaches how to manage the end-to-end operations of machine learning models using Airflow and Kubernetes, and why it matters for achieving operational excellence in machine learning.

Key Takeaways
  1. Ingest data
  2. Transform data
  3. Prepare data for ML
  4. Build ML model
  5. Optimize ML model
  6. Configure executor
  7. Initiate tasks on target system
  8. Store execution metadata
  9. Schedule tasks in queue
  10. Execute tasks in parallel
💡 Airflow and Kubernetes can be used together to manage the end-to-end operations of machine learning models, providing a scalable and production-grade solution for machine learning pipelines.
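
The first five steps above form a dependency chain, and the talk notes that independent steps (for example, cleaning several data sources) can run in parallel. As a rough sketch in plain Python, not Airflow code, the standard-library `graphlib` module can group tasks into batches that are ready at the same time. The two data sources (`ingest_a`, `ingest_b`) and every task name here are assumptions made for illustration; the video does not specify them.

```python
from graphlib import TopologicalSorter


def parallel_batches(upstream):
    """Group tasks into batches whose members have no unmet dependencies.

    upstream: dict of task name -> set of task names it depends on.
    Raises graphlib.CycleError (a ValueError) if the graph is not acyclic.
    """
    sorter = TopologicalSorter(upstream)
    sorter.prepare()  # validates that the graph has no cycles
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel now
        batches.append(ready)
        sorter.done(*ready)  # mark them finished so downstream tasks unlock
    return batches


# Hypothetical pipeline: two sources are ingested and transformed independently,
# then merged for model building and optimization.
pipeline = {
    "transform_a": {"ingest_a"},
    "transform_b": {"ingest_b"},
    "prepare_for_ml": {"transform_a", "transform_b"},
    "build_model": {"prepare_for_ml"},
    "optimize_model": {"build_model"},
}

batches = parallel_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Each printed batch holds tasks that do not depend on one another, which is the property an executor with parallelism (local, Celery, or Kubernetes in the talk) can exploit; a sequential executor would simply run them one at a time.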
