The DataHour: Anomaly detection using NLP and Predictive Modeling

Analytics Vidhya · Intermediate ·📊 Data Analytics & Business Intelligence ·3y ago

Skills: Unsupervised Learning 80%, RAG Basics 70%, Vector Stores 60%, RAG Evaluation 60%, Advanced RAG 50%

Key Takeaways

The DataHour session covers anomaly detection using NLP and predictive modeling, focusing on software engineering use cases at product-oriented companies, with techniques such as seasonal decomposition, contextual analysis, and a simple rules-based approach.

Full Transcript

Good evening, everyone, and welcome to another session in the DataHour series. We are thrilled to have you here this evening for an interactive learning session. I, Zane, part of the data science team at Analytics Vidhya, will be the moderator for the session.

For those who have joined us for the first time, a brief introduction to the DataHour sessions: with the intent of making data science learning more engaging for the community, we began a new initiative, DataHour, an hour dedicated to data. DataHour is a series of webinars led by top industry experts, where they teach and democratize data science knowledge.

Now on to our session today, which is anomaly detection using NLP and predictive modeling. In this DataHour session, Paritosh will cover the fundamentals of anomaly detection and discuss its application in job management and exception handling. Rather than proposing a single approach, the session will identify the nuances that should be considered while designing an anomaly detection solution. The session will be aided by real-life examples that leverage NLP, predictive modeling, and other ensemble methods to identify, correct, and prevent anomalies.

Before we kick things off and I hand over to the presenter, a quick recap of a few things. First, we are recording the session and will make the recording available in a few days on our YouTube channel. Second, please use the Q&A section for any questions you might have during the session; as the DataHour progresses towards the end, we will do our best to answer them. Third, we will share a feedback poll towards the end of the session, which I request everyone to participate in.

Now on to our speaker. In this session of DataHour we have Paritosh Sinha with us, senior data scientist at Uber. A problem solver, quick learner, and experienced team lead, Paritosh is a tenured data scientist with 10-plus years of experience in using machine learning, statistical, and NLP techniques to solve business problems 
across consulting, services, and product-based organizations. He is currently working as a senior data scientist in the marketing division at Uber. You can follow him on LinkedIn; I am sharing his LinkedIn profile in the chat. Over to you, Paritosh. The virtual stage is all yours.

Thank you, and thank you for the introduction as well. I hope you can hear me clearly and my video is working. Okay, cool. So, nice to see a large group of people joining us for the DataHour series today. My introduction is already done, so I won't do it again. The focus for today is on anomaly prediction and how we execute that using NLP and predictive modeling. The one thing I want to do in today's DataHour is focus on a lot of real-life examples and how I have solved them, based on my previous experience. I would love to have a discussion around the approach. There is no right or wrong, as I've already said; it all evolves based on the requirements, and I am looking forward to a discussion. So let's get started. I'll share my screen.

So, like I mentioned, the focus today is on anomaly detection, and the domain we will be focusing on is software engineering. A lot of product-oriented companies are deploying related solutions to automate their operational processes and ensure that there is an intelligent layer in them, so that the human element is reduced.

There are three items on the agenda. Given that nearly eighty percent of the audience is new to data science and machine learning, with less than three years of experience, I'll do a quick refresher on anomaly detection; that should take us around 10 minutes. In the next phase I will quickly jump into the use cases in software engineering. I'll talk about two, and like I mentioned, this will be more interactive than anything else. When we are
discussing the use cases, we'll discuss the problem background, the approach, the potential solution, and then we'll discuss what kind of...

Your voice is quite low. Can you please speak a little bit louder? Yeah, is it better? Better, yes. Great.

I was mentioning, let me just do a quick recap of the agenda. First, we are doing a refresher on anomaly detection. Second, we'll focus on the use cases in software engineering; I'll talk about two. Then we'll reserve the last 10 minutes for questions.

So let's start. What is anomaly detection? Let's start from there and then move forward. Anomalies are data points that display unexpected behavior against usual patterns. The two things I want to focus on here are, one, unexpected behavior, and two, usual pattern. The whole exercise going forward will focus on these two elements. First we need to establish what your baseline, or expected, behavior is, and then we look at which data points or groups of data points are not behaving according to the pattern we have established. That is all anomaly detection is about.

Why is it important? It gives you a signal about the underlying background conditions that are causing the anomaly to happen. To give you an example, think of anomaly detection in a manufacturing center. If there is an equipment failure, it is not just an equipment failure; there are certain background conditions, such as raised pressure, an increase in temperature, or some background processes that are not functioning well. What you will see is that downstream processes will also be affected, and that is where the impact cascades and there is no controlling the final outcome. That is the reason why we first detect, and then also want to predict, so that we prevent failures from
downstream processes. Like I mentioned, anomaly detection is simply the process of identifying those anomalies, and prediction is about predicting future anomalies in order to prevent them.

Now, for this exercise, what are the key characteristics of the anomalies we are trying to detect? This is a little bit tailored to what we will see in the upcoming slides as well. First, anomalies are sporadic and rare. Usually the anomaly rate will be in the low single digits of a percentage, or lower; you will never see an anomaly rate in the double digits or higher. That is why they are sporadic and rare.

The second thing is that the anomalies should be traceable. What I mean is, given that this is a data science exercise, whether we are trying to identify, predict, or do both, we need to trace the anomalies: specifically, when they happen and what the conditions around them are. So the anomalies themselves should be traceable. It cannot be a very vague understanding of whether this is an anomaly or not.

The next piece is that once you identify an anomaly, you need to have some background information, because that is the essential construct on which you will either detect an anomaly or predict it. So what the anomaly is and what the events leading up to it are, both are really important.

And finally, whatever data you have captured should be explainable. What I mean is that because the presence of anomalies is so low, there are often multiple external conditions that you are not able to factor in when building the anomaly detection or prediction model, and that leads to a higher failure rate for these use cases if you don't have the underlying factors. So, to recap: anomalies are rare, they should be traceable, there should be
background information, and they should be explainable as well.

Quickly jumping into some common anomaly types: there are three types of anomalies that we will mainly focus on in this session, and anomalies generally fall under these three categories. One is global anomalies. As you will see from the visual as well, global anomalies are data points that assume a value far from the expected or general range. There is no context to it; it is just an absolute identification of what the range of values looks like and which value is way higher or lower.

Contextual anomalies are those where we need to take care of multiple factors while identifying what is an anomaly. A contextual anomaly will, for example, depend on the day of the week, the time of day, or the weather conditions. So there are a lot of external factors that influence the presence of the anomaly.

And finally, collective outliers are sets of outliers that involve not just one point but a group of points showing a big deviation. For example, if you look at the chart here, you will see that there is an almost flat trend, but there is a group of data points showing a deviation from normal behavior, and that is a collective anomaly. So every time we execute an anomaly detection exercise, we should, one, be careful about what type of anomaly it is, and also try to understand the context behind it, so that we design the right process.

Okay, so let me move forward. Let's now deep dive into the use cases I wanted to focus on. These use cases, like I mentioned, are mostly in software engineering, and there are two that we'll talk about. I've tried to bring in as much real-life data and visualization as possible, but I am obviously restricted by the privacy needs of my
employer, but I have tried to ensure that you get the context of what we are trying to do and how we do it.

So let's look at the problem overview for the first problem. The first problem focuses on anomaly detection in data ingestion jobs. How does it work? The background is that there is a marketing division that receives data from multiple partners. These partners can be something like Google, Apple, or Facebook, which track and share data on advertising spend and advertising revenue. We capture this data from the different partners. Now, there are a hundred-plus partners, and the data ingestion process happens hourly, so naturally that leads to anomalies in your data process.

Think of it this way. You have spent a certain number of dollars, and you want to understand what kind of conversion you should be getting from that spend. What Google will share is the amount you spent and the kind of clicks you got, and that information is shared hourly. Because the data is quite large, we cannot do a one-time download or a one-time ingestion into our tables, so the data is ingested hourly, and that is what helps us track what the spend is, what the conversion is, and what is expected.

Now, given the complexity, because there are a hundred-plus partners and 24 ingestions happening over 24 hours, what happens is that often either the data ingestion is corrupted because the underlying data sources are not correct, or, due to server issues, the data is not imported correctly or the full data is not imported. In that case we will see a big deviation in the imported data versus the expected behavior, and that
is the overall problem we want to solve. What also happens in real life is that the data is used by business teams, and often the engineering folks don't have a tracking mechanism to identify whether the data was imported or ingested correctly. So it is the downstream business teams who, when they create their reports, realize that there is a big deviation from what they were expecting or that the values have changed quite significantly. That is where a process is required before the data actually reaches the business teams, to avoid two things. One is a long delay: for example, if the business team uses the data three days after ingestion, there should not be a three-day delay in identifying these anomalies. Two, once that delay has happened, the process needs to run again, because the data needs to be ingested again. So we want to minimize the delay in detection, and once anomalies are identified at runtime, we can automatically retrigger the ingestion scripts instead of waiting. That's the problem overview.

Now, what are the challenges we need to consider while devising an approach for this problem? One, the occurrence of these anomalies is quite rare. If you think about it, there are 24 ingestions a day for each of the hundred-plus partners I talked about, and the occurrence of failures is quite low: you will see roughly two to three anomalies per week across more than, say, twelve thousand ingestion-partner combinations. So the occurrences are very low, but the downstream impact is very high, because if you are not able to track the marketing spend correctly on your Google channel
or you are not able to track the conversions from the Facebook channel, the downstream business impact, and even the decisions you take going forward, will also be affected. So, one, the occurrences are very low.

The second part is that the data displays a seasonality and a trend pattern, and you will see this as we move to the other details. What essentially happens is that we do not have an absolute criterion that can be used to identify these anomalies, because, going back to the types of anomalies we discussed earlier, these are mostly contextual anomalies. So we need to take care of the context while identifying the anomalies and flagging them to the corresponding partners or engineering folks.

The third part is that, given the number of ingestions that happen every hour, this cannot be a very complex solution that requires many hours or many minutes to run. It has to be a low-latency solution that is integrated with the engineering ingestion APIs, and the anomaly alert should be raised quickly.

And finally, our data with each partner will have a different trend, so the detection technique should work across multiple use cases. Think of it this way: your spend patterns on Google will be different from those on Facebook or, say, Snapchat, and that needs to be taken care of when we are devising the approach. We cannot have one specific approach for each of these partners; that will not scale, and that is why we need to ensure that the technique works across multiple use cases.

The potential impact, as you might have guessed already, is multi-million dollars, because when the business teams get impacted, the business decisions they take also get impacted, and the delays are pretty huge. That is why the exercise becomes really important in an engineering world.
So now that we have the problem context, let me talk about the use cases. These are the three types of use cases that I observed in the data. Use case one essentially shows a very seasonal pattern, with some big anomalies in this particular region, and there are other anomalies in these areas as well, where the data points show a big deviation from the regular values. That is use case one. The key thing to note is that there is no net trend in this data; you will see it is mostly flat across a longer period of time.

Now, use case two. Think of use case one as data being ingested from one partner and use case two as data being ingested from another partner. Say you are looking at your spend on Google: your spend on Google may follow this pattern, but your spend on Facebook will follow some other kind of pattern. If you look at use case two, you see that the anomaly types are different. Now why do I say that?

For clarity's sake, I think there are questions. I'll come to the questions at the end, or should I take them right now? I just wanted to check with Analytics Vidhya: should I take the questions now or at the end? As you wish; if you want to take them right now you can, or you can take them at the end of the session. Okay, let me do it at the end of the session; at the moment it would just break the flow. But it's good to have these questions coming along.

So if I look at use case two, the context here for anomalies is quite different. Do not consider the red points for now; I'll talk about the red points later. But if you
look at the blue curve, you will see that there is a fairly flat line for what the pattern should look like, but then suddenly there is a big dip, and after that you again see a flattening pattern for what the data trend should look like. In this particular use case we did not want to classify all of the points from here onward as anomalies. We wanted to see where the dip was, and if the pattern becomes stable again, you don't want to call that an anomaly. That is very similar to use case three, where you will see that we don't want to classify all the data points within a particular range as anomalies. If there is a big shift but that shift is retained, we should be able to account for it and classify only the shift itself as an anomaly, rather than classifying all the data after it.

So the expectations from the algorithm change based on the use case, which essentially depends on what kind of data we are tracking from each partner. To give some idea of the complexity: the pattern of spend from Google will be different from the pattern of conversions from Google, which will be different from the pattern of conversions from Facebook. Each metric and each partner will have its own pattern, and the trick is to ensure that the anomaly detection algorithm works at scale and that we don't need to customize heavily for each partner and metric combination.

So now that you have a view of what the problem is and what the data patterns look like: the approach here was pretty simple. It's a three-step process that we use to identify the anomalies. A lot of techniques were considered while designing the approach, but these three steps form the core of it. So let me talk about each of the three steps
in a little more detail before I focus on the specific approach. Initially the data is raw data, so you will have some kind of geography, some kind of user cluster, and then the corresponding clicks, and so on. It is all rolled-up data; no data for a specific user is ever tracked. The data is rolled up at a daily, weekly, or monthly level, depending on how you want to identify the anomalies. So the metric is rolled up, whether it is the total gross bookings or the expected gross bookings; there are multiple metrics, computed either at a daily level or at a higher level, on which to perform the anomaly detection exercise.

Now, the anomaly detection exercise itself focuses on two items. One is classical seasonal decomposition, because in most of the curves, if I refer back to this, you'll see that there is a seasonal pattern. It is either weekly or monthly, but there is a seasonal pattern, and all of the anomalies in this case are contextual, so we need to take care of the context when identifying whether a data point is an anomaly or not. Then, once the seasonal decomposition is done, we can apply some simple rules to decide whether a data point is an anomaly, just by looking at whether it is within range after the seasonal decomposition.

And finally, once the anomalies have been detected, the idea is that we trigger an email, so that we can notify either the business owners or the engineering teams to trigger the scripts, or notify the business teams before they take further business decisions. That is a high-level overview of the approach. What I wanted to do was focus on these two particular elements; I think that is the core part where the group will be
most interested, so let's deep dive into these two elements. Let's talk about the first element, which is classical seasonal decomposition. This is something that folks familiar with time series forecasting, or anyone who has worked with time series curves, will recognize. If you look at a particular curve, there are two elements to take care of. One is the seasonality, which is nothing but a repeating pattern across months or across years, and then there is a trend pattern.

For folks who are not familiar with seasonality and trend, think of sales in a particular retail store. What you will see is that sales on Saturdays and Sundays are higher compared to Mondays through Fridays, and there is a shape that the curve follows: you will see the numbers being high on Saturday and Sunday versus weekdays. That is a reflection of seasonality. The second point is trend. For retailers who are growing, think of an Amazon or some other big retailer, maybe a Flipkart in India, you will see that their sales gradually increase over time. That is trend: you will see the values gradually moving up while following the seasonality pattern. So these are the two elements we take care of to perform this step.

The classical seasonal decomposition is done with a toolkit package, and you can see the kind of effect this package has on a particular curve. If you look at the left side, let's assume this is, say, spend from a particular partner. What you will see is that once the seasonal decomposition is applied, the seasonality is completely removed from this curve and you only have deviations from, say, a baseline, which
are captured in this particular curve. The key thing to highlight is that if a curve is perfectly seasonal, you will see a flat line on the yellow side, but that is not happening, because there are certain deviations from a flat line after seasonal decomposition.

So that is seasonal decomposition. Once the seasonally decomposed curve is obtained, we implement an IQR-based assessment. Those who have studied statistics should be really familiar with this: we take the 25th percentile and the 75th percentile, we use the interquartile range between them, and we designate as outliers all values that are, say, three times the interquartile range above the 75th percentile or three times the interquartile range below the 25th percentile. Again, I would love to deep dive into this more if you want me to; a package is used for this step as well.

The core idea I wanted to emphasize is why we implement it this way: there is a context that you need to take care of. Essentially, using classical seasonal decomposition (CSD) first, we remove the context from the curve, and in the second part, once the context is removed, we look at the range of the data points to flag the anomalies, which you can see in red. That is the whole approach.

Once you have designed the approach, you also want to ensure that you can track the performance. Performance in a software engineering context is not just about what precision and recall you get; it is also about how much time your script takes to execute and how consistent the performance, or your metric, is across different use cases. So what we did was implement a simulation methodology. I'll talk about this in a little detail so that it's easier for you to implement it as well if required.
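The two detection steps described above (remove the seasonal context, then apply an interquartile-range rule to what is left) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's implementation: the package used in the talk is not identifiable from the transcript, so the decomposition here is hand-written with a trailing moving average as the trend, and the function names, the weekly period, and the toy series are all hypothetical. The multiplier of 3 follows the "three times the interquartile range" rule mentioned in the talk.

```python
from statistics import mean, quantiles


def seasonal_residuals(values, period):
    """Additive decomposition sketch: value = trend + seasonal + residual.

    Assumption: the trend is a trailing moving average over one full period,
    so no future points are needed. The seasonal component is the average
    detrended value at each position in the cycle.
    Returns (index, residual) pairs for points that have a full window.
    """
    detrended = {}
    for i in range(period - 1, len(values)):
        window = values[i - period + 1 : i + 1]
        detrended[i] = values[i] - mean(window)

    seasonal = {}
    for pos in range(period):
        same_pos = [d for i, d in detrended.items() if i % period == pos]
        if same_pos:
            seasonal[pos] = mean(same_pos)
    return [(i, d - seasonal[i % period]) for i, d in detrended.items()]


def iqr_outliers(residuals, k=3.0):
    """Return indices whose residual is more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles([r for _, r in residuals], n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [i for i, r in residuals if r < lower or r > upper]


# Hypothetical daily metric: weekdays at 10, weekends at 20, eight weeks long.
series = [10, 10, 10, 10, 10, 20, 20] * 8
series[40] = 0  # simulate a failed ingestion on a weekend day

flagged = iqr_outliers(seasonal_residuals(series, period=7))
print(flagged)  # [40] for this toy series
```

A raw threshold on the values would not separate this point cleanly, because a value of 0 has to be judged against a weekend level of 20 rather than the weekday level of 10; after the seasonal pattern is removed, the same point stands out as a large negative residual. A centered moving average, as in textbook classical decomposition, would give a cleaner trend but needs points after the one being assessed.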
So think of this particular curve. When you are trying to assess whether a given point is an anomaly, at first, when you create the approach, you will have some designated set of anomalies already present, so you can validate the approach you have in place. Now, the way to assess whether your approach meets the required performance is to simulate only up to the data point you are assessing. For example, let's look at this particular data point. If you know that this data point is an anomaly, then when you are tuning to finalize your approach, you use only the data points before it to predict whether this point is an anomaly or not; you do not feed the whole curve to the algorithm. That is the whole simulation methodology. So for every data point in red that you see here, only the previous 15 to 20 data points, or the previous n data points, are used to establish whether the currently ingested data point is an anomaly or not.

Moving forward: for example, for date D1 you have a value V1, for date D2 you have a value V2, and so on; for date D17 you have the value V17. What the test does is this. You first run test one on the first 15 data points: you look at the trend and seasonality from the preceding data points to establish whether the 15th data point is an anomaly. When you move to test two, you consider all the data points before that and check whether the value on day 16 is an anomaly. That is your test two. Similarly, test three uses the first 17 data points to establish whether the data point on day 17 is an anomaly.

Now, a common question I've always faced is why you look at a different number of
data points when you are trying to assess the pattern. Why not look at 15 data points consistently? The idea is that we want to increase the information being fed in while we are detecting. There is no fixed pattern in terms of what fixed number of data points you want to consider, whether it is 15, 16, or 17, so we use whatever is available.

The benefit of using a simulation methodology is that it gives us a real-life simulation of how our model or algorithm will perform when it is deployed. It ensures that we are not mixing data points from the previous time period and the upcoming time period when establishing an anomaly. And it also enables long-term data to be used: like I mentioned, it is not a rolling window; we take a longer window into consideration while predicting.

The approach gives us very good results: 95-plus percent recall and a less than ten percent false positive rate across multiple use cases, and the compute time, as you can see, is very low, nearly five seconds. So that is use case one. I hope you have got an idea of what the approach is like and what the results were. We'll have a little more discussion at the end of the session, once we open up for questions.

The next use case is creating a self-healing system using system logs. This is a very interesting problem to solve, and the reason I say that is because the data we have is unstructured, and we need to figure out how to create features from it. To give some background: we are working with the data engineering division, which uses Spark to ingest data, do data manipulation, and do feature engineering. Each
data set that is ingested is run through a job, a Spark job. One interesting point here is that there are multiple layers of dependencies across these Spark jobs. If you are familiar with any schema, if you have worked with any database, you will notice that there is a second layer of tables that depend on the primary tables, and similarly the second-level jobs depend on the primary jobs. So there is a parent-child relationship between these jobs, and the jobs have a sequential order of execution. That is the context behind how the jobs are run and executed: there is a sequential order of execution and there is a parent-child relationship, so if your parent job fails, the child will also fail.

So what is the business problem? The business problem is that the jobs see close to a two to three percent failure rate, and these failures can fall into different buckets. Either there is a syntactical error in the script, or there are multiple jobs using the same resources, which is a resource constraint or resource usage problem, or the parent job itself has failed and hence the child job is failing, and there are some other custom buckets that can be defined by the engineer.

The second key problem is that there is no mechanism to perform early anomaly detection. To do that, we need to analyze the log files of the job while it is running, so we need to ensure that we capture a certain amount of data from the job while it is executing in order to detect the anomalies.

One of the key challenges we saw was that the data engineers had to manually skim through job logs to identify the reason for failure and then retrigger the script accordingly. For example, if it is a simple resource usage constraint, what you
do is look at the resource parameters at the time of the job, wait for them to go down, and then re-trigger the script. Or, for example, if it's a parent job failure, you notify the parent job owner to re-trigger their script, identify the failure reason, and then come back to the child job, and so on. So the engineers have to go through job logs, identify the reasons for failure, and then re-trigger the scripts accordingly.

Second, the job logs, as you can imagine, are unreadable machine content, often 10 to 20 pages long, so there is no easy way of searching them. You can try a Ctrl+F or a string search, but because these job logs are quite contextual to the job and the error conditions are very specific to the context of the job, you have to change the search a lot. Third, if you're trying to predict an anomaly while the job is running, the solution has to be a low-latency solution, because your solution to identify or predict an anomaly cannot take more than the job execution time itself. Those three challenges made this problem really tricky and quite interesting.

The potential impact of this problem is that we had to invest a separate set of engineers on weekdays and weekends just to monitor these jobs and identify and rerun those scripts, which is a boring and repetitive process. The idea is: how do we move away from that, create a more intelligent way of identifying why a job has failed, and predict whether a job is going to fail based on the parameters while the job is running?

So let's look at the approach. What are the key elements of the approach? I'll focus on the ones which deal with NLP and machine learning. There are four elements to the approach. The first one is breaking down the logs into different steps. So a log
file, as you would imagine, is a large chunk of text. How do you break it down into a set of steps so that each step carries a distinct set of information, rather than passing all the content as one large chunk of text? That is breaking down the logs into steps. The next part is using these steps to predict whether your job is going to be a failure or not, an anomaly or not; that's the anomaly prediction part. The third part is, if the job has failed, you look at a certain set of patterns to identify the root cause of the failure. And then, based on the root cause assessment, you do a self-healing of these jobs through automated script re-triggers. So these are the four steps which are followed to identify whether a job is an anomaly or not. There is a lot of NLP involved from the first step itself.

I think we have some time, so let me go into a little more detail on each one of these steps. If you look at the step breakdown of logs, how does that work? You have a large log file which is raw, and there is no readable context within it. So how do you break it down into steps? You essentially use the timestamps at the start of each line. You create custom functions which go through the first few characters of each individual line and try to identify whether it's a timestamp or not. Now, there are multiple types of timestamps, and there are timestamps across different time zones, so your function should be able to identify whether it's a timestamp or not, extract the timestamp from it, and then ensure that it is aligned with the time zone in which you are operating. So that is one.

Then, when you're creating these timestamps, there are certain exclusion rules which you need to follow to ensure that your step breakdown mechanism is working correctly. How do you do that? You look at the timestamps in line n plus one and then you compare them
with the timestamps in line n. We need to ensure that the timestamps in line n plus one are greater than the timestamps in line n, and that the difference between the timestamps of two consecutive steps is not more than one hour. Now, why is this rule applicable? Because generally the jobs have a typical execution time of around 30 minutes to an hour, so an individual step by itself should not take the full average duration of the job execution. That is how you ensure that the timestamps being generated are relevant and correct when you are identifying step n minus one, step n, and step n plus one.

Finally, look at the long-duration steps. There will be steps which take, say, more than 20 to 30 minutes to run. What we also want to do is dive into those particular steps to create further sub-steps. As a visualization, you have step one, step two which is fairly bigger, and then step three and step four. What you want to do is further break down step two into sub-step one and sub-step two. The idea is that you don't have large chunks to comprehend in one go; you have smaller chunks which can be passed through your NLP pipeline, which we'll talk about when we are identifying or predicting the anomalies.

Finally, just to visualize, because not all business owners will be able to appreciate each step, we also do a simple Gantt chart visualization in which we plot the steps and the sub-steps within them, so that we can see the sequential order of execution of each of the steps and sub-steps.

The next part is anomaly prediction. How do you do that? Once we have these steps in place, these jobs are run, say, on a daily or a weekly basis. So what we do is look at the last 10 to 20 runs of the job. Here n is 10; we also looked at 15 and 20 in some other
particular cases. You compute some very basic metrics out of them. For example, if step one in the current run is taking ten seconds, how much time did step one take earlier? What is the mean execution time of step one across the last 10 to 20 executions, what is the standard deviation, and so on? While the job is running right now, is step one taking a lot more time, or a lot less time, compared to the previous n runs?

The second part is looking at content similarity. This is where you don't just look at the deviation in the duration of the step; you also look at what the messages within the step are. If you have ever looked at a log file, you would know that each step is either about a resource, about the script trigger, about your resource being allocated or released, or about your query being broken down into multiple sub-queries. So each step also has a meaning associated with it. This is not very intuitive at first, but what you want to do with this process is identify what the content in that step is and compare it with what that step's content was in the last n runs. So if step two is generally the allocation of resources and the successful allocation of resources, we see that content being the same across the last 10 runs, we see what the content for step two is in the current run, and you compare the two to identify whether there is a deviation. Both of these factors are considered together to establish whether a particular step is an anomaly or not, and whether the particular execution which is happening right now is an anomaly or not.

I want to deep dive into this part a little more so that it's clear. Let's say you have step one and step two, and step two is broken down into sub-step one and sub-step two, which
was the outcome from the first item in our approach. Now, there are two parts. First, from the last ten iterations we identify the time taken for step one: the average time taken is, for example, ten seconds, and the standard deviation of that time is, say, around two seconds. Again, for step two, how much time does it take on average and what's the standard deviation? And again for step n, what's the average time and what's the standard deviation? Based on that, you create a threshold, again mu plus or minus three sigma. You can vary it; we looked at these constants to identify whether this particular pattern is working or not. So that is the threshold which you use.

On the content side, how do you do content similarity? What we did there was simple stuff. We did text cleaning: removal of stop words, removal of some punctuation, and exclusion of non-English vocabulary words, because if you look at any log file, it has a lot of these underscore-joined words which really don't have any embeddings in most of the models. Then we ensured that the text is case-consistent. We used the all-MiniLM-L6-v2 model to generate the embeddings. The reason we did this was that this model is tuned for content similarity, has a very low processing time, and has very good performance; you can see the details in a link which I'll share later. Then we take the average of the embeddings for step two across the last 10 runs to create a kind of embedding reference for each step. So you will have an array, or an embedding matrix, for each of the steps, which can be used as the reference for what the content should look like.

Once you have these two, for every step from the last n runs you will have an expected duration, so step one should be between four and sixteen seconds, and the
step one should look like this in terms of the embedding matrix; step two should be between 11 and 41 seconds and should have an embedding matrix as such. So these are the two reference points you will have. What you then do is check, while the job is actually running, whether the step is either crossing the duration threshold or its similarity to the embedding matrix is very low compared to the last n iterations. Once you have this assessment, you can use it to flag whether the step is an anomaly or not, and once a set of steps is flagged, you can essentially say that the job is heading into an anomaly. A lot of the time what happens is that the job gets stuck, and the owners only get notified, say, 12 or 24 hours after the job got stuck. What this will do is send a real-time alert while the job is still running: there are 16 steps in your job, step number 14 is deviating way too much from the expected behavior of the last 10 runs, and hence we are predicting that the job is going wrong. That is how the anomaly prediction model works.

Job failure and root cause analysis is the last piece. Once the job has failed, and this is based on the flagging from the internal systems themselves, we have a set of strings which relate to the types of errors in that particular job. If it's a syntactical error, you will see a particular string. This is simple; I didn't want to do anything fancy. There is a set of strings for syntactical errors, a set of strings for resource usage, for parent job failure, and so on. We also gave the data engineers the flexibility to create their own custom errors, which they generally see when they run their jobs, and
based on that, once you have those patterns and those embeddings or lists of strings, you can do a root cause assessment and then re-trigger the scripts accordingly. What we were able to do was this: if it's a syntactical error, we can correct those errors quite quickly, because it's either a missing bracket or some missing SQL statement. That is validated before we rerun, because we want the engineers to know what the revised query looks like. The second part is resource usage: if there's a resource usage issue, we wait for the resource numbers to go down before we re-trigger those scripts. And if it's a parent job failure, we have identified for each child job what the parent job is, and we wait for the parent job's execution to complete before we run the child job again. These are the re-trigger scripts which are executed as a downstream process from the anomaly detection model.

Finally, in terms of the methodology, we picked a set of 20 jobs which had 1,500-plus runs. We tried to establish whether the model is able to predict an anomaly or not, and our model was able to achieve a 90-plus percent recall and, sorry, not an 80 percent false positive rate, my bad, a less than 20 percent false positive rate, across nearly 1.5k runs. That is the kind of performance we observed from this model. The compute time was also fairly low: you will see that the compute time can be 10 milliseconds, which, when integrated as part of an API, was fairly low compared to the other systems. So those are the results from this module. As for assessing the root cause, that is the most straightforward part, so the performance numbers are
quite good from the model here. So that's it. These are the two use cases which I wanted to cover. I think I'm right on time and I have time for Q&A, but I hope you're able to relate to these use cases and understand the performance of this model. I'll give the mic back to the Analytics Vidhya team. Do you want to chime in, or should I just start with the Q&A?

So now you can start.

Okay. The first question is: for real-time use cases, in which industries is anomaly detection crucial? Real-time use cases are especially important where the subsequent set of downstream processes is immediately triggered based on the current event. Let me give you an example from manufacturing. If you look at steel manufacturing, there is a certain set of processes to follow, and at every step you need to evaluate the percentage of carbon content in your steel. The carbon content in your molten metal is a very important indicator of what grade of steel you will get. At the end of each step you evaluate the carbon content. Now, if you get a very low-grade molten material at, say, step three out of 10 steps, all your subsequent six or seven steps will work on a lower-grade molten metal, and that essentially spoils the resources of the factory. In a certain context, each batch requires at least 50 crores of raw materials, so you're spoiling 50 crores of raw material if you don't detect the anomaly quickly, say within 5 to 10 minutes. Those are the cases where real-time detection is important. And even in
the case I talked about before, if your process n fails and there are processes n plus one and n plus two which are dependent on n, these processes will automatically fail because process n has failed.

[Question partly inaudible; it appears to ask at what stage of processing the detection is carried out.] Anomaly detection can happen after the event. Like I mentioned, it's either a detection or a prediction. If you're trying to detect an anomaly, you have to let the full set of events happen first and detect after that. If you want to predict, obviously the event shouldn't have happened before you make the prediction. So it depends on the context in which you're operating.

We have a question from Arun: why is this an application of AI and machine learning? It is able to automate the processes and the checks and implement a self-healing layer, triggering a custom script based on what the root cause essentially is.

We have one more: is it an unsupervised task? Anomaly detection can be supervised, unsupervised, or semi-supervised; again, it's context-driven. In cases where you have a lot of training points and you have features which can help you predict the anomaly, you can create, say, an isolation forest model and predict whether a data point is an anomaly or not. But in cases like the ones I talked about, it's not really a predictive model as such; you have to create the features out of the NLP, out of the log files or text, and ensure that the subsequent data points follow the expected pattern, and that is where it becomes semi-supervised. And in certain cases, like the first case which I mentioned, where you just look at your data points and you're not training any model as such, it's
completely unsupervised. So it can be any one of them.

We have a question from Rahul: could you let us know how you approach a particular anomaly detection problem, in terms of which technique you start with first and the process you go through to arrive at the techniques that work best and are least complex?

There is no one answer to this, but the way it's done is this. Before you start working on a problem, you identify the constraints you're working with, because the technique essentially depends on those constraints. Is latency a problem? Is the type of data a problem? Is the context a problem? What kind of data do you have available? Each technique has its own requirements: if you look at an isolation forest, you need a large number of training points to build your model, but if you create rule-based techniques based on, say, the previous n data points, only those are required. So you look at what type of data you have available, what volume of data is available, and how fast you need to make your predictions. Then you do some research to identify what techniques are available and trim them down to a possible set of techniques that might be applicable. Once you've trimmed down that set of techniques, it's a matter of testing them and doing a very thorough assessment of the performance. That's how I approach any kind of problem.

Guys, please post your questions in the Q&A section, not in the chat box. There is one more question, from Alok Shinde, who wants to ask how to detect anomalies in text data in PDFs.

For text in PDFs, I might be a little bit off, but there are standard packages which can read text from PDF files as well; I would suggest a quick search. Once you have the text from the PDF file, you can just process it and carry on. But there are certain cases where, for
example, if you have a set of scanned PDF pages, it becomes difficult. And where you are scanning handwritten files versus, say, typed text files, there are again different packages with which you can process them. The only thing you need to ensure is that when you're reading those files and extracting content from them, the content is accurate and correct.

The next question: how do you balance your model, and what steps do you use?

If you can elaborate a little more on what you mean by model balancing; there are different definitions, and I just want to ensure that I'm talking about the right context when I'm answering. Are you talking about generalization of the model across different scenarios or use cases, or is it something else? Okay: few failures, lots of non-failures. Okay, so class balancing. Anomalies by themselves will obviously have a low failure rate, and there are multiple ways to balance that out. If you're using a normal random forest, for example, you can just do subsampling: you drop certain records and increase the proportion of your ones. That is one. There are also multiple use cases where SMOTE, which is synthetic sampling, is used. That is two. But in this particular case, what I would want to ensure is that the model I create looks at the failures and also looks at the volume of non-failures, the successes, as well. So I would not do a lot of balancing when I'm creating the model; I would ensure that the failures are captured correctly through the features which I explained. Generally you would use SMOTE or you would do undersampling, but in this particular case I would limit the amount of undersampling I do when I'm implementing this use case.

Another question I have is from Monali Roy: how does
anomaly decomposition and predictive modeling of data [unclear] in this use case?

Well, I am not sure I followed that question correctly. Could you type it out?

Can you share how these logs were formatted and what kind of information was logged along with the timestamp? Was the log format decided earlier, or could you process all the logs that were available?

These log files, as you would imagine for any Spark jobs, contain the timestamp and also the message at that timestamp: what happened at that particular timestamp, whether it was job execution, resource allocation, or something else. So each entry was about three or four lines of machine content which is not really readable. In terms of whether I had the log files available with me: yes, I had the log files available when I was finalizing the approach, and that is how I customized my data cleaning and the model I use for generating the embeddings, to ensure that the embeddings really reflect the content within each step.

I have an anonymous attendee who asks: have you seen cases where the techniques are adapted and extended to bias detection?

Yes, there are cases where the techniques can be adapted to bias detection, but I would need more context on what type of problem you're trying to solve rather than giving a generalized answer.

I think now we can wrap up the session. Thanks a lot, Paritosh. On behalf of Analytics Vidhya, I would like to thank you for devoting your time and for delivering such a great session. I'm sure it was insightful, very comprehensive, and appropriate for many experience levels. Hopefully we can conduct more sessions with you in the future. Thank you so much.

Thank you so much.
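
The step-level check described in the talk (a duration band of mean plus or minus three standard deviations, combined with similarity to an averaged reference embedding) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the 0.8 similarity cutoff, the use of the population standard deviation, and the toy three-dimensional vectors are all hypothetical and not taken from the session. In a real pipeline the vectors would come from a sentence-embedding model rather than being typed in by hand.

```python
import math
import statistics


def duration_band(history, k=3.0):
    """Return (low, high) = mean -/+ k * std of past step durations in seconds."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)  # population std is an assumption here
    return mu - k * sigma, mu + k * sigma


def mean_vector(vectors):
    """Element-wise average of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_step_anomalous(duration, history, embedding, past_embeddings,
                      k=3.0, min_similarity=0.8):
    """Flag a step if its duration leaves the band OR its content drifts.

    min_similarity is a hypothetical cutoff; it would need tuning on real logs.
    """
    low, high = duration_band(history, k)
    duration_flag = not (low <= duration <= high)
    reference = mean_vector(past_embeddings)  # reference embedding for the step
    content_flag = cosine_similarity(embedding, reference) < min_similarity
    return duration_flag or content_flag


# Toy data: ten past durations with mean 10 s and std 2 s, as in the talk's example.
history = [8, 12, 8, 12, 8, 12, 8, 12, 8, 12]
past_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]

print(duration_band(history))  # (4.0, 16.0)
print(is_step_anomalous(11, history, [1.0, 0.0, 0.0], past_embeddings))
print(is_step_anomalous(30, history, [1.0, 0.0, 0.0], past_embeddings))
```

Combining the two flags with a plain "or" follows the description of a step being flagged when it crosses either reference point; a stricter or weighted combination would be an equally reasonable design choice.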

Original Description

DataHour: Anomaly detection using NLP and Predictive Modeling

In this DataHour session, Paritosh will cover the fundamentals of anomaly detection and discuss its application in job management and exception handling. Rather than proposing a single approach, this session will identify the nuances that should be considered while designing an anomaly detection solution. The session will be aided by real-life examples that leverage NLP, predictive modeling, and other ensemble methods to identify, correct, and prevent anomalies.

Prerequisites: No prerequisites. A preread of this article will enable familiarization with the core topic.

🔗 More action-packed sessions here: https://datahack.analyticsvidhya.com/contest/all/

Stay on top of your industry by interacting with us on our social channels:
Follow us on Instagram: https://www.instagram.com/analytics_vidhya/
Like us on Facebook: https://www.facebook.com/AnalyticsVidhya/
Follow us on Twitter: https://twitter.com/AnalyticsVidhya
Follow us on LinkedIn: https://www.linkedin.com/company/analytics-vidhya

The DataHour session covers anomaly detection using NLP and predictive modeling, with a focus on software engineering and product-oriented companies. It discusses techniques such as seasonal decomposition, contextual analysis, and a simple rules-based approach, and shows how to apply them in real-world scenarios.

Key Takeaways

Trigger an email to notify business owners or engineering teams
Perform classical seasonal decomposition to identify anomalies
Take context into account when identifying anomalies
Use simple rules to determine if a data point is an anomaly after seasonal decomposition
Roll up raw data at a daily, weekly, or monthly level to identify anomalies
Break down large log files into individual steps using timestamps
Extract timestamps and align them to the same time zone
Apply exclusive rules to ensure correct step breakdown
Identify the root cause of a failure and perform self-healing through automated means
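
The seasonal decomposition and simple-rule takeaways can be illustrated with a minimal sketch. This is not the speaker's implementation: the function names, the additive model, the 7-day period, the 3-standard-deviation threshold, and the sample data are all assumptions chosen for illustration. It estimates the trend with a centered moving average, averages the detrended values per position in the period to get the seasonal component, and flags points whose residual is unusually far from the mean residual.

```python
from statistics import mean, pstdev


def centered_moving_average(values, period):
    """Trend estimate: centered moving average over one full seasonal period."""
    half = period // 2
    trend = [None] * len(values)  # edges stay None: the window does not fit there
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        if period % 2:  # odd period: plain average of `period` points
            trend[i] = sum(window) / period
        else:  # even period: half weight on the two end points (2 x period MA)
            trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    return trend


def decompose_additive(values, period):
    """Classical additive decomposition: value = trend + seasonal + residual.

    Needs at least two full periods of data.
    """
    trend = centered_moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    # Seasonal effect: average detrended value at each position in the period.
    seasonal_means = []
    for pos in range(period):
        bucket = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == pos]
        seasonal_means.append(mean(bucket))
    offset = mean(seasonal_means)  # center the seasonal effects around zero
    seasonal = [seasonal_means[i % period] - offset for i in range(len(values))]
    residual = [v - t - s if t is not None else None
                for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual


def flag_anomalies(values, period, threshold=3.0):
    """Simple rule (an assumption): residual beyond `threshold` std devs of the mean."""
    _, _, residual = decompose_additive(values, period)
    known = [r for r in residual if r is not None]
    center, spread = mean(known), pstdev(known)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residual)
            if r is not None and abs(r - center) > threshold * spread]


# Hypothetical daily roll-up: six weeks with a weekend dip, plus one injected spike.
weekly_pattern = [100, 102, 98, 101, 99, 60, 55]
daily_counts = weekly_pattern * 6
daily_counts[20] += 80  # the injected anomaly
anomalies = flag_anomalies(daily_counts, period=7)
print(anomalies)
```

The weekend dips are not flagged because the seasonal component absorbs them, which is the point of decomposing before applying a rule. The mean-and-standard-deviation rule is only one option; a median-based rule would be less sensitive to the anomalies themselves.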

💡 Anomaly detection using NLP and predictive modeling can be applied to various domains, including software engineering and product-oriented companies, and requires careful consideration of context, seasonal patterns, and data quality.
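
The log-handling takeaways (splitting a log into steps by timestamp and aligning time zones) can be sketched in the same spirit. The log format, the sample lines, and the rule that a line without a timestamp belongs to the previous step are assumptions made for this example, not details from the session.

```python
import re
from datetime import datetime, timezone

# Assumed format: each step starts with an ISO 8601 timestamp carrying a UTC offset.
TS_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})\s+(.*)$"
)


def split_into_steps(log_text):
    """Split a log into steps, one per timestamped line, with times in UTC."""
    steps = []
    for line in log_text.splitlines():
        match = TS_PATTERN.match(line)
        if match:
            local_time = datetime.fromisoformat(match.group(1))
            steps.append({
                "time_utc": local_time.astimezone(timezone.utc),
                "lines": [match.group(2)],
            })
        elif steps:
            # No timestamp (e.g. a stack trace): attach to the current step.
            steps[-1]["lines"].append(line)
    return steps


# Hypothetical log written by services in three different time zones.
SAMPLE_LOG = """\
2022-03-01T10:00:00+05:30 extract started
2022-03-01T04:31:10+00:00 extract finished
2022-03-01T04:31:12+00:00 load failed: connection reset
Traceback (most recent call last):
  File "load.py", line 12, in <module>
2022-02-28T23:32:00-05:00 retry scheduled
"""

steps = split_into_steps(SAMPLE_LOG)
for step in steps:
    print(step["time_utc"].isoformat(), "|", step["lines"][0])
```

Once every step carries a UTC timestamp, durations between steps can be compared directly, and the step containing the failure message is a natural starting point for root-cause analysis.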
