Exploratory Data Analysis: Outlier Detection and Normalization
Key Takeaways
Outlier detection and normalization in exploratory data analysis using statistical methods
Full Transcript
Hello everyone, I hope you are all doing well. This is Priya, and in this video we will continue our project. In the last video we wrapped up descriptive statistics: we discussed their significance, the correlation coefficient and its significance, types of distribution, and how the median is more robust to outliers than the mean. We then looked at data imputation and when to use the mean versus the median. Our conclusion was that for symmetric data, mean imputation is the better choice, and for non-symmetric (skewed) data, median imputation is the better choice, because the median is more robust to outliers.
In this session we continue our EDA task. We have analysed the data and done the imputation, so now let's examine whether we have outliers in our data and, if so, how we can handle them. A bit of maths is required for this, and I will walk through everything just as I did in the previous session.
The first task is outlier detection. To examine the outliers, I will first split my data into input features and a target value. As you can see, all the columns from pregnancies, glucose, BP, and skin thickness through age are the input features, and outcome is
the target value. So I will take X for the input features and y for the target value, and split the data into X and y. How? Very simply: we drop the outcome column. We can say DataFrame dot drop, which lets me drop any column I want; here I only want to drop outcome, so I pass columns equals outcome, and at the end specify the axis, because I want to delete column-wise. After that, y is simply the outcome column of the data frame. So in X I have removed outcome, leaving only the input features, and in y I have selected only outcome, so I have the target column.
Now X and y are with us. What's next? Let's understand how to detect the outliers. There are multiple ways to detect outliers, but the most well-known plot, which everyone should know, is the box plot. It is the plot best known specifically for outlier detection. We also have the violin plot, and other ways such as the z-score and the empirical rule; maybe in future sessions and projects I will talk about those as well. For now, this is the common way that many people in industry use. So let's try to
understand the maths behind it. As you can see, we set a figsize, which creates a figure of 15 by 15. I have used the subplots function because I want the box plots for all the features in one figure. Inside it I call sns.boxplot with X as the data, and at the end I save the figure as boxplot.jpg. Here you can see the box plots we get.
Let me download this box plot and explain what it signifies; only then will you understand the maths behind it. Let me open Epic Pen for the illustration. In a box plot, this line is called the upper whisker. I am explaining the concept using the glucose feature, and the same concept applies to the box plots of all the other features. This one is called the lower whisker. I hope my screen is clearly visible to everyone. Then this is Q1, this is Q2, and this is Q3. Q1 is the 25th percentile value, Q2 is the median, which is the 50th percentile value, and Q3 is the 75th percentile value.
What does outlier mean here? Simply that any value which lies above the upper whisker or below the lower whisker is considered an outlier. For example, in insulin you can see there are many outliers; all these
are outliers. Similarly, all of these are outliers, and so are these. So for nearly every feature we can see many outliers; glucose is the one that doesn't show any for now. I hope it is clear that a box plot is composed of five important values: Q1, Q2, Q3, and the lower and upper whiskers. Any value that lies outside the range between the lower and upper whisker is considered an outlier. In the plot, the round points you see are the outliers in my data. This is the way to detect outliers.
How can we deal with these outliers? There are different ways. There are empirical formulas we can use, and there are ways to compute the values of the upper and lower whisker and then remove whatever lies outside that range. I will show you one of these ways.
Let's understand the maths behind computing the upper and lower whisker with an example. Say I have a box plot, drawn horizontally for now. This is the lower whisker and this is the upper whisker, and this is Q1, this is Q2, and this is Q3. We will compute the values of the upper and lower whisker, so it is
Q3 plus 1.5 times the IQR for the upper whisker, and Q1 minus 1.5 times the IQR for the lower whisker. This is the main formula for the lower and upper whisker. We will apply the same formula in our code: once we have these two values, any point below the lower one or above the upper one is an outlier, and we have to remove it.
For those who are not familiar with statistics, you may ask what this IQR is. I am assuming you already know the statistical concepts, but I will explain it in a nutshell. If you are still struggling, it means there is a gap in your statistics concepts; please focus on those first and then come back here. IQR stands for interquartile range, and its formula is Q3 minus Q1. As I told you, Q1 is the 25th percentile, Q3 is the 75th, and Q2 is the 50th. So we look at the value corresponding to Q3 and the value corresponding to Q1 and take the difference between these two values.
We know we have multiple columns, and I will now show you the code to detect the outliers; you can see there are outliers in several features. We write down the column names: pregnancies, glucose, BP, skin thickness, insulin, BMI, diabetes pedigree function, and age as
well. I have put all of these names in a list named cols, which stands for columns. Now we can traverse the entire list, and with the help of the quantile function we get the value at the 25th percentile and the value at the 75th percentile for each column. If you want to know the internal maths behind calculating the 25th, 50th, or 75th quantile, please look up the statistics concepts; it is very simple. I am not going into that much depth because the agenda here is the project side of machine learning, of logistic regression, and if I went that deep into statistics the videos would become very long. But please do build an understanding of these things.
In a nutshell, the function returns the 25th and 75th percentile values, so you get Q1 and Q3. As I told you, the IQR is the difference Q3 minus Q1, and from it you get two values: the lower bound and the upper bound, or the lower whisker and upper whisker. Once you have those, you keep all the records that are both above the lower bound and below the upper bound. These are the mask values. If a data point lies within this range, it is a point you should include; if a point is below the lower bound or above the upper bound, it is a point you have to exclude, because, as we just discussed, those are outliers. So any point between the lower and upper bound is a point
which is not an outlier. So what I have done is keep all the records that are not outliers. This reduces the outliers to a large extent, but there are other approaches to explore as well. For example, you can look at quantiles: by the empirical rule in statistics, about 95% of the data points of a roughly normal distribution lie between μ − 2σ and μ + 2σ. So what if I keep the data points only up to the 95th percentile; will that remove the outliers or not? You have to try statistical ideas like these to improve on this. For now I am showing this approach.
Let's run this code. Now we have the mask, and we take only the masked records for both X and y. If I show you X.shape, you can see that we currently have 768 records and eight features, and y.shape shows the same number of records. After applying the mask I name the results X_outlier_detection and y_outlier_detection; you could say removal, or after outlier removal, instead. This is the name that comes to mind right now, and you can certainly rename it to something better. If you check the shape of the new data, you can see that initially the number of data points was 768 and now it is 759, so only a small number of points has been removed as outliers. If you check the shape of y_outlier_detection, it is the same. So we have removed a few data points, but there will still be outliers, so you have to explore the proper
range, one where no outliers remain. Or rather, don't aim at removing 100% of the outliers, but you should be able to remove most of them; that is the target. This is one way. In the upcoming session I will discuss a second way, and then a third, so that in every project we learn something new.
That is all about outlier detection. We explored the box plot (the violin plot does a similar job) and saw how, with the knowledge of the box plot, we can determine the lower and upper bound and keep only those records that lie between them.
The second important thing we have to take care of is the standardization or normalization of values. What is the issue with the values in the box plot? If you look at this picture, some values are very low and some are very high. This kind of format creates bias in the model. Let me give you a simple example. Say you have a data set with two features, age and population. You know age is rarely more than 100; some people may reach 105 or 110, but not more than that. Population, however, can be in the millions or crores. So one feature has a very small range and another feature is
spread over a very large range. Internally, your model can then be biased towards one feature: it will assume that this feature has a higher priority than the age one. That is not the case, and I don't want this to happen. To remove this bias there is the concept of standardization, and that is what we will learn now. For this we have two methods in machine learning: one is standardization, and the other is the MinMaxScaler.
As this is part of pre-processing, you use sklearn.preprocessing: you import the StandardScaler and then apply the fit_transform method. Internally it converts your data into standard normal form. Again there is a bit of maths, or rather statistics, here. Standard normal form simply means a form where the mean is equal to zero and the standard deviation is equal to one. So you import the StandardScaler, create a StandardScaler object, and fit it. I think we need to do this for X_outlier_detection, because that is the data we are working with now, and I will name the result X_scaled. So let's run it again; now our data is scaled. If I show you the box plot again on the scaled data, you will see the difference. I am using this new data, X_scaled. Just try to
look at this data. Can you see that all the data points are now in one specific range? And you can still observe that there are many outliers which we have to get rid of. Another way to do that could be the quantile approach I was talking about. But for now I think the main purpose of standardization is clear to everyone: we are bringing every feature onto a similar range of values.
If you want to see the descriptive statistics of X_scaled, I can use the describe function here. What's the issue? X_scaled, I think that is the right name. I think it is not a data frame, and that is why there is an issue. So we can create a data frame first: I can say X_scaled equals pd.
DataFrame, and inside that pass X_scaled again. I think it should work; yes, it is working fine. Now you can see that our mean is approximately equal to zero. Don't say it is not zero: you can see the exponent is e to the power minus 16, so it is effectively zero. And you can see the standard deviation is approximately equal to one. That is exactly what the StandardScaler does. Another way could be the MinMaxScaler, which takes the minimum and maximum values and does the conversion with them. You can use either one.
So this is the overall idea behind standard normal form, or standardization. In upcoming projects I will not use standardization; I will probably use the MinMaxScaler. The concept is the same, we just have to import MinMaxScaler, and you can do your own research as well, but to give you a different perspective I will show a different concept in the upcoming video.
We have seen how, with the knowledge of the lower and upper whisker, or lower and upper bound, we can reduce the number of outliers to some extent, but there is still a lot pending. The second approach for dealing with outliers we will discuss in the upcoming session. So in part three of EDA we will discuss two things. The first is the second way of removing outliers, approach two, which uses quantiles; so far we have talked only about the first approach, which is based on the box plot. The second is imbalanced data and how to handle it.
But before I wrap up this video, one important and interesting thing I want to talk about here is, if you
ask how we can detect imbalanced data in the first place, what is the approach? We have y, so let's call y.value_counts() and see what we get. The value_counts function shows how many values are zeros and how many are ones. You can clearly see that we have 500 zeros and 268 ones. But we are not working with the original data any more; we are working with the updated data, y_outlier_detection, which now has 759 points, so let's look at that. You can clearly see that the number of zeros is almost double the number of ones, which is a major issue. Ideally, when you are dealing with a classification problem, the numbers of zeros and ones should be approximately equal; if there is a difference, it should be at most around 10%. But here class zero has almost twice as many data points.
This also makes sense, and it is a major issue in real life, especially in healthcare data. Let me take a very simple example: the data available for non-cancerous patients is definitely more than the data for cancerous patients. Similarly, less data is available for diabetic patients than for non-diabetic people. That is why this issue comes up, and we will discuss how to handle it.
Now, to wrap up, we have discussed two main things in this session. The first is the detection of outliers: how we can detect them, and we have seen how we can handle
them with approach number one. The second is how to do normalization via the StandardScaler, where we convert our data to have a mean of zero and a standard deviation of one. We have also discussed why it is important: so that we can reduce the bias in the model. I hope it makes sense.
I don't want to make this session any longer; the upcoming session will probably be the last part of the EDA. You may notice how much detail I am going into for these EDA concepts. The only reason is that in companies, too, you will spend around 90% of your time on EDA, because the higher the quality of the data you produce, the better your chances of achieving good accuracy. So please try to understand all these concepts thoroughly.
Sometimes you might feel there is a gap in the sessions because I am using statistical terms that may be jargon for you. The only thing I will say is, please build up your knowledge of statistics. If you feel you are not able to follow the statistical part, take the help of other resources to understand those concepts, and then come back and see what I am trying to convey. That is the only thing you might miss if you are not familiar with statistics, because in EDA statistics is very important, and I am not covering it in much detail here because of limited time.
With this, let's end today's video. I'll see you all in the upcoming video, where we will discuss the part that is still pending, and then we will discuss in
the upcoming sessions how to do the final modeling and prediction with the logistic regression model that we want to cover here. I hope you are really enjoying these sessions, and I'll see you all very soon in the upcoming video, where we will proceed further with the EDA.
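The IQR rule described in the transcript can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code shown in the video: the helper names iqr_bounds and keep_mask, the toy rows, and the choice to keep a row only if it lies inside the bounds for every listed column are all assumptions made for illustration. The video uses the pandas quantile function; here statistics.quantiles with method="inclusive" plays that role, interpolating linearly between data points.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def keep_mask(rows, columns, k=1.5):
    """True for each row that lies inside the bounds in every listed column."""
    mask = [True] * len(rows)
    for col in columns:
        lower, upper = iqr_bounds([row[col] for row in rows], k)
        mask = [m and lower <= row[col] <= upper for m, row in zip(mask, rows)]
    return mask


# Made-up toy data, not the diabetes dataset used in the video.
rows = [
    {"Glucose": 90, "Insulin": 80},
    {"Glucose": 100, "Insulin": 85},
    {"Glucose": 110, "Insulin": 90},
    {"Glucose": 120, "Insulin": 95},
    {"Glucose": 130, "Insulin": 600},  # extreme insulin value
]
outcome = [0, 0, 1, 0, 1]

mask = keep_mask(rows, ["Glucose", "Insulin"])

# Apply the same mask to features and labels so they stay aligned.
X_clean = [row for row, keep in zip(rows, mask) if keep]
y_clean = [label for label, keep in zip(outcome, mask) if keep]

print(iqr_bounds([row["Insulin"] for row in rows]))
print(len(rows), "->", len(X_clean))
```

In this toy data only the last row is dropped, because its insulin value of 600 lies above the upper bound for that column. Filtering the labels with the same mask is what keeps X and y at the same length, as the shape check in the transcript expects.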
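What standardization and min-max scaling compute can also be written out by hand. This is a minimal sketch under stated assumptions: the function names and the age and population numbers are made up for illustration, and it is not a replacement for the scikit-learn classes named in the transcript. The population standard deviation (pstdev) is used because that is the quantity StandardScaler divides by.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("constant column: standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant column: max equals min")
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values on very different scales, as in the age/population example.
age = [21, 35, 50, 64]
population = [1_000_000, 2_500_000, 4_000_000, 9_000_000]

age_z = standardize(age)
population_z = standardize(population)

# After standardizing, both columns have mean ~0 and standard deviation ~1.
for name, column in (("age", age_z), ("population", population_z)):
    print(name, round(fmean(column), 6), round(pstdev(column), 6))

print(min_max(age))  # smallest age maps to 0.0, largest to 1.0
```

One detail worth knowing when reading the describe output: pandas reports the sample standard deviation there, so a column scaled with the population standard deviation shows a value slightly above one rather than exactly one.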
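The class balance check at the end of the session can be reproduced without pandas. This is a minimal sketch: class_shares is a hypothetical helper, and the label list is rebuilt from the counts quoted in the transcript for the original data (500 zeros and 268 ones) rather than loaded from the dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that falls in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}


# Labels rebuilt from the counts quoted in the transcript.
labels = [0] * 500 + [1] * 268

counts = Counter(labels)
shares = class_shares(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts[0], counts[1])
print({cls: round(share, 3) for cls, share in shares.items()})
print(round(imbalance_ratio, 2))
```

The ratio of the larger class to the smaller one is a quick single number for how far the data is from the roughly equal split the transcript describes as ideal.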