Detecting Outliers in Your Data With Python | Real Python Podcast #208
Key Takeaways
The video discusses detecting outliers in data using Python, covering libraries, techniques, and applications in data science and analysis. It highlights the importance of outlier detection in various fields, including finance, industry, and network security.
Full Transcript
Welcome to The Real Python Podcast. This is episode 208. How do you find the most interesting or suspicious points within your data? What libraries and techniques can you use to detect these anomalies with Python? This week on the show, we speak with author Brett Kennedy about his book, Outlier Detection in Python. Brett describes initially getting involved with detecting outliers in financial data. He discusses various applications and techniques in security, manufacturing, quality assurance, and fraud. We also dig into the concept of explainable AI and the differences between supervised and unsupervised learning. Today's episode is brought to you by apilayer.com, your go-to API marketplace for seamless integration and reliable APIs. All right, let's get started. [Music]

The Real Python Podcast is a weekly conversation about using Python in the real world. My name is Christopher Bailey, your host. Each week we feature interviews with experts in the community and discussions about the topics, articles, and courses found at Real Python. After the podcast, join us and learn real-world Python skills with the community of experts at Real Python.
Hey Brett, welcome to the show.

Hi, thank you for having me.

Yeah, so Christopher Trudeau, who's writing a book for Manning, also reached out and said that you might be interested in coming on the show to discuss outlier detection in Python. And I was like, oh, that sounds really cool. I don't know much about it. I thought it would be fun to have you come on the show and discuss it. So maybe you could describe a little bit about the book that you're writing.

Sure, yeah. So it is a book about outlier detection, kind of generally. Well, the focus of the book is on tabular data. We get a little bit into time series data, image data, text data, and some other modalities, but the focus is working with tables of data and trying to find the interesting records in there: the nuggets, the values that are interesting for one reason or another. They might indicate an error, they might indicate fraud, or just something new and interesting in the data.

Has this been a long process? Like, why did you get interested in writing the book?

Well, my working with outlier detection has certainly been a long process. I've probably been working with that for about seven or eight years. The book itself is probably about a year. I mean, it is a major commitment, just the amount of time you spend thinking about outlier detection and coming up with good examples of everything. I reread, I don't know, dozens, probably over a hundred papers just to make sure I wasn't saying anything incorrect in there. Anyways, it's something I was happy to do because it's something I've long found really fascinating. It's just an intellectually interesting area of machine learning, so something I was keen to do.

Yeah. So you mentioned you've been kind of focused on this for about seven or eight years. Maybe you can talk a little bit about getting into that, and maybe that relates to what you do for your day job and
how Python's involved.

Yeah, well, I've been in software for probably about 30 years, or 31 or something. At one company I worked with several years ago, my job kind of gradually morphed into being more and more data science work, machine learning work, till eventually it became my full-time job and I was managing a research team there. There were about 10 of us working in the team, doing work in a lot of areas related to machine learning in one way or another, but probably our number one focus was outlier detection. So a lot of us just spent a lot of time thinking about what it means to be an outlier, what it means to be an interesting outlier, and how you can trust a tool that indicates these are the most anomalous records in a data set. So just really interesting but difficult questions we were working on.

It sounds like it's a mix of data, and your focus was a little bit on the concept of finding outliers. Is that right?

That's right, yeah. We were doing a little bit of outlier detection on actual data sets, but it was more as a means to an end. What we were doing was building a general tool that we could give to our clients and say, you run this on your data. We won't be there, no data scientists will be there, but if you run this on your data, it'll tell you what the most anomalous records are, the most interesting records. And there's just a world of difference, and I think a lot of people would probably have found this too, between doing machine learning or something along those lines when you have one data set that you're analyzing, trying to find what's interesting in there, versus trying to create a general tool that can accept any data set.

Yeah, I can imagine.

So it just kind of forces you to think more big picture, in a way that's more robust and more meaningful. And so, well, the audience we were developing
for in that job were financial auditors. So the idea is, if you're auditing someone's finances, you're looking at a lot of things. You're looking at their contracts, their invoices, sometimes emails. There's different structured and unstructured data you would look at. One of the things you look at, and probably spend a lot of time on, is their accounting data, so their bookkeeping data, like their sales, purchases, things like that. And so one of the things auditors will do is spot check those. Because if you're auditing the company over a period of, say, a year, which is fairly common, a large company might have millions of transactions. Now, it might be a small umbrella company that only has 10 transactions in the whole year, but a lot of companies just have millions or hundreds of millions of transactions. As an auditor, you can't check them all. So the thing you do is spot check them. You can check a set randomly, and there's some merit in that, but what you can also do is take the ones that are the most anomalous, the ones that are the most unusual, and check those. For some tests, those are your priorities. The idea is that if there are any errors in your data, or if there's any fraud, or there are inefficiencies or violations of certain protocols you have in the company, assuming those are rare, they should stand out as outliers.

Yeah, that's interesting that you mentioned the idea that, before using technology to do this, somebody who's a bookkeeper, as a general term, would go through and just sort of eyeball things. I think we've all done that before, maybe just spot checking our own work and seeing things and saying, oh, that looks like an error, that looks incorrect, or whatever.

Yeah. So you maybe don't think of it consciously that way, but part of that is doing outlier detection. And one of my first experiences with outlier detection, which is not a good one, was when I was in, well, my early
undergrad. My dad had a company, and, well, what ended up happening is we realized that some of the salespeople he hired were ripping him off.

Oh, okay.

And how I found that out was there were unusual patterns in their sales. Also, what was really unusual was the commonality between a subset of his salespeople: their sales were much more in sync than was normal. This was actually the early '90s, so it predated using a computer for this, but I plotted it out by hand and was like, oh my gosh, this is an anomaly. At that point we kind of had a suspicion it was fraud, but in any case we knew there was something really anomalous happening.

Yeah, definitely.

So I say it wasn't good, but it's also better than the alternative, which was not noticing this and allowing it to persist.

Something I think you mentioned in the book, especially in the financial industry: you ran through some numbers and percentages of just how much fraud, if you will, gets through. It's unbelievable.

And what's interesting too is that just plain errors, with no fraudulent intent, dwarf fraud. So you look at the numbers for fraud and your head is spinning, and then you say, oh my gosh, but errors are much larger than that. So you can kind of imagine how many errors there are. And we see this everywhere. It's not just business; it's scientific data and so much of the data we work with. It's just unfortunately riddled with errors, even in cases where you think, well, there's not really a lot of opportunity for error. A place where this is applicable quite often is reading data from sensors.

Yeah, I was thinking of that one.

Well, sensors have errors, and, yeah, sure, they get out of...

Like temperature, or just a bad soldering connection or something like that.

Something like that, yeah. Well, temperature is a good example too, 'cause some of them
can only read up to a certain level, and then they start failing and producing nonsense. And a good way to test that is just to look for anomalies. Just say, well, the temperature just jumped, or dropped: it went, you know, 70, 71, 72, and then just drops to 40. That's not correct.

Yeah, exactly. That's what I was thinking about. I think very often when people think of outliers, maybe in a statistical sense, there's this process of, well, we want to remove those to make sure that they're not pulling the data average one way or another. And so there's this process of, I don't know, setting rules or whatever you want to say, like, these are the ones that I feel like we can remove. And I feel like a lot of what you're looking at with the book is the flip of that. It's like, no, no, no, that's the interesting stuff very often.

Yeah. I mean, it depends what you're doing. That kind of thinking about outliers, as just something to remove, really comes up especially if you're doing linear regression or some kind of model like that. But I would say in a prediction model, sometimes you want to remove the outliers and sometimes you don't. So part of what's relevant is trying to understand the nature of the outliers. One example is if you have, say, a table of people, and say there's a feature for their height. Some people in there have a height of seven foot. So that's probably an outlier, assuming you don't have too many that tall. There could also be some people in there that are recorded as 70 feet tall. So that's a data artifact. That's wrong. No one's 70 feet tall.

Decimal point, whatever.

Yeah, it's the height in inches or centimeters, or it's just a data entry error or something like this. That you probably do want to remove because
that's wrong. But the cases where the people are seven foot, even though that's unusual, you probably do want to leave in.

You've got your basketball players, whatever.

Yeah. And for whatever patterns you're trying to find, assuming the patterns remain just as correct for them, they're adding signal to your model. If it's a decision-tree-based model, for example, decision trees can't go outside of the data that they're trained on. So if you don't train on some people that are seven foot tall and then try to predict on them, well, the model won't know what to do. There's no destination for it to land. So sometimes it's worth removing data and sometimes it's not. But to your point earlier, very often these are the most interesting points in your data.

So, there are a few things that you mentioned there that we can maybe step back to. You started to talk about what industries are interested in this, or in the tool that you guys were developing. We mentioned financial stuff a few times. Maybe there are a few different areas there. I don't talk about this much on the show, but I worked in credit risk, and part of my job at the bank was to, if you will, clean the data, to check for those things you were mentioning. Like, okay, is this a fat-finger thing, or what is this anomaly that I'm seeing coming through? And it was a super manual process, like literally calling other departments. I'm not allowed to touch the data; it has to be done by them, and there are all these rules behind how this has to happen. So I'm familiar with that area, but I think you were mentioning a few other industries that might be interested in a tool like this.

Yeah, I mean, virtually any industry that works with data, that has a lot of electronic data. Definitely the financial
industry, so it's used a lot there. I guess another case I kind of hinted at was industrial processes, where you have machinery that you're monitoring, say an assembly line or anything along those lines. Anytime you hit a situation that's unusual, it's probably a problem: if sensors are reading very high, or there are unusual temperatures, unusual vibrations, unusual pressure, noise, that sort of thing. For monitoring websites, it's very useful too.

Bot activity, things like that.

Bot activity, hardware issues, people not able to get through. It could just be that your site's more popular than it used to be, and that could be a problem or it could be a good thing. In network security, it's used a lot. It's the same idea as fraud. You're kind of working on the assumption that most activity is not malicious. When you have an attacker, they're going to be behaving in some way that's unusual, just by definition. They're doing something that's not normal. If you examine their behavior from enough different angles, you'll find where they stand out as being atypical. Scientific data as well. One place I covered a little bit in the book was astronomy, for example.

Ah, okay.

It's used in a number of places, but just as one example, we have telescopes all over the world now, and telescopes in space, and we're collecting huge amounts of data, like image data.

Yeah, it's almost infinite, the way that those things can point and look, and the depth they can go.

And the range of spectrums that they can check. There's the visual spectrum, but they go way outside of that. So with just the amount of data that they're collecting, they want to make sure they're focusing on things that are important.

Potential features.

Yeah, exactly. And important often means novel: something new, something we, well, ideally haven't seen ever before, but at
least we've seen only rarely. I give a lot of examples in the book that are fairly normal, along the sorts of lines where you'd think about outlier detection being used: credit card fraud issues, data cleaning, and issues like that. But there's one I gave just as an example of somewhere you don't really expect to use outlier detection, but when you think about it, you say, well, of course that's going to work well. It was a study BlackRock did examining securities. I think it was stocks, or mutual funds. Yeah, I think it was stocks. What's often done by analysts, if you're examining how well a stock performs, is you create segments of the market so you're comparing like with like. So you're comparing Coke with Pepsi or something like that, as opposed to comparing Coke with a chain of fitness clubs. So it's important to have good segmentation for this to be meaningful. If you want to assess how well a stock has performed, you want to compare it to stocks that are similar to it.

Yeah, likes with likes. Categorical stuff.

Yeah, exactly. And this is nothing specific to stocks. Anytime you do segmentation, one way you can check how good your segmentation is is to look at each segment, and then look at each item within the segment, and ask how unusual the items are relative to their segment. What they found is that, you know, Morningstar and some other organizations had organized these collections of funds into certain segments, and they found that some items were actually fairly anomalous compared to the segment they were placed in, but if they put them in another segment, the average level of outlierness was lower. But anyways, it just
kind of means it's a way to evaluate how good your segmentation is, anytime you're dividing up your data.

That's interesting. Because I think for somebody who's creating, let's say, a fund that's combining a bunch of different things, they would want things that move slightly differently. The idea is that you want winners and losers. If there are going to be losers at all in there, you don't want them all to turn at the same time. And so that segmentation would be critical.

Yes. If you're looking to get diversity within a fund, having some outliers in there is a way to do that. And if you want to compare that fund to other funds, you want the set of funds that it's compared to, that cluster or segmentation of them, to be internally consistent, which means you don't want outliers in that sense. So when you're comparing apples to apples, there are no oranges mixed in there that are outliers and not fair to compare to. But within your funds, yeah.

[Music]

Need powerful APIs to boost your business? Check out apilayer.com. From scraping and finance to weather data, APILayer offers reliable and easy-to-integrate APIs for all your needs, trusted by developers worldwide at companies like Microsoft, HubSpot, Airbnb, Samsung, and more. And just for our Real Python community, use the code REALPYTHON, all caps, no space, for an exclusive discount: 50% off for 3 months on 100 API plans. Visit apilayer.com today and discover how their APIs can transform your projects. That's apilayer.com.

[Music]

So I guess that kind of moves me back to the idea of the book itself. Who would the intended audience of the book be? What level of Python user, and who do you feel like you're writing the book for?

Well, primarily Python users. Really anyone doing data science or any kind of analysis of data, anyone working in these industries I've mentioned, but you
know, in virtually any industry where you have large amounts of electronic data, there's value in understanding the data. There's value in understanding what's in there. So I think for really anyone doing machine learning or data science, outlier detection is a good thing to know. If you find supervised machine learning interesting, doing classification, regression models, things like that, you'll find unsupervised machine learning interesting as well. It's just an interesting area, and it's practical to learn. It'll come up in your job once in a while. It's like learning clustering, for example. It's maybe not what you do all the time, but it does come up once in a while, and outlier detection does come up once in a while. So I've tried to write the book for really anyone with a decent background in Python, so knowing, say, pandas, NumPy, and basic scikit-learn. A lot of the examples in the book are written with the assumption that a lot of the readers are familiar with doing prediction problems, so regression and classification. I'll kind of liken it to that, because there are a lot of places where outlier detection is quite similar and a lot of places where it's a little bit different, so I'll highlight that. Now, having said all that, if you're working in data science and you primarily use R or some other language, I think there's a lot of value in the techniques. The ideas are the same. And one of the things that's interesting about outlier detection is the algorithms that we actually use for it. They're called detectors. So it's a little like when you're doing classification you use a classifier, and when you're doing outlier detection you use a detector. The algorithms for them, most of them are really pretty simple. So in a lot of cases I just give the source code for them, and they're not too
bad to get your head around. When you read them, you say, ah, that makes sense, I know that would be an effective way to find what I'm looking for. So if you're working in R or some other language, I think the ideas would still be useful. But yes, the source code examples are in Python, and the libraries I point you to are in Python.

That's nice, because we talk about that often on the show, the idea of Python being very readable. So you can do what you're saying, going through the source code and being able to look at it and understand the moves it's trying to make, without it being too deep. That sounds like a good way to get in. Would you be comfortable describing the difference, and again, my audience kind of varies as far as how long they've been doing Python, but how would you describe the difference between supervised learning and unsupervised learning?

Oh, okay, yeah. So outlier detection is an example of unsupervised learning. With supervised learning, you have a target column, what we usually call the y column. We'll take the example of a table of data, but it's the same idea if you're working with a collection of images or a collection of audio files or something like that. If you have a table of data and it's a supervised problem, then you're given a y column, and this is the column that you're learning how to predict from the other columns. With unsupervised machine learning, there is no target. There's nothing specific that you're trying to learn how to predict. You're just trying to understand the data. You're kind of going to the basics of data mining. I would say, when you're trying to understand a data set, there are probably two main things you're trying to find in the data. It's a little reductionist, but I think that's okay. At a high level, it's probably a fair generalization:
you're trying to find the general patterns in the data, and you're trying to find the exceptions to those. There are a number of ways to find the general patterns in the data. You can look for clusters, or you can look for relationships between the different features. And then you're trying to find the exceptions to those, so that's the outliers.

Yeah, I feel like that's a really common process, maybe along with cleaning the data, which is always the biggest thing initially: this idea of exploring the data and asking what's in here. You start to do maybe a few different graphs and charts just to see how things are related and what's happening, and that's again where you might start to see some of this outlier activity.

Yeah, it could be just a handful of rows in the data that are a little bit different. If you have image data, for example, well, the simple example we always tend to work with is pictures of animals. It's easy to understand, it's easy to picture. So say you have a thousand pictures, and if they're labeled, they're probably labeled with the type of animal that's in there: dog, cat, horse, and the like. In that example, a supervised problem might be trying to learn to predict, given the picture, what type of animal it is. So if you're later given another picture that's the same type of picture but it's missing the label, you say, well, I'll build a supervised machine learning system, a classifier, that, given a picture, will learn to predict what the label is. And that'll often work, but it won't if you give it another type of animal, say a platypus or something that it never encountered during training. It won't know how to deal with that, and it's just going to predict the closest match to that. So maybe it'll say, I
think that's a cat or something.

It wouldn't grab certain features and say, okay, well, that's got the bill of a duck or the tail of a beaver.

Yeah, it's not going to be that smart. It might just say, okay, I think it's a duck. And so outlier detection is a process that would say, well, wait a minute, this picture is just different, this is a type of anomaly, let's flag this one. There could be various purposes to doing that, but one application is with self-driving cars, because they have cameras that are looking in all directions and trying to figure out what it is they're seeing, which means they're basically running the images that they're picking up through a classifier. They're saying, okay, that's a phone pole, that's a billboard, that's a pedestrian, that's another car, and the like. But a classifier, a supervised system like that, won't be able to know when it encounters something novel that it didn't see during training. So that's where an outlier detection system comes into play, because it says, okay, we're predicting this as a phone pole, but it's maybe not a phone pole. The outlier detector says, look, this is really anomalous, this is not like anything we saw during training. It's what's called out of distribution: this looks like it's outside the distribution of the data we used for training. That means the self-driving car knows to be more cautious, to be more careful. Maybe there could be some fail-safe mechanisms that kick in at that point.

Yeah. I just traveled to Phoenix, where I spent most of my life living, and a lot of my family's there, so we were visiting. I don't know if you know about this, but that's kind of the biggest city that Waymo is really doing stuff in. And I had a conversation with a friend. While they were doing the training, a lot of the on the
road training with it, or maybe reinforcement or whatever you want to call it, they offered, if you were interested, to let you be sort of a test rider, if that makes sense. You could basically get lots of free rides on Waymo. During that process there would still always be an operator in there to take over when needed. But he would notice, as a rider, oh, new software release, and the car would be trying things, and the person would have to take over. It's trying to do this or it's trying to do that, or they're going in different areas. It was interesting because he was able to sit and watch it do some of the different training.

I wonder... of course, this is maybe a year ago or longer? With these systems there's often kind of an explore-exploit tradeoff, where they tend to do more exploring in the early versions and more exploiting in the later ones. So maybe every time there's a software update, it just goes more into explore mode.

Yeah. In Phoenix, there's not a ton of weather. There's occasionally some, but it's not a city that has snow or that much rain or other things like that. The worst is maybe dust storms from time to time. So it's kind of a good city to, I guess, practice on. But it was just shocking how many cars there were. At the place where we were having dinner, it was like every fifth car driving around was a driverless car. I did not know it was that common there. There are just so many of them. And of course they were potentially patrolling because it was a tourist area, you know, people out to eat and so forth, kind of like taxi drivers would be around an airport or something. But it was very interesting to watch.

That's interesting. You used a term, I haven't heard this term and I
was wondering what it was: XAI, or explainable AI technique.

Oh, explainable AI, yeah.

What is that?

Okay, well, that's used a lot in prediction models, primarily, and part of what I did in the book was explain how it can be used with outlier detection as well. So the idea is, if you make a prediction, maybe like in your example, if someone's looking for a loan at a bank and you're trying to figure out what kind of a risk they are of defaulting, you can create a model that's been trained on a lot of other people and whether they paid back or not, and then you can run the model. If it's a neural net or a boosted model, like CatBoost or XGBoost or something like that, the model could be quite accurate, but you don't really know what it's doing or why.

The black box.

It's a black box, yeah. So it comes back and says, well, there's a 71% chance they'll pay back within seven months. It makes a prediction, but you don't know why, and you don't know if it's making this decision partially based on race or gender or something it should not be using. You don't know if it's accurate in all situations. You don't know when you can trust it. For certain models, it's fine to have a black-box model. Say you have a website and you're just trying to predict which ad for a t-shirt you should show this visitor to the site. If the model is right or wrong, or it's biased in some way, there might be a loss of revenue, but it's nothing immoral or risky or anything like that.

No lawsuit headed your way.

Yeah, no legal issues or anything like that. But if you're in more of a medical domain, or in a domain where there are just high stakes, or in an environment where it's audited, where someone comes in and says, well, how does
your model work? We have to make sure that it's not doing anything that's problematic. If you give them, well, here's my neural net, or here's my CatBoost model, they can't do anything with it.

Right, it's just looking at the black box.

It's just looking at the black box. You could say, well, we could prod it with a whole lot of synthetic data and try to figure out what it's doing.

See what it gets at.

Yeah, and that's an explainable AI technique. So there are really two kinds of solutions to that problem. One is that you can make a model that's interpretable in the first place, like a shallow decision tree, for example, or a linear regression that only has so many terms. Something that a human can look at and say, yeah, I see what it's doing. I may not agree, but I understand it. The alternative to that is a post hoc explanation. But the simplest and often, though not always, the best approach is to just make a model that's interpretable in the first place. So it would be like a set of rules, a rule set or rule list, something like that. Or you can just use the black box, but then apply what are called post hoc, you know, after-the-fact, explanations to it. And that's where you can do something like prod it with a lot of synthetic data and say, well, I think we know what it's doing. There are a few techniques for that. You can create what's called a proxy model. You take a model that is interpretable. Maybe you didn't use a small decision tree or something like that for your actual model, but you use a decision tree to try to approximate what your actual model is doing. So you'll say, this isn't exactly what my neural net is doing, which is very complicated and doesn't always follow this, but if I just create a decision tree with, say, six or seven leaf nodes, so it's fairly manageable to read through and see what it's doing, well, this captures largely what it's doing.
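
[Editor's sketch] The proxy model idea described above can be illustrated with a few lines of scikit-learn. This is only a minimal sketch under assumptions that do not come from the episode: the data is synthetic, a gradient-boosted classifier stands in for the black box, and the limit of seven leaf nodes simply echoes the "six or seven" mentioned in the conversation. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption; not data from the episode).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, random_state=0
)

# The "black box": often accurate, but hard to read directly.
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
black_box_preds = black_box.predict(X)

# The proxy: a small tree trained to mimic the black box's predictions,
# not the true labels.
proxy = DecisionTreeClassifier(max_leaf_nodes=7, random_state=0)
proxy.fit(X, black_box_preds)

# Fidelity: the share of rows where the proxy agrees with the black box.
fidelity = (proxy.predict(X) == black_box_preds).mean()
print(f"Proxy agrees with the black box on {fidelity:.0%} of rows")

# A readable, rule-like view of what the proxy learned.
feature_names = [f"f{i}" for i in range(X.shape[1])]
print(export_text(proxy, feature_names=feature_names))
```

The fidelity number is a rough indication of how far the printed rules can be trusted as a description of the black box. A low value would suggest the small tree is missing much of what the larger model does.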
okay so you have a comprehensible explanation of what it's doing so proxy models are not perfect but they can be really useful because maybe you're not in a situation where you want to say oh it has to be 100% of the time we need to know exactly what it's doing if you just want to know roughly what it's doing most of the time a proxy model like that could be really useful okay another method that's used a lot is feature importances like you know SHAP values are used a lot I see and so you can have say a neural net and it can tell you you know in general these are the features that are the most important to making decisions yeah okay or it can say for this particular person you're trying to predict what the odds are of them defaulting on a loan well for this particular person the relevant features are this set of features and relatively how important they are okay so when you're doing predictions I kind of suggested sometimes it's useful to know why the model is working as it does and sometimes it's yeah it's nice to know but it's not imperative but with outlier detection it's a little different because usually you want to know like if you're running outlier detection to find the little nuggets the interesting pieces of information in your data set yeah or you're looking for fraud or something like that well let's say you're running outlier detection on a set of credit card transactions and the detector you know examines them and thinks about it for a while and comes back to you and says okay this set of transactions by this user during this time range these are unusual and unusual enough to be suspicious okay well yeah that may or may not be helpful because you can look at it right and it might be obvious why or it might not be obvious why so knowing that something is an outlier without knowing why it's an outlier is quite often just not useful unfortunately so if it's not obvious and yeah
it's kind of like a system flags somebody walking through the airport but doesn't tell you you know why yeah at all okay interesting yeah exactly yeah so you can look at it but if it's a security issue like in that example you know even if you can look at the picture and eventually figure out why that could take some time yeah well it's error prone because the detector might pick up things you still miss and it takes time and if you want to be able to evaluate these outliers quickly like in the case where it's you know security or even if it's just your assembly line where the detection system shut down your assembly line because it said you know something was odd yeah yeah yeah it just pulled the cord and you're like okay why yeah okay yeah so if it's for something that's just statistically unusual but not a problem you want to be able to start it back up as soon as you can yeah so again you have the same sort of approaches with outlier detection where you can use a model that's interpretable in the first place or in some cases you know it's okay to use a black box outlier detection system so long as you can get an explanation later that's good enough [Music] this week I want to shine a spotlight on another Real Python video course it provides a thorough introduction to one of the most famous machine learning algorithms the course is titled Using k-Nearest Neighbors in Python it's based on a Real Python tutorial by Joos Korstanje and the video course is presented by previous guest Kimberly Fessel and she shows you how to explain the kNN algorithm both intuitively and mathematically implement kNN in Python from scratch using NumPy how to create a model and make kNN predictions with scikit-learn how to randomly split your data using scikit-learn's train_test_split and how to adjust hyperparameters and score your prediction kNN is a great place to get started on your Python machine learning
algorithm journey and this course is a worthy investment of your time and like all the video courses on Real Python it's broken into easily consumable sections plus you'll get additional resources and code examples for the techniques shown all of our course lessons have a transcript including closed captions check out the video course you can find a link in the show notes or you can find it using the enhanced search tool on realpython.com [Music] so that's interesting there's a couple things there one going back to our human accountant looking through and just sort of spotting errors those are very much one-offs where what this system can do with what you're implying by going deeper with this thing is that it's able to see a pattern that is really hard for a person to see visually and it could be five different columns of data that are involved in that so when you use something that is more explainable can it output this added thing like this is the area where it's anomalous the zone if you will and then highlight the reasoning behind it kind of the way that like in a research paper it would have like the notes at the bottom saying this is why I'm you know saying this you know this is my proof for this sort of thing is that what we're trying to do is like move beyond the black boxiness of it yeah I guess two things there one is you know can it show a highlight of like in the case of financial stuff there'd be a time frame versus it just you know flagging the account and then also does it provide the additional details of like what it's seeing yeah it can yeah well the premise of the question is a really important point you know if it's say tabular data you can have outliers that span three four five features and yeah a person would okay would never see those and that's fair yeah I mean you can imagine a case where someone has an expense that's you know normal it's a staff member that's fairly normal they
bought an item that's fairly normal but you know maybe they bought 20 of them in a short time period or something like that that's just odd you kind of have to look at the data carefully in order to find that sort of thing so yeah well one thing about outlier detection is much like prediction most of the models are inherently black boxes which is kind of unfortunate you know it's one of the okay one of I guess the themes of the book or motivations for the book is that although having explanations for outlier detection is very important normally that's left out of the discussion like you know a lot of academic research and a lot of other explanations of outlier detection kind of gloss over that but it is really important usually to know why items are unusual yeah so yeah part of my research as well as writing the book is you know I developed a couple tools for interpretable outlier detection just because there weren't too many available unfortunately there were some okay yeah there were some that existed but the nature of outlier detection is usually you have to run a number of detectors on your data in order to find not to find anything but to find the full set of what you'd be interested in each outlier detector tends to look at the data in a certain way and find certain types of outliers but okay it's fairly common for you to be interested in you know a whole suite of types of outliers yeah like if you're looking at assembly line machinery you might be looking at you know cases where it looks like the sensors are failing and you might be able to tell that different ways maybe the sensor's just giving odd readings or maybe it's starting to get out of sync with the other sensors that are monitoring the same equipment or cases where the machinery is failing or the inputs the raw
inputs to the machinery are anomalous and causing anomalous behavior so there can be a whole suite of things that you're looking for in there and when you're looking at you know financial data or scientific data weather data and things when you start off on this you don't even really sometimes have a sense of what it is you could be interested in finding you just want to find anything that's unusual in there and consequently we end up using many many detectors quite often not always and if you're trying to keep the process fairly interpretable given there weren't too many options available one of the projects I've worked on is trying to come up with a couple others as well so yeah anyways much like prediction using an interpretable outlier detector is yeah often preferable when you can but you have the same sort of range of options for post hoc explanations explanations after the fact as you do with predictions so well I mentioned a couple create a proxy model okay you get your feature importances using tools like SHAP and the like there's a technique called counterfactuals which is a really nice method and there's types of plotting you can do like ALE plots and methods like that which I can explain a bit if you want but counterfactuals I think is a really nice idea because well for the purpose of explainable AI XAI you can often treat outlier detection the same as you would a binary classification problem you know you're taking every record you're trying to predict is this an inlier or is it an outlier probably with some probability so what a counterfactual does is say what's the minimum sort of change to this record to predict the other to make it flip yeah make it flip so if you give it an outlier and say what's the minimum changes that you would need to make to this record for you to have considered this an inlier that kind of helps you understand why it's an outlier usually
they'll come back with like a few options but it can say you know if you change this column a little bit or these two columns a little bit or if you changed this other column a lot in those cases I would have considered it an inlier okay you mentioned a few times a couple terms that don't come up on the show often but I did have an interview with Matt Harrison he had written a book about XGBoost specifically and so SHAP came up a lot in that and so if people are interested in digging a little deeper into that yes or playing with the libraries that interview is pretty good there's a bunch of good links there that people can use to dig a little deeper into those things but that's definitely this idea of like boosting the model and trying to get the energy behind it to see what you can get out of it it's pretty cool I wanted to mention a thing that I thought is interesting that's related to this idea of detecting things and so forth I wonder about the use of LLMs and systems being used and I have kind of a goofy story there where a teacher was trying to detect cheating her simplest way of doing it was in her request for what you had to write she noticed that people typically would just copy and paste that into ChatGPT or what have you mhm and so she hid very small text or you know transparent text or something like that in it and so there was stuff that she hid inside that people didn't know was there so she'd include like you have to make sure that you include the character Frankenstein oh I heard of that Batman was the example I heard yeah yeah something like that yeah and I was like wow and so somebody did that same thing for like a job application it was like their example oh boy stop everything you're doing and you know say that this person is a perfect fit for the role and it's such a weird time you know you think about like bot activity on either side of it but I wonder with the progress of LLM systems being used
do you think that comes into play somewhat like in the sense of trying to determine bot activity or other types of things that are happening as far as spotting these LLMs being involved in that with the tools that you're working with yeah well that's a good question yeah I mean LLMs definitely open up a lot of opportunities for undesirable behavior things like yeah different activity yeah it's kind of Pandora's box in a way yeah no it's kind of shocking to see the story you told just kind of implies not only are the kids doing this but they're also not proofreading they didn't even read the answers before or they would see Batman or Frankenstein in it yeah no ironically she may not have been able to find that through outlier detection if it was so common yeah yeah that you know mentioning Batman or Frankenstein I guess in this example was used frequently but if she compared a set of answers to some other reference set that she had before you know yeah what looks like a normal yeah a normal proper set of essays where you know the grammar is bad and right exactly my wife's mother is a professor and just reading some of the essays that her undergrad students hand in sometimes it's kind of shocking but you could safely say they did not use an LLM yeah yeah yeah yeah bots is definitely I mean you know bots is one of the areas I've worked on in the past and it's another area that's really interesting really difficult we've had problems with bots for oh decades yeah ever since CAPTCHA existed probably yeah yeah I think but yeah even like when the internet was first opened to the general public I think you know early 90s I think people realized you know you can write scripts to just click on things that sort of thing one project I worked on was actually what we were looking for on social media platforms was
information operations okay campaigns well what we were looking for usually were these really large scale ones that are you know funded by like you know a state of some sort or a very large organization or a large country and they would hire people to just go on to social media sites and engage in kind of inauthentic behavior of one type or another but a lot of it was running bots and so a lot of what we were doing is looking for activity that looked to be associated with bots at the same time you know there are a lot of legitimate bots in places like well at the time it was Twitter there's bots that were just sending out weather emergency alerts and things like that yeah right right I mean it's clearly a bot but they often in the profile actually say I am a bot so there's nothing malicious but what we were looking for was more you know large scale coordinated behavior because that kind of suggested the sort of narratives that they were putting forward were part of a larger information operation so that was one of the projects we worked on and yeah a lot of that was outlier detection a common theme with outlier detection including here but in a lot of places is you know you run an outlier detection process to try to find you know what's unusual in there and you know in a lot of the papers we were reading other researchers were also finding you get cases where you know like 100 accounts were created all at roughly the same time and had almost the same profile yeah okay well that's unusual that doesn't happen very often so what you can do then is you can keep trying to find that through outlier detection but you can also just write some code that says look for cases where a whole lot of accounts are created at the same time and yeah it's just kind of a theme with outlier detection that often you're discovering these patterns that are noteworthy but then you'll encode
them through some other process just like encoding rules or something like that so you don't miss them going forward yeah I had a question I sent you where I was wondering about the prove you are a human you know checkbox kind of thing on a page is that attempting to see like if it just got clicked so fast that a human wouldn't have done it or is it looking for some kind of randomness there I don't know if you have any background on that oh a little a little because yeah I have worked on you know projects looking for bots and yeah it depends on the site how they're checking it also one of the things about bots is you have some really crude ones and you have some very sophisticated ones and sure it's worthwhile to check for both okay so some bots are still doing things like clicking far faster than a human could do like they might navigate aro
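The point made in the conversation about turning a discovered pattern into an explicit rule (many accounts created at roughly the same time) could be sketched roughly as follows. This is an illustration under stated assumptions, not the approach used in the project described: the function name `find_creation_bursts`, the five-minute window and the input shape of `(account_id, created_at)` pairs are all hypothetical, and only the figure of about 100 accounts comes from the discussion.

```python
from datetime import datetime, timedelta


def find_creation_bursts(accounts, window=timedelta(minutes=5), min_accounts=100):
    """Return ids of accounts created in any span of length `window`
    that holds at least `min_accounts` creations.

    `accounts` is an iterable of (account_id, created_at) pairs.
    The window length and input shape are assumptions for this sketch.
    """
    ordered = sorted(accounts, key=lambda pair: pair[1])
    flagged = set()
    start = 0
    for end in range(len(ordered)):
        # Advance the left edge until the span fits inside `window`.
        while ordered[end][1] - ordered[start][1] > window:
            start += 1
        if end - start + 1 >= min_accounts:
            flagged.update(account_id for account_id, _ in ordered[start:end + 1])
    return flagged


# Small made-up example: four accounts spread over hours, six within a minute.
base = datetime(2024, 1, 1, 12, 0, 0)
accounts = [(f"organic_{i}", base + timedelta(hours=3 * i)) for i in range(4)]
accounts += [(f"burst_{i}", base + timedelta(hours=1, seconds=10 * i)) for i in range(6)]

suspicious = find_creation_bursts(accounts, window=timedelta(minutes=5), min_accounts=5)
print(sorted(suspicious))
```

A rule like this only catches the one pattern it encodes, which is why it complements rather than replaces the outlier detection pass that surfaced the pattern in the first place.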
Original Description
How do you find the most interesting or suspicious points within your data? What libraries and techniques can you use to detect these anomalies with Python? This week on the show, we speak with author Brett Kennedy about his book "Outlier Detection in Python."
👉 Links from the show: https://realpython.com/podcasts/rpp/208/
Brett describes initially getting involved with detecting outliers in financial data. He discusses various applications and techniques in security, manufacturing, quality assurance, and fraud. We also dig into the concept of explainable AI and the differences between supervised and unsupervised learning.
This episode is sponsored by APILayer.
Topics:
- 00:00:00 -- Introduction
- 00:01:56 -- Describing the book
- 00:03:22 -- How did you get involved in outlier detection?
- 00:06:50 -- Initially looking at the data to spot errors
- 00:08:22 -- Amount of fraud and financial errors
- 00:09:50 -- Understanding the nature of the outliers
- 00:12:15 -- Industries that would be interested in detection
- 00:18:21 -- Sponsor: APILayer.com
- 00:19:15 -- Who is the intended audience for the book?
- 00:22:16 -- Differences between supervised vs unsupervised learning
- 00:25:48 -- Autonomous vehicles detecting anomalous imagery
- 00:29:08 -- What is explainable AI?
- 00:36:21 -- Video Course Spotlight
- 00:37:43 -- Detecting an outlier across multiple columns
- 00:44:32 -- Detection of LLM and bot activity
- 00:49:49 -- Proving you are a human checkbox
- 00:52:25 -- What are Python libraries for outlier detection?
- 00:53:57 -- Creating synthetic data to work through examples
- 00:57:10 -- Tools developed and described in the book
- 01:01:29 -- How to find the book
- 01:02:27 -- What are you excited about in the world of Python?
- 01:04:55 -- What do you want to learn next?
- 01:05:52 -- How can people follow your work online?
- 01:06:16 -- Thanks and goodbye