[MINI] Calculating Feature Importance
Key Takeaways
The video discusses calculating feature importance for machine learning models created with the random forest algorithm, using techniques such as removing a feature and measuring the decrease in accuracy or Gini values.
Full Transcript
[Music] Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Our topic for today is calculating feature importance. [Music]
So I want to do a quick correction before we jump into this episode. Shan law on Twitter reached out and said that apparently in the last mini episode, when we were talking about random forest, instead of saying that bagging meant bootstrap aggregation, which is the correct answer, I guess I said boosting aggregation. And boosting is something similar to bagging but very different, and bagging absolutely relies on the bootstrap. So thank you for that correction. I think we'll talk about the bootstrap in a future episode, specifically a mini, so we'll get into it then and what the difference is. But for this week I wanted to follow up a little bit on our random forest discussion from last time, Linda.
Oh yes, yes, we were talking about retail.
Yes. And do you recall that a random forest model is made up of lots and lots of smaller decision tree models, individual models?
Yep, yep. Trees make a forest, and you lump them into different forests, and then they vote.
That's right. But now what if you wanted to figure out which feature was the most important?
Well, I don't know. You're going to tell me, right? That's the whole point of this episode.
So I thought maybe for this week we could discuss a different data set, one that's often used with random forest, and that's the Titanic data set. I don't think we've ever mentioned that before, actually. Do you know what that is?
I'm assuming you mean the ship disaster that hit the iceberg, and tons of people died.
That's right, and some people survived. So there's some researchers that put together a data set. I think it's based on historical data. I guess it could be fake, but I'm pretty sure it's real. And it has different information about all the passengers. Let me tell you what it's got.
Can I just interject and say the Titanic movie clearly showed that people survived?
I'm glad we have James Cameron helping to educate. Yes, so we know for each passenger whether or not they survived. We know their class, cuz I guess there was first, second, and third class; presumably first class is the best. We know their name, which really probably shouldn't be useful at all. We're more likely to overfit the data if we use the name, but who knows, maybe, I don't know, they discriminated against certain people and we can figure out from their last name their nationality or something crazy like that. But most likely name is not useful. We know the sex of the person, we know their age, and for each family we know the number of parents and children on board. If you had to hypothesize a couple of different ways those variables might get used, how do you think, for example, the number of siblings might get used?
I figure if you're a family with a lot of kids, they might have let you on boats earlier. I think if you're a parent and the ship's going down, you're going to make that, I mean, I don't know about you, but I feel like I would knock my kid out if they were not willing to come.
Oh, so that's interesting. Yeah, like one of the things you might look at is whether or not having many children affected an adult's likelihood of survival.
Yeah, that's true. Machine learning algorithms would do that sort of analysis for you.
It would. If that was in the data, it would naturally emerge, but it's still interesting to ponder what some of the models might pick up on. We also know the ticket number of the person, which you'd think would be useless, but maybe people who bought their tickets early have an advantage or something like that, so that could be interesting. And then we know the fare price, how much they paid for it, as well as the type of cabin they have and where they got on, which port they boarded the ship at. Thinking about all those features, if you know those about every passenger, which ones do you think are the most informative, if you had to just hypothesize?
I guess age.
Why is that?
Well, cuz I think if you're elderly you probably wouldn't have made it out in time, and then when you hit the cold water, unclear. And then, I mean, there are stories, per the Titanic movie, that suggest that old people gave up their lives so that younger people could live.
Actually, that's a really interesting case. So you kind of brought up two reasons. You said maybe an elderly person gave up their spot, and maybe a different elderly person was not able to get on the boats or survive being in the water. So those are two different reasons why that person might not have survived. They reach the same conclusion but for different reasons, and that's the type of thing random forest and other ensembling-type models are good for. One, or maybe a set of trees, might capture this altruistic behavior somehow, and then a different set of trees might capture the, you know, sort of enfeebled kind of thing you're talking about as well. What about gender? Do you think that is a strongly predictive feature?
Well, I think women and men are both strong and have lots of endurance, so no.
Well, so this is not a commentary on the capabilities of anyone. It's a model that tries to describe the reality of what actually happened at the thing. So I actually believe gender would have a big role in this, because I've heard this expression that they used: they did women and children first onto the boats.
Mhm.
I don't know if that's true, but if it is, then you'd expect women have a higher chance of survival.
Right, right. Yeah, and children too, so then age would correlate there.
All the trees would look over all these variables and come up with an ensemble model. While we're talking about it, what about the passenger fare? Do you think the price someone paid for their ticket's going to have an effect?
So, I mean, that really depends on the map of the Titanic and where the high-class fares are and the cheap ones, and then also the impact of the ship. Like, when it was sinking, did it sink from the back, from the bottom, from the middle? And which cabins were impacted? Cuz I think it was the middle of the night, so people are most likely sleeping. So the people who got hit first probably didn't survive, and then the ones who were on the second tier, third tier, whatever the order in which it was sinking, may have survived more.
Yeah, totally. Now, do you think that's perfectly predictive? Did everyone on the lower deck certainly perish?
I mean, I don't think so. I think the iceberg was like a slow thing. Like, they had like 4 hours. I mean, I don't know the exact time.
Sure.
But it was a big ship, so it didn't go down immediately.
Well, even if it did, there's no guarantee that just because you were on the lower deck you were in your room at the time, right? Maybe someone was up late at a party or something, and that might increase their chances of survival. Although being up late at a party is not one of the features we have available, so our model won't be able to capture that. But once you've trained your model and you've looked at its diagnostics, its accuracy and all that sort of stuff, it might be good to try and inspect that model a little bit. Say, how is this working? Which features are important? You want to have some interpretability of your model.
When you mean interpretability, what does that mean?
Let's say you spend a good deal of time building a nice model, and then you go to some conference or something and say, hey, I'm good at predicting, or I have a model that explains why people did or did not survive the Titanic disaster. And someone will come up to you and say, well, how does it work? And if you say, well, I used random forest, they might say, I don't know what that is, or they might say, well, that's nice, but how does your model actually work? Not what algorithm did you train it with, but what is it doing when you feed it new inputs? How does it convolute those together into an output? But now we have a lot of trees, right, a forest of them. Do you think they all considered gender, or maybe were some making good predictions without considering the gender?
I think you can make good predictions without the gender, i.e. the sex.
Yeah, yeah, exactly. I would say that we probably have a ton of models all taking a different approach. Maybe some focus on the fare, some focus on the cabin, some focus on the combination of the sex and the age. We'll probably come up with a forest of very diverse models that all look at these features in different ways. For example, maybe one model determines that if there's two parents on board and a young child, that it survives better than if there was only one parent on board with a small child. That could be like one corner case that a few of the models are able to process really well. So then if you said, well, how important is it to know the age, if every model uses age differently or not at all, how do you decide if it's important or not?
Would you say which one was most statistically significant?
It would be great to have a notion of statistical significance here, but there's no formal statistic involved. We kind of have to come up with a method. Oh, actually, here's a good Cory. I don't know, what's a political topic going on right now? We're going to be voting on the president soon, right?
Right.
So what are some of the issues in the election?
Republican or Democrat.
Yes, that's good. Can we drill down? What about environment or something?
Taxes. It's always a big deal.
Okay, what about net neutrality? Do you think that's important in the election?
No, not at all.
That's an important issue to me.
Not important.
Well, I'll agree it's less important than taxes, but how do we know that?
First of all, there's the media. Is the media talking about it? But maybe if the media talks about it, they make it an issue, so it's unclear there if it's important just because the media is talking about it. I mean, obviously you can make up a scenario where you go, oh, if Hillary believed XYZ, would you vote for her?
Ah, that's really interesting right there. Yeah, make up these scenarios. Say, what if the candidate had a different position? What if Hillary flip-flopped her position on net neutrality? How many votes would that change? Would it sway the election? Net neutrality probably has some; taxes probably has a lot, for exactly your reason. If one or both of the candidates switch their position, it would change a lot of people's votes and therefore affect the outcome. So maybe we can take that same idea into machine learning. We built a nice model predicting the Titanic that uses age. What if we deleted age? What if we didn't have it? How would the model change? Could we still build as good of a model? What do you think?
No.
How much worse do you think it would be, not considering the age?
50% worse.
50% worse, okay, good. Even though that's a very arbitrary number, that's a number of large magnitude, so good instincts. What if, instead of withholding the age, I withheld their ticket number? How much do you think that would affect the model you build?
Here's the thing: I don't know if the ticket number correlates with their class.
Oh, interesting.
If it correlates with their class, then it does. If it doesn't, they just gave out tickets at random, the numbers are random, in the order in which people bought them, then maybe it doesn't matter.
Yeah, my instinct would be to say it doesn't matter, or if it matters, it probably matters very little. Cuz yes, that order might encode some other information; it might, you know, correlate with another value you could measure. But yeah, in general the ticket number should be less important, or that's what we expect. So when we build these models, a lot of times we like to have a feature importance for a couple of reasons. One is to kind of validate your model. If you went and built a model to study the Titanic and you came back to me and said, Kyle, my most important feature is age, I'd be like, okay, yeah, that's plausible. If you came back and said the most important feature is the name, I'd be like, well, you've either overfit or you've done something wrong, cuz name can't be the most important. It doesn't make sense. But even besides that, if you built a model that didn't have those sorts of intuitions, it's still important to know what are the features that the model is finding most useful, cuz perhaps you want to focus on ways to better measure that feature in the future. Gender you can't measure any more perfectly, unless there was some human transcription error. But if you were measuring, like, I don't know, how healthy the person is, and you had just healthy/unhealthy, and that was really useful, maybe you want to break that down into more categories or something like that. So how do you do this feature importance? Well, you do it analogously to the way you kind of invented about changing Hillary's vote or changing Trump's vote or something like that. Take a feature out, and does it hurt the accuracy? Or you can take a feature out and see how it affects the Gini coefficients of the trees.
So tell me, is the Gini coefficient related to a genie in a bottle?
Well, a good question, seeing as how I haven't introduced the Gini coefficient, but that we will leave as a topic for future episodes, because it's a good topic in and of itself.
That's what Kyle says when he doesn't know what it means.
I know what it means.
He doesn't. He's going to wait for someone to tweet him what it is.
It's a bit like entropy, which we'll also talk about. So we came up with a couple future topics: Gini coefficients, boosting, and entropy, which might be the same as Gini. So stay tuned for those in the future, and thanks as always for joining me, Linda.
Thank you.
So if anyone listening is not on our Data Skeptic Slack channel and you'd like to join, I'd encourage it. Shoot me an email to kyle@dataskeptic.com and I'll send you over an invite. We're not just using it for the OpenHouse project anymore. Of course that's still there, and we're glad to have anyone who wants to volunteer as that expands, but there's going to be some new announcements in the near future about other stuff that's going to be going on in our Slack channel. So email me to get signed up, and until next time, I just want to remind everyone to keep thinking skeptically of and with data.
For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.
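The remove-a-feature idea from the conversation can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: the rows are hypothetical toy data generated by a made-up rule (not the real Titanic data set), a simple majority-vote lookup table stands in for a random forest, accuracy is measured on the training rows for brevity, and every function name is invented for this example.

```python
from collections import Counter, defaultdict
from itertools import product

FEATURES = ["sex", "age", "ticket"]


def make_toy_rows():
    """Hypothetical toy rows (NOT the real Titanic data).

    Made-up rule: a passenger survives if female or a child.
    The ticket value is pure noise with respect to the label.
    """
    rows = []
    for sex, age, ticket in product(["f", "m"], ["child", "adult"], ["odd", "even"]):
        survived = 1 if (sex == "f" or age == "child") else 0
        rows.append(({"sex": sex, "age": age, "ticket": ticket}, survived))
    return rows


def fit(rows, features):
    """Stand-in model: majority label for each combination of feature values."""
    votes = defaultdict(Counter)
    for x, y in rows:
        votes[tuple(x[f] for f in features)][y] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def accuracy(model, rows, features):
    """Fraction of rows whose label matches the model's prediction."""
    hits = sum(model[tuple(x[f] for f in features)] == y for x, y in rows)
    return hits / len(rows)


def drop_column_importance(rows, features):
    """Retrain without each feature in turn and record the accuracy drop."""
    baseline = accuracy(fit(rows, features), rows, features)
    importance = {}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        importance[dropped] = baseline - accuracy(fit(rows, kept), rows, kept)
    return importance


rows = make_toy_rows()
importance = drop_column_importance(rows, FEATURES)
for name, drop in importance.items():
    print(f"{name}: accuracy drop {drop:.2f}")
```

On this toy data, removing ticket costs nothing, while removing sex or age each lowers accuracy from 1.00 to 0.75, which matches the intuition in the episode that an uninformative feature should barely matter. With a real forest one would retrain and score on held-out data rather than the training rows.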
Original Description
For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.
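The Gini-based variant mentioned in the description can be sketched in a similar way. In the decision-tree setting the quantity is usually called Gini impurity: one minus the sum of squared class proportions in a node. The sketch below scores a single split on hypothetical toy rows (not real Titanic data), and the function names are assumptions made for illustration; a random forest implementation would typically accumulate such decreases over every split in every tree.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(rows, feature):
    """Drop in Gini impurity from splitting the rows on one categorical feature.

    The child impurities are weighted by the share of rows in each child.
    """
    labels = [y for _, y in rows]
    groups = {}
    for x, y in rows:
        groups.setdefault(x[feature], []).append(y)
    weighted = sum(len(g) / len(rows) * gini_impurity(g) for g in groups.values())
    return gini_impurity(labels) - weighted


# Hypothetical toy rows: survival follows sex exactly, ticket is noise.
toy_rows = [
    ({"sex": "f", "ticket": "odd"}, 1),
    ({"sex": "f", "ticket": "even"}, 1),
    ({"sex": "m", "ticket": "odd"}, 0),
    ({"sex": "m", "ticket": "even"}, 0),
]

for feature in ["sex", "ticket"]:
    print(feature, impurity_decrease(toy_rows, feature))
```

Here the split on sex removes all of the impurity (a decrease of 0.5 from a parent impurity of 0.5), while the split on ticket removes none, so a Gini-based ranking would place sex above ticket on this toy data.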