[MINI] Calculating Feature Importance

Data Skeptic · Beginner · 📐 ML Fundamentals · 9y ago

Skills: Supervised Learning 70% · ML Maths Basics 60%

Key Takeaways

The video discusses calculating feature importance for machine learning models created with the random forest algorithm, using techniques such as removing a feature and measuring the decrease in accuracy or Gini values.

Full Transcript

[Music]

Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Our topic for today is calculating feature importance.

So I want to do a quick correction before we jump into this episode. Shan law on Twitter reached out and said that apparently, in the last mini episode when we were talking about random forest, instead of saying that bagging meant bootstrap aggregation, which is the correct answer, I guess I said boosting aggregation. Boosting is something similar to bagging but very different, and bagging absolutely relies on the bootstrap. So thank you for that correction. I think we'll talk about the bootstrap in a future episode, specifically a mini, so we'll get into it then and what the difference is. But for this week I wanted to follow up a little bit on our random forest discussion from last time, Linda.

Oh yes, yes, we were talking about retail.

Yes, and do you recall that a random forest model is made up of lots and lots of smaller decision tree models, individual models?

Yep, yep. Trees make a forest, and you lump them into different forests and then they vote.

That's right. But now, what if you wanted to figure out which feature was the most important?

Well, I don't know. You're going to tell me, right?

That's the whole point of this episode. So I thought maybe for this week we could discuss a different data set, one that's often used with random forest, and that's the Titanic data set. I don't think we've ever mentioned that before, actually. Do you know what that is?

I'm assuming you mean the disaster, the ship that hit the iceberg, and tons of people died.

That's right, and some people survived. So there's some researchers that put together a data set. I think it's based on historical data. I guess it could be fake, but I'm pretty sure it's real, and it has different information about all the passengers. Let me tell you what it's got.

Can I just interject and say the Titanic movie clearly showed that people survived?

I'm glad we have James Cameron helping to educate. Yes, so we know for each passenger whether or not they survived. We know their class, because I guess there was first, second, and third class; presumably first class is the best. We know their name, which really probably shouldn't be useful at all. We're more likely to overfit the data if we use the name, but who knows, maybe they discriminated against certain people and we can figure out from their last name their nationality or something crazy like that. But most likely name is not useful. We know the sex of the person, we know their age, and for each family we know the number of parents and children on board. If you had to hypothesize a couple of different ways those variables might get used, how do you think, for example, the number of siblings might get used?

I figure if you're a family with a lot of kids, they might have let you on boats earlier. I think if you're a parent and the ship's going down, you're going to make that... I mean, I don't know about you, but I feel like I would knock my kid out if they were not willing to come.

Oh, so that's interesting. Yeah, one of the things you might look at is whether or not having many children affected an adult's likelihood of survival.

Yeah, that's true.

Machine learning algorithms would do that sort of analysis for you. If that was in the data, it would naturally emerge, but it's still interesting to ponder what some of the models might pick up on. We also know the ticket number of the person, which you'd think would be useless, but maybe people who bought their tickets early have an advantage or something like that, so that could be interesting. And then we know the fare, how much they paid for it, as well as the type of cabin they had and which port they boarded the ship at. Thinking about all those features, if you know those about every passenger, which ones do you think are the most informative, if you had to just hypothesize?

I guess age.

Why is that?

Well, because I think if you're elderly you probably wouldn't have made it out in time, and then when you hit the cold water, unclear. And then, I mean, there are stories, per the Titanic movie, that suggest that old people gave up their lives so that younger people could live.

Actually, that's a really interesting case. So you kind of brought up two reasons. You said maybe an elderly person gave up their spot, and maybe a different elderly person was not able to get on the boats or survive being in the water. So those are two different reasons why that person might not have survived. They reach the same conclusion, but for different reasons, and that's the type of thing random forest and other ensembling-type models are good for. One tree, or maybe a set of trees, might capture this altruistic behavior somehow, and then a different set of trees might capture the, you know, sort of infirm, feeble kind of thing you're talking about as well. What about gender? Do you think that is a strongly predictive feature?

Well, I think women and men are both strong and have lots of endurance, so no.

Well, so this is not a commentary on the capabilities of anyone. It's just a model that tries to describe the reality of what actually happened. So I actually believe gender would have a big role in this, because I've heard this expression that they used: they did women and children first onto the boats.

Mm-hmm.

I don't know if that's true, but if it is, then you'd expect women to have a higher chance of survival, right?

Right, yeah, and children too.

So then age would correlate there. All the trees would look over all these variables and come up with an ensemble model. While we're talking about it, what about the passenger fare? Do you think the price someone paid for their ticket is going to have an effect?

So, I mean, that really depends on the map of the Titanic and where the high-class fares are and the cheap ones, and then also the impact on the ship. Like, when it was sinking, did it sink from the back, from the bottom, from the middle, and which cabins were impacted? Because I think it was the middle of the night, so people were most likely sleeping. So the people who got hit first probably didn't survive, and then the ones who were on the second tier, third tier, whatever the order in which it was sinking, may have survived more.

Yeah, totally. Now, do you think that's perfectly predictive? Did everyone on the lower deck certainly perish?

I mean, I don't think so. I think the iceberg was like a slow thing. Like, they had like 4 hours. I mean, I don't know the exact time.

Sure.

But it was a big ship, so it didn't go down immediately.

Well, even if it did, there's no guarantee that just because you were on the lower deck you were in your room at the time, right? Maybe someone was up late at a party or something, and that might increase their chances of survival. Although being up late at a party is not one of the features we have available, so our model won't be able to capture that. But once you've trained your model and you've looked at its diagnostics, its accuracy and all that sort of stuff, it might be good to try and inspect that model a little bit. Say, how is this working? Which features are important? You want to have some interpretability of your model.

When you say interpretability, what does that mean?

Let's say you spend a good deal of time building a nice model, and then you go to some conference or something and say, hey, I have a model that explains why people did or did not survive the Titanic disaster. And someone will come up to you and say, well, how does it work? And if you say, well, I used random forest, they might say, I don't know what that is, or they might say, well, that's nice, but how does your model actually work? Not what algorithm did you train it with, but what is it doing when you feed it new inputs? How does it convolute those together into an output? But now we have a lot of trees, right? A forest of them. Do you think they all considered gender, or maybe were some making good predictions without considering the gender?

I think you can make good predictions without the gender, i.e., the sex.

Yeah, yeah, exactly. I would say that we probably have a ton of models all taking a different approach. Maybe some focus on the fare, some focus on the cabin, some focus on the combination of the sex and the age. We'll probably come up with a forest of very diverse models that all look at these features in different ways. For example, maybe one model determines that if there's two parents on board and a young child, that child survives better than if there was only one parent on board with a small child. That could be one corner case that a few of the models are able to process really well. So then, if you said, well, how important is it to know the age, if every model uses age differently or not at all, how do you decide if it's important or not?

Would you say which one was most statistically significant?

It would be great to have a notion of statistical significance here, but there's no formal statistic involved. We kind of have to come up with a method. Oh, actually, here's a good corollary. I don't know, what's a political topic going on right now? We're going to be voting on the president soon, right?

Right.

So what are some of the issues in the election?

Republican or Democrat.

Yes, that's good. Can we drill down? What about environment or something?

Taxes. It's always a big deal.

Okay. What about net neutrality? Do you think that's important in the election?

No, not at all.

That's an important issue to me.

Not important.

Well, I'll agree it's less important than taxes, but how do we know that?

First of all, there's the media. Is the media talking about it? But maybe if the media talks about it, they make it an issue, so it's unclear there if it's important just because the media is talking about it. I mean, obviously you can make up a scenario where you go, oh, if Hillary believed XYZ, would you vote for her?

Ah, that's really interesting right there. Yeah, make up these scenarios. Say, what if the candidate had a different position? What if Hillary flip-flopped her position on net neutrality? How many votes would that change? Would it sway the election? Net neutrality probably has some; taxes probably has a lot, for exactly your reason. If one or both of the candidates switched their position, it would change a lot of people's votes and therefore affect the outcome. So maybe we can take that same idea into machine learning. We built a nice model predicting the Titanic that uses age. What if we deleted age? What if we didn't have it? How would the model change? Could we still build as good of a model? What do you think?

No.

How much worse do you think it would be, not considering the age?

50% worse.

50% worse, okay, good. Even though that's a very arbitrary number, that's a number of large magnitude, so good instincts. What if, instead of withholding the age, I withheld their ticket number? How much do you think that would affect the model you build?

Here's the thing: I don't know if the ticket number correlates with their class.

Oh, interesting.

If it correlates with their class, then it does. If it doesn't, if they just gave out tickets at random, or the numbers are just in the order in which people bought them, then maybe it doesn't matter.

Yeah, my instinct would be to say it doesn't matter, or if it matters, it probably matters very little. Because yes, that order might encode some other information; it might, you know, correlate with another value you could measure. But yeah, in general the ticket number should be less important, or that's what we expect. So when we build these models, a lot of times we like to have a feature importance, for a couple of reasons. One is to kind of validate your model. If you went and built a model to study the Titanic and you came back to me and said, Kyle, my most important feature is age, I'd be like, okay, yeah, that's plausible. If you came back and said the most important feature is the name, I'd be like, well, you've either overfit or you've done something wrong, because name can't be the most important. It doesn't make sense. But even besides that, if you built a model where you didn't have those sorts of intuitions, it's still important to know what features the model is finding most useful, because perhaps you want to focus on ways to better measure that feature in the future. Gender you can't measure any more perfectly, unless there was some human transcription error. But if you were measuring, like, I don't know, how healthy the person is, and you had just healthy and unhealthy, and that was really useful, maybe you want to break that down into more categories or something like that. So how do you do this feature importance? Well, you do it analogously to the way you kind of invented, about changing Hillary's vote or changing Trump's vote or something like that. Take a feature out and see whether it hurts the accuracy, or you can take a feature out and see how it affects the Gini coefficients of the trees.

So tell me, is the Gini coefficient related to a genie in a bottle?

Well, a good question, seeing as how I haven't introduced the Gini coefficient, but that we will leave as a topic for future episodes, because it's a good topic in and of itself.

That's what Kyle says when he doesn't know what it means.

I know what it means.

Ah, he doesn't. He's going to wait for someone to tweet him what it is.

It's a bit like entropy, which we'll also talk about. So we came up with a couple of future topics: Gini coefficients, boosting, and entropy, which might be the same as Gini. So stay tuned for those in the future, and thanks as always for joining me, Linda.

Thank you.

So if anyone listening is not on our Data Skeptic Slack channel and you'd like to join, I'd encourage it. Shoot me an email at kyle@dataskeptic.com and I'll send you over an invite. We're not just using it for the OpenHouse project anymore. Of course that's still there, and we're glad to have anyone who wants to volunteer as that expands, but there's going to be some new announcements in the near future about other stuff that's going to be going on in our Slack channel. So email me to get signed up, and until next time, I just want to remind everyone to keep thinking skeptically of and with data.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.

Original Description

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.
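The episode leaves the definition of "Gini values" for a later show. As a rough illustration, and assuming the term refers to the Gini impurity commonly used to score splits in decision trees (one minus the sum of squared class proportions), a minimal sketch might look like the following. The function names and the toy labels are made up for this example and are not taken from the episode or from any real Titanic data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a group of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_decrease(parent, left, right):
    """How much a split lowers impurity, weighting each child by its size."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical toy labels (1 = survived, 0 = did not); not real data.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]   # one side of an imagined split
right = [0, 0, 0, 0]  # the other side, perfectly pure

print(gini_impurity(parent))
print(gini_decrease(parent, left, right))
```

A pure group scores 0 and a 50/50 group of two classes scores 0.5, so a feature whose splits produce larger decreases is, under this assumption, one the trees find more useful.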

This video teaches how to calculate feature importance for random forest models by removing features and measuring accuracy or Gini value changes, which is crucial for understanding model behavior.

Key Takeaways

Train a random forest model on all features and record its accuracy
Remove one feature and rebuild the model without it
Measure the decrease in accuracy or the change in Gini values
Repeat the process for each feature
Compare the decreases: the larger the drop, the more important the feature
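The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: to keep it dependency-free, a tiny lookup-table classifier (`fit_lookup`, a hypothetical helper) stands in for the random forest, and the rows are invented toy data rather than the real Titanic data set. Any training function with the same shape, including one that trains a forest, could be passed as `fit`.

```python
from collections import Counter, defaultdict


def fit_lookup(rows, labels):
    """Stand-in model (NOT a random forest): predicts the majority label
    seen for each distinct feature combination, else the overall majority."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row)][label] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in groups.items()}
    return lambda row: table.get(tuple(row), default)


def accuracy(model, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def drop_column_importance(fit, train, y_train, test, y_test, names):
    """Rebuild the model without each feature in turn and report
    how far test accuracy falls compared with the full model."""
    base = accuracy(fit(train, y_train), test, y_test)
    importance = {}
    for i, name in enumerate(names):
        without = lambda rows, i=i: [r[:i] + r[i + 1:] for r in rows]
        model = fit(without(train), y_train)
        importance[name] = base - accuracy(model, without(test), y_test)
    return base, importance


# Invented toy rows: (sex, ticket_even). Survival here depends only on sex.
names = ["sex", "ticket_even"]
train = [("f", 0), ("f", 1), ("f", 0), ("m", 0),
         ("m", 1), ("m", 1), ("m", 1), ("m", 1)]
y_train = [1, 1, 1, 0, 0, 0, 0, 0]
test = [("f", 0), ("f", 1), ("m", 0), ("m", 1)]
y_test = [1, 1, 0, 0]

base, importance = drop_column_importance(
    fit_lookup, train, y_train, test, y_test, names
)
print(base, importance)
```

On this toy data, removing `sex` costs accuracy while removing `ticket_even` does not, which mirrors the intuition discussed in the episode. Measuring the drop on held-out rows rather than the training rows is a design choice made here, not something the episode specifies.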

💡 Removing a feature and measuring the decrease in accuracy or Gini values can provide valuable insights into feature importance for random forest models.
