Income Prediction Machine Learning Project in Python

NeuralNine · Intermediate · 📐 ML Fundamentals · 2y ago

Skills: Supervised Learning 90% · ML Pipelines 80% · ML Maths Basics 70%

Key Takeaways

This video walks through a machine learning project in Python that predicts income using the Adult Income dataset from Kaggle. It covers data exploration, preprocessing with one-hot and binary encoding, training a Random Forest classifier, and hyperparameter tuning with GridSearchCV.
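The encoding step named in this summary can be sketched on a tiny made-up frame. This is only an illustration under stated assumptions: the rows are invented, only a few columns are shown, and the `prefix=` argument of `pd.get_dummies` is used here, whereas the transcript chains `add_prefix` onto the result. Both give prefixed column names.

```python
import pandas as pd

# Tiny invented frame that mimics a few columns of the Adult Income data.
df = pd.DataFrame({
    "age": [39, 50, 28, 44],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "gender": ["Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# One-hot encode each multi-valued column. The prefix keeps the two "?"
# categories apart, since they would otherwise share one column name.
for column in ["workclass", "occupation"]:
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    df = pd.concat([df.drop(column, axis=1), dummies], axis=1)

# Two-valued columns become a single 0/1 column instead of two dummies.
df["gender"] = (df["gender"] == "Male").astype(int)
df["income"] = (df["income"] == ">50K").astype(int)

print(df.columns.tolist())
```

Note that the income comparison has to match the label exactly (`">50K"`); a slightly different string would silently produce a column of zeros, which is the bug the transcript runs into later.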

Full Transcript

What is going on, guys, welcome back. In this video today we're going to go through the full data science and machine learning process to build a machine learning model that predicts people's income. This is going to be a perfect exercise for machine learning beginners and for people who are somewhat intermediate. We're going to take a data set, explore it, pre-process the data, train a model on it, do some hyperparameter tuning, and in the end evaluate and interpret the results. It will require you to have some basic skills already: you should be comfortable working with Matplotlib and pandas, and you should understand basic ideas of machine learning like classification, train-test splits, evaluating models and stuff like that. But this is going to be a great exercise, so let us get right into it. [Music]

All right, so the first thing we're going to do in this video today is download a data set that we can use to train and evaluate our machine learning model. In this case we're going to go with the Adult Income data set; you will find a link to this Kaggle page in the description down below. You can see it says here that this is a good starter example for data pre-processing and machine learning practices, which is exactly what we're looking for. When we scroll down here to the data set you can see why this is the case. We can display all the columns here, and you will see that we do have some numerical features, like the age, that we can use right away, or for example hours per week. But most of the features, as you can see, are categorical features that require some pre-processing: marital status, occupation, relationship, race, gender and stuff like that. Even our target attribute, the income, is
categorical, because it's split up into less than or equal to 50K and above 50K, so this is actually a binary classification task that we're dealing with here. To download this data set you just have to click on this button, and this will download an archive.zip file, and inside of that you will find the data set as a CSV file. I'm going to extract it here in the working directory and rename it to income.csv so that we know what this is actually about.

Then I'm going to start a new IPython notebook, a new Jupyter notebook. You can of course also work in an ordinary Python script, but as always when we do some data science or machine learning work, it makes sense to work with notebooks, because you can run individual cells: you don't have to retrain the model every time you run the script or change some slight thing after the model training. So I would recommend working with a Jupyter notebook. If you have trouble setting up Jupyter Notebook or JupyterLab, which I'm using here, you can check out the tutorials on my channel, but it is not too difficult. So I'm going to start a new Jupyter notebook here and rename it to main.ipynb.

For this video we're going to need some external Python packages, probably well known to those of you who have been doing some data science or machine learning already. We're going to install them using pip: we're going to need pandas, Matplotlib, Seaborn, which is also for visualization, and scikit-learn. I'm not sure if we're going to need it, but we're also going to install NumPy. Let me just briefly look through my prepared code to check if that's everything; I think it is. Those are very basic data science and machine learning packages in Python. In my case all of them are installed already. Then we're going to import, to begin with, pandas as
pd, and maybe I should zoom in a little bit here. So import pandas as pd, and then we're going to load the CSV file into a data frame: we say df equals pd.read_csv and load the income.csv file. This is what it looks like. We have age, workclass, I don't know what that is exactly, hopefully it is described here; no, I don't think so, but we're not going to use that feature, I don't think it is too relevant. We're going to use the education. Now the educational number is not really a numerical feature; it's a categorical feature encoded as a number. So this is not the same as an age or an hours-per-week feature; it is just a numerical label for these classes, for these categories. Then we also have marital status, occupation, relationship, race, gender, capital gain, capital loss, native country, and the target attribute, the income, which is also categorical, as we already discussed.

So our goal here will be to encode all these features in a way that is useful for our models. Instead of having education with these individual text classes, and also instead of having them represented here by numbers, we can encode this using one-hot encoding. Now for this feature I'm not sure if that's the most reasonable way, because you do have the education on some sort of scale: it goes from low education to higher education, so it might make sense to represent it using one feature, where the larger the number, the higher the education. But for the other features, for example marital status or occupation, it doesn't really make sense to represent them on a numerical scale. It makes sense to one-hot encode them, to say that every single possible value is its own feature that we can set to zero or one. We're going to look at this in a second, but first we can explore some of the features. This is always the first part of the data science process, the exploration part, so exploring the data set, and we can do that first of
all, as we did here, by printing the data, but also by looking at individual features. For example we can go to education and print the value_counts, and then you can see what kind of education we have here: we have Bachelors, Masters, we have Preschool, and you can see they're sorted by frequency, so not by the level of the degree but by how many times each one occurs. I think this is high school graduation. You can do that for all these features, so you can say df.workclass value_counts, for example. The main reason you want to do that is that sometimes you might have a class that occurs exactly one time, and then you might want to drop that row because it's not that important. It's also important to see, I think it is in workclass and occupation, that we have, there you go, the same feature name: for example the question mark here and the question mark here are different features when we use one-hot encoding.

One-hot encoding, as I said, means taking this occupation feature, for example, and turning it into separate features. So professional specialty, craft repair, exec managerial: we're going to turn all of those into separate columns which are either zero or one, so into binary features. The problem is that we don't do this only for occupation; we also do it for workclass, for example, and then you have this feature and this feature, which have the same name but are actually different features. This is of course a problem, because then you don't know which one you're targeting when you're using the question mark, and for that we use a so-called prefix when doing the one-hot encoding. You can look at these things for relationship, for race, gender and stuff like that. This is important because some of the features might have only two values; for example gender has male and female. It doesn't necessarily make sense to one-hot encode
them, because if you look at df.gender and then value_counts, you can see that we only have male and female here. Because of that it doesn't make sense to make a feature male, which is binary, and a feature female, which is binary; it makes sense to turn the gender feature itself into a binary feature. The same is true for income: it doesn't make sense to have less than or equal to 50K as a column and larger than 50K as a column. It makes sense to just turn the income feature itself into a binary feature, 0 or 1, so you could say wealthy or not wealthy, for example.

But how do we actually one-hot encode something? It works quite easily. You can do that with scikit-learn, but we're going to do it even simpler, with pandas itself. You can say pd, which is pandas, get_dummies, and then provide a column, for example df.occupation, and you can see that it automatically turns this into binary features. The individual column values that we had before are turned into binary features, and every time, one of these columns is going to be set to one and all the other columns are going to be set to zero, because of course they're mutually exclusive: we only had one value before. So this would now be one-hot encoded, and to add these new columns to our data frame we would have to concatenate them. But first of all, we talked about the prefix, since we have a question mark here. To add a prefix, we say get_dummies and then add_prefix, and we can pass, in this case, occupation_, and then you see that all these column names are as before but with the prefix occupation_. To add this now to the data frame we say df equals pd.concat, so concatenation, and we want to take the data frame that we had before; actually we need to put this in a list, so we concatenate data frames that are
part of a list. One data frame is the one with the dummies here, this is the second one, and the first one is the data frame that we had before, but with a column dropped. The column that we drop is going to be the occupation column, because we don't want to keep the occupation when we already have all the separate occupations as binary features in here. So we're going to drop the occupation with axis equals 1, and we're going to say axis equals 1 for the concatenation as well. Then when we print the data frame you can see we don't have occupation here anymore, but we do have these binary features. When we do this for all these columns we will end up with a lot of features, which is fine. You can try to play around: you can drop certain features, or you can say, okay, I'm not going to one-hot encode anything, I'm going to do everything with an ordinal encoder, so on a numerical scale, and see if you get better results with that. But we're going to go with one-hot encoding for this video today.

So this will now be done for all the columns that have multiple categorical values. So occupation; we're going to do that for workclass, with workclass_ as the prefix; and for the education. Actually, do we want to do that for the education? We do already have the educational number, and I think the educational number is enough because, as I said, this is on a scale, so maybe we don't want to one-hot encode the education. We're just going to drop the education, so df equals df.drop, and we drop the education column, because we already have the educational number, which is going to be our feature. Then we're also going to drop the marital-status here, with marital-status_ as a prefix, and do the same thing for relationship. This is a little
bit tedious now, because we just need to type the column names here. So relationship_, and then we do the same thing for race, and do we have something else? I think we had one at the end, which was native country. So those are the ones that we're going to one-hot encode: relationship, race and native country, with race_ and native-country_ as prefixes.

All right, so we need to run this again because we already encoded one column. What's the problem here? Occupation; why is that? Let me just rerun this again, so reset the kernel and run all of this. We still get the problem: DataFrame object has no attribute occupation. Oh, of course, because we're accessing it here as well. So workclass here, marital-status here as well, then relationship here, then race here, and native-country here. This should work now; we rerun this again, and now we should only have these numerical features, these binary features, and the two features that we still need to encode as binary features. Everything except for income and gender can now be used as a numerical or binary feature.

To encode those two we're going to say df gender is going to be equal to df gender, and we apply a lambda expression: lambda x, so for each value we say 1 if x equals Male and else 0. The same will be done for the income: 1 if x is above 50, otherwise 0. Now you can see that everything here is numerical, so we don't have any categorical features as far as I know. We can also look at the columns here, columns.values, and you can see that those are the columns that we have now. We do have quite a lot of them, 92 columns in total, and those are the columns that we're going to use for training. Now what we can do for the purposes of visualization is we can also drop the
least important columns. First of all let's go with import seaborn as sns and import matplotlib.pyplot as plt. We can just go ahead and plot a correlation heat map by saying plt.figure with a figure size of 15 by 10, for example, or maybe let's go with 18 by 12, and then sns.heatmap of the df correlations, so df.corr. The annotation is going to be False because we have too many columns here, and the color map is going to be coolwarm. So this is the correlation heat map. It basically says how the features are correlated, and you can see that we cannot really see too much here.

So what's the problem here? Which column is that? There seems to be a column that we cannot use here, so let's just print df.corr. Which one is that? Income; income is NaN. Why is income NaN? Probably because income has the same value every time. Let me just see briefly what the problem with income is; this usually means that we have only one value, which is a problem. So income, yeah, it's always zero. Why is that? Oh, of course, because I'm stupid: we need above 50K. So we need to rerun this again, run everything from the top, and now it should work.

Okay, so now in our heat map we don't have any white rows. Those are the correlations, and we want to see which of the features are highly correlated, either negatively or positively, with the income. We can't really see that here because we have too many features, so what we can do is filter out all the features that are not correlated too much with income and only display the correlations between income and the high-correlation features. So we can say correlations is going to be equal to the df correlation, and we only care about the income correlations in absolute values, because a correlation of negative one is also very important and a correlation of one is very
important; only when you get close to zero is it insignificant. So I want to say sorted_correlations is going to be equal to correlations.sort_values. Then we want to say num_cols_to_drop, or something like that, is going to be the integer of, let's say we drop 80 percent of the columns, so the bottom 80 percent of correlations. We're going to drop them for the sake of visualization. We're going to use all of them for training; you could use a reduced data set for training as well, but we're going to use all of them for training, and for the visualization we drop the 80 percent of features with the lowest correlation. So that fraction times the length of the data frame columns. Then cols_to_drop, the actual columns to drop, are going to be the sorted correlations, and we use iloc to say we want the features up until the num_cols_to_drop index. Then we say df_dropped is going to be equal to df.drop, and we drop all these columns on axis 1. So this is going to be our df_dropped; we now have only 19 columns, so way fewer features.

We can do the same thing here: we can copy this code, paste it down here, and just replace df with df_dropped, and here we can also say annotation True. Maybe we should reduce the figure size; let's say 12 and 10. No, that's a little bit too small; 15 and 10, yeah, that works. This is actually too large, but it doesn't matter. We have the income here being correlated highly with, what is this, this is marital status. It seems like whether you're married or not has a high correlation with the income. The same is true for the relationship status husband, which is basically almost the same. Then the educational level seems to be important, and the age seems to be important, positively correlated. So being married is good for your income, being
older, having a higher education, obviously. The gender also seems to be important: since a man is a one and a woman is a zero, being a man seems to also be beneficial, at least in terms of correlation, when it comes to income. It could also be the other way around; I mean, no, income cannot cause you to be male, but it doesn't necessarily mean that you get higher pay just because you're male; it could be correlated in some other way. You also see here that capital gain seems to be important, and hours per week worked, which is also expected: the more you work, probably the more you earn. There is a negative correlation here with never married, so if you have never been married this is probably bad for your income, or at least negatively correlated. And here you have own child; this is interesting. Does own child mean that you have only, no, it's not the same as only child, right? What is own child, actually? I'm not sure what that means, but it's negatively correlated with income.

Now what we're hopefully going to see in the end is that correlations between features are not the same thing as feature importances. What we're going to do in this video today is train a random forest classifier on this data set, and this classifier, which will perform decently, will have feature importances, so it will tell us how important the individual features are. It is not always the case that the highly correlated features are the most important. So this is interesting to know, but those are the correlations here.

Let us move on to the next step: we're going to train our random forest classifier. The reason we choose a random forest classifier as a model here is that the nature of this data set is very decision-like, if I may say it like that. We don't have too many numerical features; we have a lot of binary features, so yes-or-no features, and this yes-or-no way of getting to an answer is very similar
to a decision tree, because a decision tree has branches and nodes, and it's either-or: either you go left or you go right. I think that the decision tree is the most natural way to approach this data set, and the random forest is basically just an ensemble of decision trees, which makes the random forest very powerful. So this is what we're going to use in this video today. We're going to say from sklearn.ensemble import RandomForestClassifier, and from sklearn.model_selection import train_test_split. And then from sklearn, or actually no, we're not going to scale anything here, because as I said we have mostly binary features, so I guess it doesn't really make a lot of sense to do scaling. Maybe if the results are not good we're going to use the StandardScaler, but for now we're just going to go with the random forest classifier and the simple train-test split.

So we're going to say train_df and test_df is going to be equal to train_test_split of the data frame with all the features and a test size of 0.2, so 20 percent of the data is going to be used for testing. This train_test_split also works with data frames, not just with NumPy arrays, so it is also able to split a data frame into a training data frame, in this case with more than 39,000 rows, and a test data frame with around 9,700 rows. To actually train the classifier we also need to split this into X and y data, so we have the features and we have the target value. We're going to say train_X is equal to train_df and then drop the income column on axis 1, and then train_y is equal to train_df income. The same can be done for the testing set, so test_X and test_y is going to be
equal to test_df dropping the income, and test_df income. Now we can say forest equals RandomForestClassifier, and we're just going to leave the default parameters; we're going to do hyperparameter tuning in a second. We can now say forest.fit and fit it on the train_X data and the train_y data. This shouldn't take too long, hopefully; there you go. Then we can evaluate it on the test data to see its performance by saying forest.score and then test_X and test_y, and you can see we get 85.64 percent accuracy, which is quite good. So 85 percent of the time we're accurate in predicting whether this person is earning more or less than 50K.

Now, to hopefully get even better results we can also do hyperparameter tuning, but before we do that I want to show you the feature importances, which can change if you retrain the model. We can just say forest.feature_importances_ and it will give us numbers for the importances of the different features. We also want to have the feature names, and instead of having to look up which one is which, we can just say dict of zip, so we zip these two lists together: forest.feature_names_in_ and forest.feature_importances_. We zip them together and turn them into a dictionary, and then we can see the feature importances. Furthermore we can also sort them. Let's say this is importances, and then importances is equal to key and value for key and value in sorted, and we sort the importances.items, and the key is going to be a lambda expression, lambda x, x at index 1, so we sort by the value, by the importances, in reverse order, so in descending order. Then we can print the importances, and as you can see the most important feature seems to be, actually this is important, so we should look up what that actually means: fnlwgt. It seems to be important; do we have some
information about that? Acknowledgements, let's see. There you go: continuous. Okay, so those are just the values; that doesn't really describe it. Maybe we should Google. What was the feature called? fnlwgt. This is final weight, which is the number of units in the target population. Where is this? The weights on the Current Population Survey files are controlled... okay, it doesn't really tell me what this is. Final weight, the number of people... believes the entry... ah, okay. So this is basically not really a feature; it just tells us how many people belong to that group. Maybe we should drop that feature, actually, because I don't think this is something that we can really get for a new person, at least I don't think so. So probably we should drop that feature before we train. Let's say df equals df.drop of that feature with axis 1, so the final weight is dropped. Now let's rerun this again and see if we get a massively different score. No, still 84.95 percent, pretty good.

Now let's see which one is the most important feature here: the age, the educational number, the hours worked per week. So it seems like the older you get, the higher your education and the more you work, the higher your income is going to be. Seems reasonable. Then you can also see that marital status is very important, and it seems like gender is still important, but not at all as important as it was in the correlation heat map. What else seems important? race_White, okay. The interpretation of this gender and race_White result is political, so I leave that up to you guys; I'm not going to comment on that. But you can see that this is how the random forest says the features influence the results, so this is how important the individual features are. Now let's go ahead and do some hyperparameter tuning and see if those feature importances, along with the performance of course, change, so that we can see if this is consistent with
the quote-unquote best model. So what we're going to do now is say from sklearn.model_selection import GridSearchCV, and we say that we want to have a parameter grid. The parameter grid basically lists the values that we want to try for the hyperparameters. You can look up the scikit-learn documentation for the RandomForestClassifier if you want to, so we can search for RandomForestClassifier, sklearn ensemble, and see what the parameters are, what their default values are, and what the parameters are for. You can use all of them for hyperparameter tuning, but for this parameter grid I'm just going to use n_estimators, where the default value is 100, I think, so we say 50, 100 and 250. Then max_depth, so how deep this can go: I'm going to limit this to 5, 10, 30 and None, so no limit, which is the default, I think. Then min_samples_split, so how many samples we need to split a node; I think the default is 2, and we say 2 and 4. And finally max_features, how they are going to be determined: we say sqrt, I think this is the default, and log2.
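The grid described above can be written down and counted before running any search. This is a minimal sketch using the values from the transcript; the variable names `param_grid` and `n_combinations` are assumptions, and `ParameterGrid` is used here only to count combinations, which is not something the video does.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for each hyperparameter, as listed in the transcript.
param_grid = {
    "n_estimators": [50, 100, 250],
    "max_depth": [5, 10, 30, None],
    "min_samples_split": [2, 4],
    "max_features": ["sqrt", "log2"],
}

# Every value is combined with every other one: 3 * 4 * 2 * 2 settings.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)
```

With scikit-learn's default 5-fold cross-validation, a grid search over these settings trains each combination five times, which is why the tuning step takes noticeably longer than a single fit.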
All right, so those are the different values of the parameters, and they will actually be tested in combination: 50 with 5 and 2 and sqrt, 50 with 5 and 2 and log2, 50 with 5 and 4 and sqrt, 50 with 5 and 4 and log2, then 50 with 10 and 2, and so on. All of these will be tried in combination. So we're going to say grid_search is going to be equal to GridSearchCV, and the estimator that we're actually tuning is going to be a RandomForestClassifier, the parameter grid is going to be our parameter grid, and we want this to be verbose so that we can see the progress, so verbose is going to be equal to 10. Then we just say grid_search.fit with X_train and y_train, or actually in our case it is train_X and train_y. You can see it now tries all these combinations and shows you the progress. I'm going to skip that part because it's going to take some time. You can see it keeps track of the score, and then the best model will be returned as a result of this hyperparameter tuning.

All right, so now the tuning is done, and to get the best result, the optimal classifier, we say grid_search.best_estimator_, and you can see that the best setting seems to be a max_depth of 30 and a min_samples_split of 4. Now theoretically, if you want to explore this further, let's just close this here, you can see that the max_depth of 30 is the maximum value that we have besides None, so we could also explore maybe 50 or 60 or something like that. The same is true for min_samples_split of 4: maybe 6, 8, 10 or something like that is even better. So if you want to, you can explore that further. But let's go ahead now and see what the accuracy is with this one. The grid search best estimator is going to be the forest, so forest is equal to grid_search.best_estimator_, and
then let's do forest.score with test_X and test_y. We get 86.12 percent, which is very good. Now we can look at the feature importances; actually we can rerun this code here, let me just copy this again, there you go. The importances now are still age, educational number, capital gain, marital status, hours per week. So it seems like hours per week is now less important with this quote-unquote better model, for whatever reason, and it seems that race_White is also less important for this one, and gender is still about the same. Not in family, own child, wife seem to also be somewhat important. The least important feature of all is whether you're from the Netherlands or whether you have never worked. This is interesting: workclass never worked is irrelevant, it seems, to this decision, which is interesting because if you have never worked, I think that is a good predictor for not having a high income. But maybe it is because we don't have too many people that have never worked. Could that be true? Yeah, we only have 10 instances of never worked.

So this is how you can build a machine learning model that predicts people's incomes. That's it for today's video; I hope you enjoyed it and hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye.

Original Description

Today we build a machine learning model, that predicts people's income.

Dataset: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset

◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾

📚 Programming Books & Merch 📚
🐍 The Python Bible Book: https://www.neuralnine.com/books/
💻 The Algorithm Bible Book: https://www.neuralnine.com/books/
👕 Programming Merch: https://www.neuralnine.com/shop

💼 Services 💼
💻 Freelancing & Tutoring: https://www.neuralnine.com/services

🌐 Social Media & Contact 🌐
📱 Website: https://www.neuralnine.com/
📷 Instagram: https://www.instagram.com/neuralnine
🐦 Twitter: https://twitter.com/neuralnine
🤵 LinkedIn: https://www.linkedin.com/company/neuralnine/
📁 GitHub: https://github.com/NeuralNine
🎙 Discord: https://discord.gg/JU4xr8U3dm

This video teaches how to build a machine learning model for income prediction using the Adult Income dataset. It covers data preprocessing, feature engineering, training a Random Forest classifier, and hyperparameter tuning with Grid Search CV. The project demonstrates how to use Python libraries such as pandas, scikit-learn, and numpy for data science and machine learning tasks.

Key Takeaways

Download the Adult Income dataset from Kaggle
Preprocess the data by encoding categorical features and dropping irrelevant columns
Split the data into training and testing sets
Train a Random Forest classifier on the training data
Perform hyperparameter tuning using Grid Search CV
Evaluate the model on the testing data and visualize feature importances
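The preprocessing, splitting, and training steps listed above could look roughly like the sketch below. It runs on a small synthetic table rather than the Kaggle CSV, so every value is made up, and the choice of which column to drop (fnlwgt here) and which columns to encode is an illustrative assumption, not a record of what the video does.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in with a few Adult-Income-style columns (values are made up).
raw = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "hours-per-week": rng.integers(10, 70, n),
    "occupation": rng.choice(["Sales", "Tech-support", "Craft-repair"], n),
    "gender": rng.choice(["Male", "Female"], n),
    "fnlwgt": rng.integers(10_000, 500_000, n),
})
raw["income"] = np.where(raw["age"] + raw["hours-per-week"] > 90, ">50K", "<=50K")

# 1. Drop a column assumed to be irrelevant for prediction.
data = raw.drop(columns=["fnlwgt"])

# 2. One-hot encode a multi-valued categorical feature.
data = pd.get_dummies(data, columns=["occupation"], dtype=int)

# 3. Encode two-valued columns (including the target) as 0/1.
data["gender"] = (data["gender"] == "Male").astype(int)
data["income"] = (data["income"] == ">50K").astype(int)

# 4. Split into features/target and into train/test sets.
X = data.drop(columns=["income"])
y = data["income"]
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Train a random forest and evaluate it on the held-out data.
forest = RandomForestClassifier(random_state=42).fit(train_X, train_y)
print(f"test accuracy: {forest.score(test_X, test_y):.3f}")
```

One-hot encoding keeps the model from reading an artificial order into category codes, at the cost of one extra column per category.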

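💡 Hyperparameter tuning using Grid Search CV can improve the accuracy of the machine learning model, although the size of the gain depends on the data and on the parameter grid. In the video, the tuned Random Forest reaches 86.12% accuracy on the test set.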
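As a rough illustration of the tuning step, the sketch below compares a default random forest with the best estimator found by scikit-learn's GridSearchCV. The data comes from make_classification, and the parameter grid is a small hypothetical one picked to keep the run short; neither is taken from the video. On a given dataset the tuned model may score higher, the same, or even slightly lower than the default on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the encoded dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: a forest with default hyperparameters.
baseline = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Hypothetical grid: 2 * 2 * 2 = 8 combinations, each scored with 3-fold CV.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_split": [2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(train_X, train_y)

# best_estimator_ is refit on the full training set with the best parameters.
tuned = search.best_estimator_
baseline_acc = baseline.score(test_X, test_y)
tuned_acc = tuned.score(test_X, test_y)

print("best parameters:", search.best_params_)
print(f"baseline test accuracy: {baseline_acc:.3f}")
print(f"tuned test accuracy:    {tuned_acc:.3f}")
```

The grid search only ever sees the training split; the test set is used once at the end, so the reported accuracy is not biased by the tuning itself.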
