Feature Encoding 101: Prepare Data For Machine Learning
Key Takeaways
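Covers the most important feature encoding methods for machine learning in Python: label, ordinal, one-hot, binary, frequency, and target encoding, plus learned embeddings, and which kinds of features each one suits.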
Full Transcript
What is going on, guys, welcome back. In this video today we're going to learn about the different methods that we can use to encode features for machine learning, so let us get right into it.
All right, so we're going to talk about feature encoding in this video today, and to be precise, we're going to talk about the most important ways in which we can encode features for machine learning in Python. For this I have prepared a Jupyter notebook. The basic idea of feature encoding is that when we work with data sets, we get a lot of features that are not numerical. When we train machine learning models, be it a neural network, a support vector machine, a logistic regression, whatever we train, it's basically just calculations with numbers. So every feature that we have at some point has to become a numerical value, and because of that we need to encode the features that are not numerical by default: stuff like male and female, cabin numbers that contain characters, locations, categorical values like S, C, and Q here, and so on. All these values are not ready to be used right away by machine learning models because they are not numbers. We cannot just train a model on them; we need to somehow turn them into numbers. However, turning them into numbers can happen in a variety of different ways, and some of them are more intelligent and more appropriate for specific types of features than others. This is what we're going to talk about in this video today. I want to give you an overview of the different encoding methods, and I want to explain which one is the most appropriate for what kind of feature.
We're going to start with a very simple one first, which is the label encoder, and we're going to combine it right away with the ordinal encoder because they are quite similar. The only difference is that the ordinal encoder specifies an order. Let me briefly give you an example in Paint so that you understand the difference. Let's say we have a feature called favorite programming language, and it has values like C++, C, Python, Java, Haskell, and so on. This feature now needs to be encoded into a numerical feature. Label encoding would just take the values and map numbers to them: C++ would become one, C would become two, Python three, Java four, Haskell five. It doesn't really matter what the order is; we could also swap these numbers. The important thing is that every categorical value has a numerical value assigned to it that is not the same as for the other ones, so that we are able to distinguish them. Now, this is not optimal for a feature like this because there is no progression. Numerically, 1, 2, 3, 4, 5 are progressing, they're getting higher, but C++, C, Python, Java, and Haskell don't really progress. Of course you can choose to order them somehow so that they progress, maybe from closest to the hardware to most abstract, or by release date or popularity, but in general for this kind of feature it doesn't make a lot of sense to talk about order. It makes sense to encode them in a different way, which we're going to talk about in a second. But that would be label encoding: just taking the values and assigning numbers to them.
Ordinal encoding would be more useful for features like education level. If I have a feature highest education level with values like elementary school, junior high school, high school, bachelor's degree, master's degree, PhD, and so on, these values have an inherent order. It doesn't make sense to say master's degree is one, elementary school is two, high school is three, because a progression that could be there gets lost. If you take all these values, shuffle them, and just assign random numbers to them, you're making it very hard for the model to spot the pattern where a higher value means a higher education level. It has a harder time spotting the progression of a feature that has an order in it, and it just treats it as a basic categorical feature. So what you want to do with ordinal encoding is order the values in the correct way: elementary, then junior high, then high school, then bachelor's degree, then master's degree, then PhD. Then you can assign the values one, two, three, four, and so on, because now it also makes sense to have correlations. Maybe you have another feature, income, and income has a high correlation with education level because the higher the education level, the higher the income. But if I just shuffle these randomly and do label encoding, I won't have such a correlation, because maybe PhD is one, master's degree is six, and in between at three I have elementary school, and that would kill the correlation. So that is the difference between these two methods.
You can see them here in code. The label encoder just takes the feature; in this case we work with Embarked. Embarked has three values, S, C, and Q, which are just locations, and when we encode them using label encoding it just encodes them in some way. It actually uses 0, 1, 2, not 1, 2, 3, and you can see here for Q, S, and C that S is 2, Q is 1, and C is 0. The ordinal encoder does the exact same thing; the only difference is that I can provide an order, S, C, Q, which means S is going to be zero. You can see the difference here: this is the previous one, the label encoding, where S is 2, and this is now the ordinal encoding, where S is 0. So 0, 1, 2 in the order I provided, that's the difference.
Then we have an encoding method that is more useful for categorical values like the programming language, which is one-hot encoding. This is only a good choice if the cardinality is not too high, so if you don't have too many different values. What we basically do with one-hot encoding is take the feature again, like favorite programming language. If I have values like C#, Python, C, and Java, I turn each of these values into a separate feature. These are no longer values; each of them is now a binary feature. So I have the features favorite programming language C#, favorite programming language Python, favorite programming language C, favorite programming language Java, and so on, and each of them can be zero or one. If my favorite programming language is Python, the value before was Python, a text value for this feature. Now, if these are the only languages that exist in the data set, it would be these four features: not C#, Python yes, not C, and not Java. So I have four binary numerical features that can be zero or one, rather than one feature that has four different textual values. The problem is that if I have all the programming languages that exist here, the cardinality is too high and it doesn't make a lot of sense to do it this way. But let's say I have a pool of 10 programming languages to choose from; then it would be a good idea to use one-hot encoding, because I just have these binary features, which are especially useful for something like decision trees, where you can just say yes, no, go left, go right. So if my favorite programming language were C#, I would get 1, 0, 0, 0, and so on. That is the idea of one-hot encoding.
We can see that in action in our code here. In this case I'm using the get_dummies function from pandas; you can also use the OneHotEncoder from scikit-learn, it doesn't really matter. What's happening here is that I have the Embarked column with S, Q, and C, but now if I go to the right you can see we have Embarked_C, Embarked_Q, and Embarked_S. Here the value is S, so we get 0, 0, 1, with the one in the S column. Here it's Q, so we get zero, one in the Q column, and then zero. So that is one-hot encoding. This makes more sense if your feature is a categorical feature that doesn't have an inherent order. The education level, for example, has an inherent order, so it would make more sense to ordinal encode it, but for categorical features without an order it makes more sense to use one-hot encoding.
Then we also have binary encoding, which is more useful when we have high-cardinality features, so if we have many, many values. What we do is not really similar to one-hot encoding, but we again produce multiple columns with zeros and ones, and they now represent binary numbers. To keep this simple I'm going to give you an example again. If I have the values Python, Java, C, C++, C#, Haskell, and Rust, these would be seven values that I can possibly have. For those of you who understand the binary system: if you use three bits, zero is represented as 000, one as 001, then you have two as 010, three as 011, four as 100, five as 101, six as 110, and you could actually add another one, which would be seven, 111. What we're doing with binary encoding is mapping one such representation to each of the values. So we say that Rust, for example, has the value 110, and instead of storing this as a string, since we have only seven values, we store it as three separate columns. Seven values can be represented using three bits because two to the power of three is eight. What I can do now is say bit one, bit two, bit three, or maybe the other way around, and then I have these three columns. So that is binary encoding, and it needs far fewer columns when the feature has many values.
We can take a look at this. I use Embarked here again. Embarked only has three values, so I can use just two bits, and you can see I have Embarked_0 and Embarked_1, and the individual values are represented as 0 1, 1 0, and I'm actually not sure what the other one is represented as, but that's the basic idea: we represent these values as binary numbers.
Then we have stuff that loses data. Frequency encoding is one way to encode a feature that loses a lot of information about that feature, but it can make sense to use it if the frequency has a strong correlation with the target variable. Basically we're saying: I'm losing the actual value of the feature, but I replace it with the frequency of that value for that feature. So instead of saying C, I ask how often C occurs as a value of Embarked, and then I take that relative frequency, 20% of the time, as the value that I'm using. I'm losing a lot of information, as I said, but I have a different piece of information now: how often is this value the value for this feature? You can see that 20% of people embarked at C, almost 70% of people embarked at S, and 9%, or almost 10%, of people embarked at Q. So that is a way to encode this. It ends up being a number. Of course you have to drop the original feature, unless you also want to apply some other encoding method to it. You do have a number, but this number is not the same as the feature itself. Sometimes it may be enough, sometimes it may make sense to use that, and this is just another way in which you can encode a feature.
Another way that also loses information, but can oftentimes be quite useful, and oftentimes can even lead to overfitting, is target encoding, which means that I'm taking the mean target value for each feature value. To give you an example, let me open up Paint again. Let's go with education level, and I have high school, bachelor's degree, master's degree, and PhD, let's say, and my model is predicting the income that a person has. Then instead of encoding the education itself, I can say: give me the incomes of all the people that have a high school education, take the mean of that, and replace the education value by the mean target value, target because we're trying to predict income, so by the mean income for this education level. I do the same thing for the bachelor's degree and for the master's degree, for all the values: taking the average and then replacing the education level with the average income for that education level. In the case of Embarked here, Survived is a binary value, so it probably doesn't make a lot of sense here, or actually it can make sense. Basically we're saying: for the people that embarked at this location, what is the mean survival rate? So 33% for S, and then I think 55% for C, so it seems to make a difference where you embark. What you do is just replace the feature by the mean target value for that feature group. I hope that makes sense. Basically we're saying: take all the people that embarked at S, look at how many of them survived on average, and then use this as my encoding for the Embarked value. That's what we're doing here.
The last thing that I want to talk about is embeddings. Embeddings are trainable representations. This is something that is also very relevant nowadays because we're working with large language models, and transformers do this, and certain types of recurrent networks utilize this as well, but especially transformers. Embeddings basically take some values, some tokens, some IDs, and put them into a high-dimensional vector space. Let's visualize three dimensions for now to keep it simple, but in reality we have many more dimensions. Let's say I have a value cat. What happens now with cat is that I might turn it into the number one, and then I might take this number and turn it into a three-dimensional vector, so somewhere in this 3D space there's a point cat. Maybe I also have dog, and dog might be two, and dog is also turned into a vector that is maybe not too far away from cat. Then maybe I have computer as a term, I give it a three, and I learn over time that a good embedding for computer is over here. The idea is that the individual values are going to be placed in vector space so that similar stuff is close together: these two are animals, and here maybe we have laptop and stuff like this, and they're close together as well. You can even do stuff like this: if I have human here, this is now getting a little bit messy, and maybe I have something like programmer, it would be somewhere in between computer and human, so I don't know, maybe here. That's just a simple example. Or you could have something like programmer, and then you could have something like Python here, and in this direction you could have something like MATLAB, and then you would have something like idiot here. No, okay, just kidding, but you get the idea. You could have these different places in vector space, and there is not really a correct way to place these points. These are learned representations, which means that we train a neural network to figure out how to place these points so that they are meaningfully placed. You have, let's say, 100 dimensions or 1,000 dimensions and you can place the points somewhere, and the neural network has to figure out where to place these points to make sure that it can work with them in a reasonable way.
So what we do here, just as an example, without training, is show what happens with a randomly initialized embedding. We have PyTorch here. We say we want to have an embedding that allows for 150 unique values, with an embedding dimension of eight, which means our vectors have eight values, and we just take some identifiers and embed them into vector space. As a result we get these eight-dimensional vectors, which are just positions in the eight-dimensional vector space. Right now they are random, but this is trained so that these positions are actually optimal, or useful at least. And here we do this for our Embarked values: we create an embedding layer, we say we want to have three different values and also eight dimensions, and then we take our Embarked column and put it into vector space, again randomly initialized, but it can be trained to be more useful.
So these are the most important feature encoding methods that you need to know. To briefly recap: the label encoder just assigns numerical values to the individual non-numerical values. The ordinal encoder does the same thing but with an order, so we can specify the order so that 1, 2, 3, 4, and so on are assigned meaningfully. Then we have one-hot encoding, which takes the individual values of a categorical feature and turns them into binary features. Then we have binary encoding, which uses binary numbers, so bit representations, to allow for more categorical values while needing fewer columns. Then we have frequency encoding, where we replace the feature value by the frequency of this feature value in the feature. Target encoding is then just taking the value of the feature and replacing it with the mean target value for that feature value. And then we have embeddings, which we just talked about, which are representations in vector space that can be learned to be more useful.
So that's it for today's video. I hope you enjoyed it and hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye.
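The difference between label and ordinal encoding described in the transcript can be sketched in a few lines of plain Python. This is an illustration only, not the notebook from the video (which uses scikit-learn's LabelEncoder and OrdinalEncoder); the sample values and the names levels and education_order are made up for this example.

```python
# Hypothetical sample values, loosely modelled on the education example.
levels = ["High School", "PhD", "Elementary School",
          "Master's Degree", "Bachelor's Degree", "High School"]

# Label encoding: any consistent value -> integer mapping will do.
# Here the distinct values are numbered in alphabetical order.
label_map = {value: code for code, value in enumerate(sorted(set(levels)))}
label_encoded = [label_map[value] for value in levels]

# Ordinal encoding: the caller supplies the meaningful order explicitly.
education_order = ["Elementary School", "High School", "Bachelor's Degree",
                   "Master's Degree", "PhD"]
ordinal_map = {value: code for code, value in enumerate(education_order)}
ordinal_encoded = [ordinal_map[value] for value in levels]

print(label_encoded)    # codes follow alphabetical order, not education level
print(ordinal_encoded)  # codes grow with education level
```

With the alphabetical mapping, a bachelor's degree gets the lowest code, so the numbers no longer track the education level; the explicit order restores that progression.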
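One-hot encoding as described in the transcript turns each distinct value into its own 0/1 column. The following is a minimal pure-Python sketch of that idea; the helper one_hot_encode and the sample rows are assumptions for illustration, and in practice pandas get_dummies or scikit-learn's OneHotEncoder (both mentioned in the video) would normally be used instead.

```python
def one_hot_encode(values, prefix):
    """Return one dict per row with a 0/1 column for every distinct value."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Hypothetical sample of the Embarked column.
embarked = ["S", "Q", "C", "S"]
rows = one_hot_encode(embarked, "Embarked")

for value, row in zip(embarked, rows):
    print(value, row)
```

Each row has exactly one column set to 1, which is why the number of columns grows with the cardinality of the feature.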
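The binary encoding idea from the transcript (give every value an integer code and store its bits in separate columns) might look roughly like this. The helper binary_encode is hypothetical, and it numbers the categories from 0 in alphabetical order; dedicated encoder libraries may assign the codes differently, so the exact bit patterns should not be taken as what the video's notebook produces.

```python
def binary_encode(values, prefix):
    """Map each distinct value to an integer code and spread its bits over columns."""
    categories = sorted(set(values))
    # Bits needed to distinguish all categories (at least one column).
    n_bits = max(1, (len(categories) - 1).bit_length())
    codes = {category: index for index, category in enumerate(categories)}
    rows = []
    for value in values:
        bits = format(codes[value], f"0{n_bits}b")  # e.g. code 6 -> "110"
        rows.append({f"{prefix}_{i}": int(bit) for i, bit in enumerate(bits)})
    return rows

# The seven languages from the example; seven values fit into three bits.
languages = ["Python", "Java", "C", "C++", "C#", "Haskell", "Rust"]
encoded = binary_encode(languages, "lang")

for language, row in zip(languages, encoded):
    print(language, row)
```

Seven categories need three columns here, whereas one-hot encoding would need seven, and the gap widens quickly as the number of categories grows.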
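Frequency encoding, as explained in the transcript, replaces each value with how often it occurs in the column. A small sketch using only the standard library follows; the ten sample rows are invented so that the shares roughly resemble the 70% / 20% / 10% split mentioned in the video, and frequency_encode is a hypothetical helper name.

```python
from collections import Counter

def frequency_encode(values):
    """Replace every value by its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]

# Hypothetical sample: 7 x S, 2 x C, 1 x Q.
embarked = ["S"] * 7 + ["C"] * 2 + ["Q"]
encoded = frequency_encode(embarked)

print(sorted(set(zip(embarked, encoded))))
```

Two different values that happen to occur equally often would receive the same number, which is the information loss the transcript refers to.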
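Target encoding, as described in the transcript, replaces each category with the mean of the target over the rows that share that category. The sketch below is a plain-Python illustration with invented data; target_encode is a hypothetical helper. It computes the means on the same rows it encodes, which is the simplest variant and also the one most prone to the overfitting mentioned in the video, so in practice the means are usually estimated on training data only.

```python
from collections import defaultdict
from statistics import mean

def target_encode(categories, targets):
    """Replace each category by the mean target value of its group."""
    grouped = defaultdict(list)
    for category, target in zip(categories, targets):
        grouped[category].append(target)
    means = {category: mean(values) for category, values in grouped.items()}
    return [means[category] for category in categories], means

# Hypothetical rows: where a passenger embarked and whether they survived.
embarked = ["S", "S", "S", "C", "C", "Q"]
survived = [0, 0, 1, 1, 0, 1]

encoded, group_means = target_encode(embarked, survived)
print(group_means)
print(encoded)
```

Note that the single Q row is encoded as exactly its own target value, which shows how rare categories can leak the target into the feature.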
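An embedding layer is, at its core, a lookup table with one vector per ID. The video uses PyTorch for this; the sketch below is a dependency-free stand-in that only mimics the randomly initialized, untrained state shown there. The class EmbeddingTable, the ID mapping, and the seed are assumptions for illustration; in a real model the vectors would be parameters that training updates.

```python
import random

class EmbeddingTable:
    """Untrained lookup table: one random vector per integer ID."""

    def __init__(self, num_embeddings, embedding_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)
        ]

    def __call__(self, ids):
        # Looking up an ID simply returns its row of the table.
        return [self.weights[i] for i in ids]

# Three distinct Embarked values, each mapped to an integer ID first.
embarked = ["S", "C", "Q", "S"]
value_to_id = {"S": 0, "C": 1, "Q": 2}

table = EmbeddingTable(num_embeddings=3, embedding_dim=8)
vectors = table([value_to_id[value] for value in embarked])

print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The same value always maps to the same vector; what training changes is where those vectors sit, so that values that behave similarly end up close together.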
Original Description
Today we learn about various feature encoding methods. These are important in order to turn all sorts of features into meaningful numerical representations.
◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾
📚 Programming Books & Merch 📚
🐍 The Python Bible Book: https://www.neuralnine.com/books/
💻 The Algorithm Bible Book: https://www.neuralnine.com/books/
👕 Programming Merch: https://www.neuralnine.com/shop
💼 Services 💼
💻 Freelancing & Tutoring: https://www.neuralnine.com/services
🌐 Social Media & Contact 🌐
📱 Website: https://www.neuralnine.com/
📷 Instagram: https://www.instagram.com/neuralnine
🐦 Twitter: https://twitter.com/neuralnine
🤵 LinkedIn: https://www.linkedin.com/company/neuralnine/
📁 GitHub: https://github.com/NeuralNine
🎙 Discord: https://discord.gg/JU4xr8U3dm