Spam Classifier using Naive Bayes in Python

Aladdin Persson · Beginner ·📐 ML Fundamentals ·6y ago

Key Takeaways

This video demonstrates how to build a spam classifier using Naive Bayes in Python, covering the implementation of a Naive Bayes classifier to classify emails into spam and not spam, and utilizing tools such as pandas, numpy, and NLTK for data processing and tokenization. The video provides a step-by-step guide on how to use these tools to achieve a 92% accuracy on training data.

Full Transcript

welcome back for part two now I want to see how we can use this naive Bayes implementation to classify emails into spam or not spam so let's just first look at sort of the data set that we have I'll link in the description so you can download this data set if you want to look at it as well but really just we have so we have two columns here one for the email so this is for example a spam one for spam so it's like save your money by getting this thing here cetera and then if we scroll down we have sort of examples of of non spam and in total we have let's see it's about six thousand emails and so what we want to do is first of all we want to build a vocabulary and we're gonna build the vocabulary from the words that we see in the email in the data set so there are essentially two ways you could do this and do this you could download a vocabulary English vocabulary list dictionary and use that but we're gonna build our own just from the data set so let's start with that or we want to do first is we want to import pandas as PD we want to import numpy as NP and also we're gonna from NLT k that corpus we're going to input words so this one has we're gonna use the NLT K words for kind of looking through each word in our email and we're gonna see is this an actual word in English word and then we're gonna add it to our dictionary because there can be a lot of nonsensical words written in emails so we're gonna essentially just take a subset of all the words in the email capillary which is an empty dictionary we're gonna do data which is gonna read CSV and now it's in data data sub sub folder and then emails dot CSV then we're going to NLT k dot download words and then we're just gonna do set words we're gonna do this words that words into a set which makes it run a lot faster you know yeah okay so we need to import NLT k as well and then we're gonna define build vocabulary let's just do it for current email so one email at a time and one thing that we're gonna use later when we build sort of the X&Y numpy arrays is we're gonna use sort of a specific feature so the zero feature for example the first feature is a specific word so we're gonna keep track of the words and we're gonna store them at a specific index so what we're gonna do first we're gonna forward in current email and then we're gonna do if we're that lower not in vocabulary so if we don't have this word and word got lower in set words so essentially if we don't have the the word stored and it's also an English word then we're gonna add it but we're gonna add and we're gonna add it to a specific index and then we're gonna add index plus one and we're gonna say that the index is the length of the vocabulary at this current time so you can essentially see that we're when we're adding a first word it's gonna be index zero and then the second word is index one third word index three etc and then all we're gonna do is if name equals main we're gonna do for I in range of let's see we need to also low our data yes we loaded data here so we're gonna do four iron range of data that shape of zeros that's all of the the emails then we're going to do current email we're gonna use data that I look from the pandas and then we're gonna do for that specific training example all right then we're gonna take all of yeah we're gonna take the zero one for the email and then we're gonna do dot split let's do print current email is let's see I of so how many we've processed so far and the let's see the length of vocabulary is so sort of just to see how how many have a process so far and what is the current length of our vocabulary then we're just gonna call build vocabulary of our current email and then what we're gonna do in the end is we're gonna do after this is run we're going to file open and then we're gonna open vocabulary txt and then we're gonna do bright so all we're gonna do is we're just gonna print this entire dictionary to a ship to a text file and then if I like right string of vocabulary as a string and then we're just going to file close and yeah so okay we don't actually need an umpire it seems like remove that one and all right so let's try to run this and we get so this is kind of a small data set we just have 5,000 emails and we can see that at the end we have 12,000 words in our dictionary and of course if we increased the number of emails there would be more words so the vocabulary would increase but for this data set we just have 12,000 so now we want to go through each email and map each email into some X&Y data set that we can then send in to our naivebayes implementation and the way we're gonna do this is let's say we have some dictionary okay some the vocabulary that we just created let's say that we have some words let's say we have our quark and then we have some words then we have bye and then we have a say money is in our vocabulary and then we have some last word let's say Zulu and what we're gonna do is we're gonna go through each email and we're gonna count how many times does this word exist in this specific email and let's say it's 0 times and then we're gonna you know get to buy and we're gonna see maybe it's eight times and then we're gonna see you know let's say money and it's 10 times and yeah so we have then lastly let's say 0 times we're sort of gonna map each email into some frequency array and we're gonna do this for each email and this is then gonna be the input bill naivebayes so we have 12,000 words in our vocabulary so each training example is gonna be 12,000 features all right so let's do import pandas as pv import numpy as NP and then we're gonna use some library called P port just and this is just to read the dictionary file that we created so first we're gonna do data that panda read CSV and we're gonna read emails then we're going to open the vocabulary text and we're gonna read it then we're gonna do content is file that read and we're gonna do vocabulary is asked that literal eval and then contents and so this is so we read it and it's a string and it's gonna evaluate it into a dictionary there might be better ways to do this but we're not really gonna focus on that we just get a dictionary back now that we print it to the text file and then we're gonna create our X we're going to MP dot zeroes and we get the data that shape of 0 because we're gonna have all of the training examples then we're gonna have the length of our vocabulary so that's the dimension for each example and then Y is just going to be MP zeroes and data the shape of 0 and we're gonna do for I in range data shape zero we're gonna get our current email I'm gonna do this not split and then we're gonna do for email word in email so we're gonna go through each word in the email we're gonna do if female dot email word that lower in vocabulary then we're getting so if the if it isn't new vocabulary then we're gonna add that add one to the specific column where we restored that that word in our in our so in our X for example so which feature belongs to that specific word and since we stored in the vocabulary we stored in index which represents the like the column for that specific word we can just do vocabulary of email word and then we're going to plus equals one and similarly I of I is gonna be the the class of that one which is stored in in this in the first column or this second column of data so data I comma one and yeah so let's then print for example X of zero and all so we print how does the the feature vector look for the first email all right so we get we get back something like this not really sure which belongs to which which word is the first one in second but we can see that there's a word here that it repeats itself seven times through the email so we can print sort of the shape of X and we get so we have about six thousand training example and each example is about twelve thousand dimensional feature vector so what we want to do then is we want to store this in an array so we can load it for the night when we send it into Navy base so we can do X is MP dot correctly just 2 MP dot save and the storage in data and then X dot and py I'm just gonna store X and then also NP dot save data Y that MP Y and then just Y and let's see invalid syntax yeah so it should be a comma here and let's run this and we should have it saved in that folder and as we sort of can see now we have the email CSV file and then we also have the entire 5 I guess 6000 by 12000 matrix for the data and we also have the wire labels right here in mpy files so let's now go back to the naive Bayes implementation right the one we have right here and we're gonna change how we load the data here so we're just gonna do let's comment out this actually and just do print or actually X is MP dot load and then data and then X dot MP Y then Y is MP dot load data Y dot MP Y and then print x and y shape and then do the prediction and you get to plate it again you can split it in to train and test data we just use the training data and let's see so yeah we get we loaded and we get the shapes that we expect and then training so it was pretty quick 92% accuracy which I guess doesn't say that much since we need to know how many are spam but we can check that how many some of Y so divided by Y that shape of 0 so about 24% spam and yeah so it seems to do reasonable and this is again like this is the training data so might do worse on test data of course so one thing we could do is we could check let's see what we predict to be spam and we could do how that looks like so we can do data dot I look and then for example the so all of the first three ones we can do two and then we could just do the the email and then dot split and we can read that Oh actually not that's please let's just print like this yeah so we can see here they want to sell some some home or home loan whatever it seems like spam and we could also do let's say 2500 I know this one is not spam so we could see how that looks like and yeah so one thing you could do is you can sort of look at how it classifies and make sure that it makes sense and it seems to make sense for for this this data that we have thank you so much for watching the video and leave a comment if you have any questions and I hope to see you in the next video [Music]

Original Description

In this video we use our implementation from the last video of naive bayes classifier to classify emails into spam and not spam using the Python programming language. Other words for this: Spam filtering and spam detection using Naive Bayes. ❤️ Support the channel ❤️ https://www.youtube.com/channel/UCkzW5JSFwvKRjXABI-UTAkQ/join Paid Courses I recommend for learning (affiliate links, no extra cost for you): ⭐ Machine Learning Specialization https://bit.ly/3hjTBBt ⭐ Deep Learning Specialization https://bit.ly/3YcUkoI 📘 MLOps Specialization http://bit.ly/3wibaWy 📘 GAN Specialization https://bit.ly/3FmnZDl 📘 NLP Specialization http://bit.ly/3GXoQuP ✨ Free Resources that are great: NLP: https://web.stanford.edu/class/cs224n/ CV: http://cs231n.stanford.edu/ Deployment: https://fullstackdeeplearning.com/ FastAI: https://www.fast.ai/ 💻 My Deep Learning Setup and Recording Setup: https://www.amazon.com/shop/aladdinpersson GitHub Repository: https://github.com/aladdinpersson/Machine-Learning-Collection ✅ One-Time Donations: Paypal: https://bit.ly/3buoRYH ▶️ You Can Connect with me on: Twitter - https://twitter.com/aladdinpersson LinkedIn - https://www.linkedin.com/in/aladdin-persson-a95384153/ Github - https://github.com/aladdinpersson
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aladdin Persson · Aladdin Persson · 45 of 60

1 computeCost.m Linear Regression Cost Function - Machine Learning
computeCost.m Linear Regression Cost Function - Machine Learning
Aladdin Persson
2 gradientDescent.m Gradient Descent Implementation -  Machine Learning
gradientDescent.m Gradient Descent Implementation - Machine Learning
Aladdin Persson
3 Neural Network from scratch - Part 1 (Standard Notation)
Neural Network from scratch - Part 1 (Standard Notation)
Aladdin Persson
4 Neural Network from scratch - Part 2 (Forward Propagation)
Neural Network from scratch - Part 2 (Forward Propagation)
Aladdin Persson
5 Neural Network from scratch - Part 3 (Backward Propagation)
Neural Network from scratch - Part 3 (Backward Propagation)
Aladdin Persson
6 Neural Network from scratch - Part 4 (With Python)
Neural Network from scratch - Part 4 (With Python)
Aladdin Persson
7 sigmoid.m - Programming Assignment 2 Machine Learning
sigmoid.m - Programming Assignment 2 Machine Learning
Aladdin Persson
8 costFunction.m - Programming Assignment 2 Machine Learning
costFunction.m - Programming Assignment 2 Machine Learning
Aladdin Persson
9 predict.m - Programming Assignment 2 Machine Learning
predict.m - Programming Assignment 2 Machine Learning
Aladdin Persson
10 costFunctionReg.m - Programming Assignment 2 Machine Learning
costFunctionReg.m - Programming Assignment 2 Machine Learning
Aladdin Persson
11 lrCostFunction.m - Programming Assignment 3 Machine Learning
lrCostFunction.m - Programming Assignment 3 Machine Learning
Aladdin Persson
12 oneVsAll.m - Programming Assignment 3 Machine Learning
oneVsAll.m - Programming Assignment 3 Machine Learning
Aladdin Persson
13 predictOneVsAll.m - Programming Assignment 3 Machine Learning
predictOneVsAll.m - Programming Assignment 3 Machine Learning
Aladdin Persson
14 predict.m - Programming Assignment 3 Machine Learning
predict.m - Programming Assignment 3 Machine Learning
Aladdin Persson
15 Caesar Cipher Encryption and Decryption with example
Caesar Cipher Encryption and Decryption with example
Aladdin Persson
16 Cryptography: Caesar Cipher Python
Cryptography: Caesar Cipher Python
Aladdin Persson
17 Vigenere Cipher Explained (with Example)
Vigenere Cipher Explained (with Example)
Aladdin Persson
18 Cryptography: Vigenere Cipher Python
Cryptography: Vigenere Cipher Python
Aladdin Persson
19 Hill Cipher Explained (with Example)
Hill Cipher Explained (with Example)
Aladdin Persson
20 Cryptography: Hill Cipher Python
Cryptography: Hill Cipher Python
Aladdin Persson
21 Interval Scheduling Greedy Algorithm: Python
Interval Scheduling Greedy Algorithm: Python
Aladdin Persson
22 Weighted Interval Scheduling Algorithm Explained
Weighted Interval Scheduling Algorithm Explained
Aladdin Persson
23 Weighted Interval Scheduling Python Code
Weighted Interval Scheduling Python Code
Aladdin Persson
24 Sequence Alignment | Needleman Wunsch Algorithm
Sequence Alignment | Needleman Wunsch Algorithm
Aladdin Persson
25 Sequence Alignment | Needleman Wunsch in Python
Sequence Alignment | Needleman Wunsch in Python
Aladdin Persson
26 Codility BinaryGap Python
Codility BinaryGap Python
Aladdin Persson
27 Codility CyclicRotation Python
Codility CyclicRotation Python
Aladdin Persson
28 Derivation Linear Regression with Gradient Descent
Derivation Linear Regression with Gradient Descent
Aladdin Persson
29 Linear Regression Gradient Descent From Scratch in Python
Linear Regression Gradient Descent From Scratch in Python
Aladdin Persson
30 Pytorch Neural Network example
Pytorch Neural Network example
Aladdin Persson
31 Pytorch CNN example (Convolutional Neural Network)
Pytorch CNN example (Convolutional Neural Network)
Aladdin Persson
32 Pytorch LeNet implementation from scratch
Pytorch LeNet implementation from scratch
Aladdin Persson
33 Pytorch VGG implementation from scratch
Pytorch VGG implementation from scratch
Aladdin Persson
34 Pytorch GoogLeNet / InceptionNet implementation from scratch
Pytorch GoogLeNet / InceptionNet implementation from scratch
Aladdin Persson
35 How to save and load models in Pytorch
How to save and load models in Pytorch
Aladdin Persson
36 How to build custom Datasets for Images in Pytorch
How to build custom Datasets for Images in Pytorch
Aladdin Persson
37 Pytorch Transfer Learning and Fine Tuning Tutorial
Pytorch Transfer Learning and Fine Tuning Tutorial
Aladdin Persson
38 Pytorch Data Augmentation using Torchvision
Pytorch Data Augmentation using Torchvision
Aladdin Persson
39 Pytorch Quick Tip: Weight Initialization
Pytorch Quick Tip: Weight Initialization
Aladdin Persson
40 Pytorch Quick Tip: Using a Learning Rate Scheduler
Pytorch Quick Tip: Using a Learning Rate Scheduler
Aladdin Persson
41 Pytorch ResNet implementation from Scratch
Pytorch ResNet implementation from Scratch
Aladdin Persson
42 Pytorch TensorBoard Tutorial
Pytorch TensorBoard Tutorial
Aladdin Persson
43 Pytorch DCGAN Tutorial (See description for updated video)
Pytorch DCGAN Tutorial (See description for updated video)
Aladdin Persson
44 Naive Bayes from Scratch - Machine Learning Python
Naive Bayes from Scratch - Machine Learning Python
Aladdin Persson
Spam Classifier using Naive Bayes in Python
Spam Classifier using Naive Bayes in Python
Aladdin Persson
46 K-Nearest Neighbor from scratch - Machine Learning Python
K-Nearest Neighbor from scratch - Machine Learning Python
Aladdin Persson
47 Linear Regression Normal Equation Python
Linear Regression Normal Equation Python
Aladdin Persson
48 SVM from Scratch - Machine Learning Python (Support Vector Machine)
SVM from Scratch - Machine Learning Python (Support Vector Machine)
Aladdin Persson
49 Neural Network from Scratch - Machine Learning Python
Neural Network from Scratch - Machine Learning Python
Aladdin Persson
50 Pytorch RNN example (Recurrent Neural Network)
Pytorch RNN example (Recurrent Neural Network)
Aladdin Persson
51 Pytorch Bidirectional LSTM example
Pytorch Bidirectional LSTM example
Aladdin Persson
52 Pytorch Text Generator with character level LSTM
Pytorch Text Generator with character level LSTM
Aladdin Persson
53 Logistic Regression from Scratch - Machine Learning Python
Logistic Regression from Scratch - Machine Learning Python
Aladdin Persson
54 K-Means Clustering from Scratch - Machine Learning Python
K-Means Clustering from Scratch - Machine Learning Python
Aladdin Persson
55 Pytorch Torchtext Tutorial 1: Custom Datasets and loading JSON/CSV/TSV files
Pytorch Torchtext Tutorial 1: Custom Datasets and loading JSON/CSV/TSV files
Aladdin Persson
56 Pytorch Torchtext Tutorial 2: Built in Datasets with Example
Pytorch Torchtext Tutorial 2: Built in Datasets with Example
Aladdin Persson
57 Pytorch Torchtext Tutorial 3: From Textfiles to Dataset
Pytorch Torchtext Tutorial 3: From Textfiles to Dataset
Aladdin Persson
58 Paper Review: Sequence to Sequence Learning with Neural Networks
Paper Review: Sequence to Sequence Learning with Neural Networks
Aladdin Persson
59 Pytorch Seq2Seq Tutorial for Machine Translation
Pytorch Seq2Seq Tutorial for Machine Translation
Aladdin Persson
60 Pytorch Seq2Seq with Attention for Machine Translation
Pytorch Seq2Seq with Attention for Machine Translation
Aladdin Persson

This video teaches how to build a spam classifier using Naive Bayes in Python, covering the implementation of a Naive Bayes classifier, data processing, and tokenization. The video provides a step-by-step guide on how to use tools such as pandas, numpy, and NLTK to achieve a 92% accuracy on training data. By watching this video, viewers can learn how to classify emails as spam or not spam based on feature vectors.

Key Takeaways
  1. Import necessary libraries
  2. Read email data set
  3. Build vocabulary from email data
  4. Tokenize words using NLTK
  5. Filter out non-English words
  6. Load email data into a matrix
  7. Split data into training and test sets
  8. Train Naive Bayes model on training data
  9. Make predictions on test data
  10. Classify emails as spam or not spam based on feature vectors
💡 The video demonstrates how to achieve a 92% accuracy on training data by using Naive Bayes for spam classification, highlighting the importance of proper data processing and tokenization in machine learning.

Related AI Lessons

Up next
Is Python Dead in 2026?| Truth About Python in AI Era | 90 Days Roadmap @FameWorldEducationalHub
FAME WORLD EDUCATIONAL HUB
Watch →