8. Text Classification Using Convolutional Neural Networks
Key Takeaways
The video demonstrates the application of Convolutional Neural Networks (CNNs) to text classification using Keras, covering topics such as word embeddings, 1D convolutions, and max pooling. The example project utilizes the IMDB dataset and GloVe embeddings to improve text classification performance.
Full Transcript
In this video we're going to talk about text classification with neural nets, but actually not with LSTMs. Not today. Today it's going to be text classification with convolutional neural networks, and I think it's super cool because, you know, it wasn't obvious to me at first how you would take the convolutions that we do on 2D images and apply them to text. But actually you really can, and you can take all the things that we've learned, all the intuitions we have with processing images, and use them on text. So things like max pooling totally apply to text. And this is really practical. A lot of people do this in the real world; this is something that Facebook uses to do a lot of their text classification. Also, in order to do it we're going to learn about embeddings, and I think embeddings are one of the most interesting, coolest topics in all of natural language processing. We'll go deeper on it in a later video, but the first time you see embeddings I think it's pretty cool.
So, you know, the big problem with using neural nets on text is that it's kind of hard to get things from text, which is arbitrary-length strings, into the API that I'm always talking about, the specific API for any neural network, which is basically a fixed-length vector of numbers. One of the ways to do it, kind of the most common way to do it circa 10 years ago and still super popular, is called bag-of-words, and we covered that in the first video that I did on text classification without neural nets. This is, you might recall, just taking each word and counting how many times it occurs in each document. So you basically transform the string into a vector where the length is the number of words in your vocabulary. The problem, of course, with this is that you completely lose the order of words. In most languages, definitely in English, the order of words actually matters, and so dropping that order, it's kind of amazing that classification can work at all, and certainly it can't work as well as it could.
So there's another transformation that we talked a lot about in the previous video, where we were generating text with LSTMs, and that's using the individual characters: basically one-hot encoding every character in the text. That kind of makes intuitive sense as a nice transformation, and we can actually pad out the characters to make all the text a fixed length of characters. But, you know, the problem there is that in English spaces matter a lot, right? The concept of a word is a pretty useful concept that we might want to pass into our neural network, and so by just passing in each character as a character encoding we're really making the neural network learn a lot about language. So it might be too raw, too extreme, unless we have truly massive amounts of data.
So for best results we really want something in between, and that's where word embeddings come in. Here you have an example: we have the sentence "I love this movie" and we're transforming each word systematically, deterministically, into a set of numbers. In this case we're transforming it into four numbers, so the word "I" always transforms into the same four numbers, the word "love" always transforms into the same four numbers, and we can do this with longer vectors, right? So in this table we have each word that we might have seen in our vocabulary and then the transform that each word gets turned into.
If we don't want to calculate the embedding ourselves, we can actually use some of the precomputed embeddings. word2vec is a famous one. Today we're going to use the GloVe embedding, which is something generated by Stanford on a huge set of data, and it actually has some amazing properties, incredible properties that would be worth a whole video to explore. You can do some research on your own, but basically if you take the actual embeddings, those are the actual numbers, for "woman", and you subtract from those numbers the set of numbers that got encoded as "man", and then you add in the set of numbers for "king", you actually get a set of numbers that's very close to the numbers for "queen". So this is really incredible, and what it shows is that these embeddings are actually encoding some semantic information about these words. And so using these embeddings, these numbers that are pregenerated by Stanford in this case and that you can download, can often make your models perform even better than trying to calculate the embeddings yourself. This is kind of like transfer learning but for words, if that makes sense.
So once we have these embeddings, once we've transformed all of our words into these fixed-length vectors (and it has to be a fixed number of fixed-length vectors, so we actually have to add padding words to make each document the same length), how do we turn it into a classifier? How do we make a convolutional classifier? What would that even mean? I think it's useful to go back and review what we meant by a two-dimensional convolution on images, right? So remember that with a 2D convolution we would take an input, multiply a block of values by a set of weights, and put the weighted sum of that block into a subsequent output image. Then we move that block over by one, or over by a stride, do that same computation with the same weights, and fill in the next block over in the image.
And remember that we could actually have multiple outputs. What multiple outputs would mean is that we start with the same image but we use different sets of weights, so as we slide the block over we're in each case multiplying it by different weights and then outputting multiple images, or sometimes they call it multiple channels. And then, you might have missed this, a lot of students kind of miss how this exactly works, but you can actually take in multiple inputs. So if we had three input images, in this case if we have a color image we might turn it into three channels, a red channel, a green channel, and a blue channel, we can do the same thing with a convolution. In this case we actually have three different blocks of weights, and we sum the result of the convolution of each block of weights on each one of the input channels, and we have a single output channel. So we can have multiple inputs and multiple outputs in this way.
Now in text we actually don't have a two-dimensional thing, we have a one-dimensional thing. So here you can think of that one dimension going across as the pixels of an image, and you can think of what I have as the Y dimension here as actually the different channels. So instead of taking a two-dimensional block we take a one-dimensional block across the pixels, say in this case it's length 3, and we take a weighted sum of each of the pixels. So in this case we would have three weights, and we multiply them by one of the channels, take that weighted sum, and fill in an output. Then we move that block one step over, or one stride over, to the right, do the same weighted sum on the new data from our embedding, and fill in the result in the next pixel over. We actually run that weighted sum across all of the channels, take the sum, and find one value. Now we could have multiple-channel output, and in that case we would just have different sets of weights for each of the channels that we're outputting. In this case we're actually going to learn the weights for all these different channels, and what this is going to do is combine the words into smaller values. It's in some sense going to give us information, or it's hopefully going to learn information, about pairs and triples and more of words.
You might remember with images we would do this thing called a max pooling operation, where we would take a block, typically a two-by-two block, and find the max of the pixels in that 2x2 region. Well, there's actually a really obvious 1D analogy to this, where we look at any particular channel and we take, in this case not a 2 by 2 but just a length-2 block (or it could be a different length), and we find the max, or in the average pooling case we find the average. With images this gave us a chance to find longer-range dependencies with our convolutions, and with text it's exactly the same thing. So we can actually build up the same structure that we had for classifying digits with 2D operations using 1D operations on our text. It's typical to have a convolution followed by some kind of pooling, followed by a convolution, followed by some kind of pooling.
So let's go to the code and see how this really works. As usual, go into the videos directory in ml-class, then go into cnn-text, and then open up imdb-cnn.py and let's take a look at what we have. The first 11 lines are just importing various libraries, and then lines 17 through 24 basically set some configuration parameters. One configuration parameter to point out here is the vocab size. That's a thousand, and that's because our embedding has to take a fixed-size set of words and figure out what they map to. So in this case we can only handle a thousand words, and any words that are less frequent than the top thousand are actually going to get removed from our data.
Line 26 loads in the IMDB data. The IMDB dataset is actually quite famous, and these are basically movie reviews, so if you run download-imdb you'll get lots of movie reviews. The idea is to classify, just from the text of the movie review, whether the movie was positively or negatively reviewed. Neutral reviews are actually removed, so all these reviews are quite clearly positive or negative. So we're going to magically load that data into X_train, y_train, X_test, and y_test, which you might remember from MNIST are basically the training input and the training target, and then the validation input and the validation target. In this case the y value is only 0 or 1, basically negative or positive; there are no neutral reviews in this dataset.
Lines 28 through 31 basically turn this text into numbers. The first thing that happens is line 28 sets up a tokenizer, and the important parameter passed into the tokenizer is the number of words that we're going to look at, so anything outside the top thousand most common words is going to get removed, and that was set in config.vocab_size. Line 29 does the actual fit on the text, in this case X_train, so that looks at what actually are the most popular words. Then lines 30 and 31 do the transformation from the actual strings into numbers. In this case it transforms them into a one-hot encoding based on the top thousand words, so the rows here are the individual words and the columns are words in the text.
Lines 33 and 34 take the X_train and X_test values and pad out the sequences. This essentially adds empty words to the text so the input to our model is all the same length. There is a maximum length that we have to give to this, and that's in config.maxlen. In this case the longest review that we're going to consider is a thousand-word review. We could try changing that to a longer value and see if it matters, but in this case all reviews are going to get truncated to a thousand words, and we have to set that to something.
Line 36 sets up the model, and then line 37 adds a new kind of layer that we haven't seen before, called an embedding layer. This embedding layer takes as input the vocab size, because each word is going to be an input to this, so we're going to take the top thousand words and find an embedding for them. The second input is the embedding dimension: the bigger we make the embedding dimension, the more numbers we're transforming a word into. If this is really big, our model might get too complicated and might overfit or something like that. If this value is too small, then we might lose the information in the words and our model may underfit the data. Then we add a dropout layer to prevent overfitting; you probably remember this from some of the image-based neural networks you've been building.
Then we add a Conv1D layer. This is just like the Conv2D layers we were using on digit recognition or fashion recognition or any kind of image recognition. Again we have this filters parameter, which is how many output channels this convolution layer has, and we also have a kernel size parameter, but instead of the kernel size being two numbers it's one number, because it's only a one-dimensional convolution. padding equals valid basically means that we do no padding in our convolution, so it's actually going to shrink our output a little bit, and activation equals relu means that we run a ReLU activation function at the end of this convolution. Then we have a max pooling layer, then we go back to another convolutional layer and another max pooling layer, and then, you might remember, we do a flatten and then a dense layer and then finally one more dense layer. So this is just like the digit recognition classifier that we built earlier in this series.
Now, because our model only outputs one number as opposed to two, but we're doing a two-class classifier, positive and negative, we have to use binary cross-entropy to properly calculate our loss, as opposed to categorical cross-entropy. We also use the Adam optimizer, as we've mostly been using throughout these classes, and we also want to output the accuracy metric so we know how well our model is doing in sort of a human-readable format. The last line calls model.fit with our X_train, which is the input matrix, and y_train, which is the classes, positive or negative sentiment. Our batch size is set in our configuration, our epochs is set in our configuration, and we also pass in our validation data.
Let's run this model. So for example here the validation loss is starting to go up and the accuracy is starting to go down, which means that this model may be overfitting. One thing to really be aware of is that the embedding adds a lot of free parameters. If we look at the actual structure of this model, there are a lot of parameters contained within the embedding itself, so there are some extra things that we've made it learn.
So it might be interesting to try to use the embedding that we can download from Stanford's website, the GloVe embedding that I talked about earlier, and you can see an example of how to do that in imdb-embedding.py. At the top it tells you the first thing you need to do is download GloVe from the URL that I give you, so we'll go ahead and do that. imdb-embedding is actually super similar to imdb-cnn, but there are a couple of new lines I inserted here where we open up the embedding file, which is actually in a super simple format, and we pull out the words and the numbers that these words correspond to. Then we take the embedding matrix, look at the words we have in our tokenizer, and set the values inside of our embedding layer to be exactly the embeddings that we got from GloVe. Then in line 61, when we add our embedding layer, we tell it its weights matrix by saying weights equals embedding matrix, and we set trainable equal to false, which reduces the number of free parameters, makes our model train faster, and potentially overfit less. I really like this because we're using the fact that someone spent a lot of time training these values, and they can make our model better and everyone else's model better. We can run this with python imdb-embedding.py.
Cool. So we learned about two really important things. The first is we learned how to use word embeddings, which is practical all over the place, not just in this application. It's super, super cool, and this is just one example of how to do it. The second is we learned how to take convolutions and pooling and all the things that we did on images and apply them to text in a really practical way to get high accuracy on the IMDB sentiment dataset. In the next video we're going to learn how to take LSTMs and apply them to the same dataset.
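The preprocessing idea described in the transcript (keep only the most frequent words, turn each word into a number, pad or truncate every review to one length, then look up a vector per word) can be sketched in plain Python. This is a minimal illustration and not the code from the video: the helper names build_vocab, encode, and pad, the use of integer word indices, the reserved padding index 0, and the toy 4-number embedding table are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(texts, vocab_size):
    """Map the (vocab_size - 1) most frequent words to indices 1..vocab_size-1.

    Index 0 is reserved for padding (an assumption of this sketch).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def encode(text, vocab):
    """Turn a string into word indices, dropping out-of-vocabulary words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad(seq, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq

texts = ["I love this movie", "I hate this movie so much"]
vocab = build_vocab(texts, vocab_size=5)
padded = [pad(encode(t, vocab), maxlen=4) for t in texts]

# A toy embedding table: one row of 4 numbers per index (values are made up).
embedding = [[0.0] * 4] + [[float(i)] * 4 for i in range(1, 5)]

# Each review becomes a fixed-size grid: maxlen rows of 4 numbers.
embedded = [[embedding[i] for i in seq] for seq in padded]
```

After this step every review has the same shape (here 4 words by 4 numbers), which is the fixed-size input a convolutional model needs. Left-padding and keeping the last maxlen items is also the default behavior of the Keras pad_sequences helper.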
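The 1D convolution and 1D max pooling described in the transcript can be written out by hand to make the arithmetic visible. The following is a small sketch in plain Python, not Keras code: the function names conv1d and max_pool1d, the toy input, and the all-ones kernel are assumptions chosen so the result is easy to check by eye. In a real model the kernel weights are learned.

```python
def conv1d(sequence, kernels, biases=None, stride=1):
    """Valid (no padding) 1D convolution over a sequence of vectors.

    sequence: list of T vectors, each with C_in numbers (one vector per word).
    kernels:  list of C_out kernels; each kernel is K rows of C_in weights.
    Returns (T - K) // stride + 1 vectors with C_out numbers each.
    """
    kernel_size = len(kernels[0])
    biases = biases or [0.0] * len(kernels)
    output = []
    for start in range(0, len(sequence) - kernel_size + 1, stride):
        window = sequence[start:start + kernel_size]
        step = []
        for kernel, bias in zip(kernels, biases):
            # Weighted sum over every position in the window and every input channel.
            total = bias + sum(
                w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row)
            )
            step.append(max(0.0, total))  # ReLU activation
        output.append(step)
    return output

def max_pool1d(sequence, pool_size=2):
    """Max over non-overlapping blocks of pool_size steps, per channel.

    A trailing partial block is dropped.
    """
    pooled = []
    for start in range(0, len(sequence) - pool_size + 1, pool_size):
        block = sequence[start:start + pool_size]
        pooled.append([max(channel) for channel in zip(*block)])
    return pooled

# Five "words", each embedded as 2 numbers (made-up values).
words = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
# One output channel, kernel length 3, all weights 1: each output sums a 3-word window.
kernels = [[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]]

features = conv1d(words, kernels)            # [[5.0], [6.0], [8.0]]
pooled = max_pool1d(features, pool_size=2)   # [[6.0]]
```

With five words and a length-3 kernel and no padding, the output shrinks to three positions, which is the shrinking effect the transcript attributes to valid padding.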
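The layer stack walked through in the transcript (embedding, dropout, convolution, pooling, convolution, pooling, flatten, two dense layers, binary cross-entropy with Adam) could look roughly like this in Keras. This is a sketch assuming TensorFlow 2.x is installed, not a copy of imdb-cnn.py: only the vocabulary size and maximum length of 1000 come from the transcript, while the embedding dimension, filter count, kernel size, pool size, dropout rate, and hidden size are placeholder values chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 1000     # from the transcript: top thousand words
maxlen = 1000         # from the transcript: reviews padded/truncated to 1000 words
embedding_dims = 50   # assumption, not stated in the transcript
filters = 64          # assumption
kernel_size = 3       # assumption
hidden_dims = 64      # assumption

model = keras.Sequential([
    keras.Input(shape=(maxlen,)),
    # Turns each word index into a learned vector of embedding_dims numbers.
    layers.Embedding(vocab_size, embedding_dims),
    layers.Dropout(0.5),  # dropout rate is an assumption
    # One number for kernel_size because the convolution is one-dimensional.
    layers.Conv1D(filters, kernel_size, padding="valid", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(filters, kernel_size, padding="valid", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(hidden_dims, activation="relu"),
    # A single sigmoid output covers the two classes (negative / positive).
    layers.Dense(1, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Training would then follow the pattern described in the transcript, e.g.:
# model.fit(X_train, y_train, batch_size=..., epochs=...,
#           validation_data=(X_test, y_test))
```

Calling model.summary() on a model like this shows how many parameters sit in the embedding layer, which is the point the transcript makes about free parameters.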
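The step of reading pretrained vectors and lining them up with the tokenizer's words, as described in the transcript, can be sketched without any deep learning library. GloVe text files store one word per line followed by its numbers, separated by spaces. Everything else here is an assumption for illustration: the function names load_glove and build_embedding_matrix, the in-memory stand-in for the downloaded file, the made-up vector values, and the hypothetical word_index mapping.

```python
import io

def load_glove(lines):
    """Parse GloVe-style text: each line is a word followed by its numbers."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, vocab_size, dims):
    """Row i holds the pretrained vector for the word with index i.

    Words missing from the pretrained file keep an all-zero row.
    """
    matrix = [[0.0] * dims for _ in range(vocab_size)]
    for word, i in word_index.items():
        if i < vocab_size and word in vectors:
            matrix[i] = vectors[word]
    return matrix

# Stand-in for an opened GloVe file; the numbers are made up, not real GloVe values.
fake_glove_file = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "love 0.4 0.5 0.6\n"
    "hate -0.4 -0.5 -0.6\n"
)
vectors = load_glove(fake_glove_file)

# Hypothetical word-to-index mapping, as a tokenizer might produce.
word_index = {"love": 1, "movie": 2, "zzyzx": 3}
embedding_matrix = build_embedding_matrix(word_index, vectors, vocab_size=4, dims=3)
```

A matrix built this way is what would be handed to the embedding layer as its weights, with the layer marked as not trainable, in the way the transcript describes.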
Original Description
Follow along with Lukas to learn about word embeddings and how to perform 1D convolutions and max pooling on text using Keras.
If you want to test your knowledge, try to use CNNs to improve our example project at https://github.com/lukas/ml-class/tree/master/projects/8-text-classification
Github repo: https://github.com/lukas/ml-class
See all classes: https://wandb.ai/site/tutorials
Weights & Biases: https://wandb.ai/site