Sentiment analysis and prediction in Python | Live Code-Along

DataCamp · Beginner · 🧠 Large Language Models · 3y ago

Key Takeaways

This video covers sentiment analysis and prediction in Python using machine learning models, specifically focusing on text processing, feature engineering, and model evaluation. The video uses tools like Workspace, WordCloud, pandas, and scikit-learn to build a classifier for predicting sentiment from movie reviews.

Full Transcript

Today we're going to be talking about sentiment analysis. The idea behind sentiment analysis is that you have some text and you want to determine whether the author is saying something positive or something negative. This is really important for product reviews, since star ratings often end up with everything getting about a four-star rating, so the numbers aren't very useful. To find out what your users really think, you need to read the text of the review and get some quantitative information from that. Today's speaker to talk you through all this is Justin Saddlemeier. He's DataCamp's Workspace architect and a Python expert, so without further ado, over to you, Justin.

Thanks, Richie. Hi everyone, thanks for joining today. As Richie said, I'll be running through this session today. I currently work, as he said, as a Workspace architect. I also previously worked on some of DataCamp's courses and projects, and I have a background in psychology and marketing.

In terms of the agenda for today, I'm going to start by giving a quick overview of what Workspace is and how you can use it. Then we're going to dive into some hands-on coding in Workspace: we're going to perform a sentiment analysis on movie reviews and try to make predictions. Then I'm going to take questions at the end for about five minutes.

So what is Workspace? Workspace is a collaborative, cloud-based notebook where you can perform data analysis. It's collaborative, so you can share with others and work on analyses with others, and it's browser-based, which means that there's no installation required, so all of the hassle of installing libraries and packages is often reduced. We aim to have the most popular libraries pre-installed so you can just import and be ready to go. On Workspace we also have a curated list of data sets that you can use if you're not sure where to get started. These are labeled both by use case and by topic, so you can search for the
types of data that interest you, or perhaps the types of problems that you want to practice. Let's say you want to practice regression problems: we have tags for those, so it's a great way to get started. Many of these, when you open up a workspace with one of these data sets, also have pre-written questions to get you going, so if you're not sure where to get started, you can already get going with some prompts that we've designed. As I mentioned, Workspace is collaborative, so you can use it in much the same way as you would a Google Doc, which means you can edit with somebody, you can make comments, you can publish, and you can choose who sees your published work. You can also use that publishing feature to create a data science portfolio: when you publish your workspace or your notebook, you can choose to feature it on your profile, and by doing so you can turn your DataCamp profile into a portfolio of all of your work.

Okay, so that's the introduction to Workspace. Now we're going to jump into the code-along. I believe you'll already have access to this link, and if so you should be able to open up this workspace that I have open here. This is DataCamp Workspace. For those of you not familiar with notebooks, it's composed of text cells, or Markdown cells, like we see here with the white background. It also includes code cells, which you see here with the light gray background. In Workspace you have the option to add a code cell, which, depending on whether you've chosen R or Python, will add a Python or R cell. You can also add SQL cells, which allow you to query databases via integrations that you've set up. If you go over to the left-hand sidebar, we have this button, Integrations, and we can see integrations that have been set up to databases. In particular, there are eight sample integrations with sample databases that you can use to run queries. Today we won't be using that, we'll just be reading in a flat file, movie reviews.csv, but that option
exists. Although you will be coding along with me for most of it, I have written some of the code already. The first cell that we're going to run installs the wordcloud package. That's because, although we have many popular libraries pre-installed, we don't have all of them, and in this case we're going to want a word cloud. To run a cell, you can click inside of it and click Run, or you can use a hotkey. I'm going to use Shift+Return, which will run it, and you can see that the cell is running because it says "Stop execution" up at the top, and when it's done it should go back to saying "Run all". Okay, so now that it's finished, it has run. One note is that I've included a line here called capture (the %%capture cell magic) in this cell. This means that it's going to capture all of the output, so you're not going to see all the installation information, which is nice in that it keeps our notebook clean. I don't recommend you always do it when installing packages, because you're not going to see if there are any errors. In this case I'm confident that this works, so I'm not worried, but if you do try to install custom packages, you might not always want to use capture, because you won't see if anything goes wrong.

Okay, so in the next cell I have some of the imports that we'll be using today. We have pandas, which, as many of you will probably know, is a popular data manipulation library for Python. We have NumPy, which we may use later on, and matplotlib.pyplot, which is used for visualization. From the wordcloud package that we installed, we're importing the WordCloud class, which will allow us to visualize the word frequencies in our movie reviews. Next we have some imports to help us work with the text. One thing to note here is that there are a number of different libraries that you can use to work with text in Python. Especially if you want to do more advanced things, you'll want to look into libraries such as NLTK and spaCy. We have DataCamp courses on both. In this case we're
going to keep it a bit simple and use a class, TfidfVectorizer, which I'll go into later, to help us process the text in a simple way that will still get us reasonable results. But if you are interested in more advanced topics in natural language processing and more advanced sentiment analysis, you will probably want to leverage the other libraries that I mentioned. The other imports we have are just to help us fit a model to our data. We're going to split our data into train and test subsets, and we're going to use a random forest classifier. It's not necessarily the best model to use, it's just one that I found works; that's not as important as the previous steps. Then we're also going to import some metrics to help us interpret our results.

The next thing we're going to do is import our data within the cell. I'm going to use the pandas read_csv function, pd.read_csv, and we're going to read in movie reviews.csv. You can look at what files are available in your directory by clicking on the "Browse and upload files" button on the left-hand side. If you click on this, we can see the contents of our directory: we have our file, movie reviews, and we have the notebook, which is actually the workspace that we're working in now. With both, you have the option to rename or download them, you can copy the path to the clipboard, and you can delete them, which I don't advise in this case. But that's how we're going to read in our movie reviews data. We're going to assign it to a data frame, df, and then to view it we're going to use sample. One thing to note is that in Workspace you can actually just call the data frame that you want and it'll automatically print a pretty, interactive table that's paginated, so it'll show the first 10 results and then you can click over. So those of you who are used to working with pandas don't necessarily need to use head, you can just call the data frame. The reason I'm not going to do
it in this case is simply that you'll see that the text reviews are quite long and it'll clutter our notebook quite a bit. So I'm actually going to use sample just to randomly sample three entries from our data frame so we can inspect it. If I run this we should see our data frame. Here are three randomly sampled rows. You'll see that it's a relatively simple data set: it has the movie reviews in a column labeled text, and then we have the label, where zero corresponds to a negative review and one corresponds to a positive review. By chance we've just selected three negative reviews, but these should be a combination of zeros and ones, and we can confirm that in a later step.

The next step, and what I like to do, is to also take a look at the data types and non-null values. If you call df.info(), so we use the info method of the data frame, and run this, again either clicking Run or in my case using Shift+Return, we can print out the information about the data frame. Right away we can see that there are 4,000 rows and 4,000 non-null values in each column, so we already know that there are no missing values that we need to deal with, which is nice to know. Otherwise we'd have to use some additional methods to handle that data, either dropping it or performing imputation or some other method. We can also look at the data types. The object type implies that it's a mixed data type, which usually means that it contains string values, which is exactly what we'd expect for our text column because it contains written information. And then our label is an integer, so it's hopefully just zeros and ones, but we can confirm that. Now that we know that the data types are correct and there are no missing values, we can use the value_counts method. First we're going to reference our series, which is our label, and then we're going to use the value_counts method, and I'm going to set normalize to True, and
this will return the proportions rather than the absolute numbers. If we run this we can see that basically fifty percent are zero and fifty percent are one, which is nice. This means that the positive and negative reviews are balanced. If they weren't, we might have to consider other methods to handle that, because a severe class imbalance in a classification problem can often bias models to favor the majority class. Let's say we had a data set where there was only a small sprinkling of positive reviews: a model might be inclined, or biased, to predict that everything is negative and still maintain a high level of accuracy. We don't want that, and in cases like that you'll need to do something like resampling your data. You can downsample to reduce the majority class, you can upsample, and you can also use advanced techniques such as synthetic resampling. But in our case it's balanced, so we don't need to worry about that. We have basically the same number of positive and negative reviews, which means we can continue.

But before we continue on to preparing our data for a machine learning model, let's take another look at our data using the word cloud that we imported earlier. To do that we're first going to need to join all of the reviews into one gigantic string that contains all of the words across all of the reviews, and to do that we can use the join method of a string. I'm going to write a space enclosed in double quotation marks, then .join, and then in brackets we're going to pass in our text information, which is the text column of our data frame. I'm then going to initialize a word cloud object, and that will be done using the WordCloud class that we have here. There are a few things we will want to change from the defaults. One is that by default the background color on a word cloud is, I think, usually black, which I think is not as nice to look at, so we'll change that to
white. We're also going to set the stop words to the English stop words that we imported earlier. Stop words are just common words that aren't necessarily informative when performing natural language processing, or in particular sentiment analysis. Words like "the", "and", or "or" are going to be incredibly common, but if we're looking at frequencies of words, we don't really care that "the" is used x number of times in a document, so we want to strip those away from our word cloud but also from our analysis later. One note is that WordCloud has a default set of stop words, so if you don't specify this argument you will still get the stop words stripped away. I'm going to use the ones that we imported earlier with the scikit-learn imports just for consistency, because that's what we'll be using later, but you'll get similar results if you don't specify this argument. I'm going to set a width of 800 and a height of 400. Then, with our word cloud initialized, we just have to use the generate method, so .generate on our reviews, which is the string we created above.

Before showing it, there are a few things we'll need to customize to ensure that it shows nicely. One is to set the figure size larger, and this is one way to do it: we write plt.figure with figsize, and we set this to maybe 12 and 8. Then we're going to plot our word cloud, and you can use an argument called interpolation, which we can set to bilinear. That's just going to ensure that the word cloud shows a little bit more smoothly. It's not that important, but it helps for visualization. We're also going to call plt.axis and set that to off, otherwise you'll get the ugly x and y axes that you'd see on a normal matplotlib plot. And then we call show. So running this, provided I haven't made any typos, we should see our word cloud. Now, I think if I run it again the figure size should be better the second time. There
we go. I don't know why it didn't kick in the first time. So here we have a word cloud of the most frequent words across all of the reviews. Words that appear more frequently are larger, and words that are mentioned less frequently are smaller. Right away you'll see the types of words you'd expect: movie, film, character, what else am I seeing there, director, comedy. So it's very clearly a data set about movies, which means we have loaded in the correct data, which is good to know. Another thing to notice, though, is that a lot of these words might not necessarily be indicative of sentiment. Movie and film aren't that descriptive, time isn't really that descriptive, and character is also not incredibly descriptive, or at least intuitively I don't expect you'd be able to infer whether someone was writing a positive or a negative review from those words. So we're going to need to keep that in mind when processing the text. But otherwise it looks okay: it looks like there aren't really any stop words in there, so that worked correctly, and we're ready to go on and start processing our text to fit a model to it.

To process the text, there are a number of different ways that you can transform text data into a numeric format to be interpreted by a machine learning model, and there are lots of different ways that you can break down text and create features. I'm not going to go into all of them today, but if you're interested, there are techniques such as stemming and lemmatization, which allow you to reduce words to their root forms, which allows you to really clean up text. Today we're going to use term frequency-inverse document frequency, otherwise known as TF-IDF. It's essentially a way to calculate the importance of words in a collection of different texts, or documents, which in our case will be reviews. What it essentially does is assign a higher score to words that appear
frequently within one document, or one review, but that don't appear across other reviews or other documents. So it penalizes words that are incredibly common across all of them, but it'll assign higher scores to ones that are unique to a document. Hopefully that will help us hone in on important words that are loaded with sentiment and might be indicative of a positive or negative review. TF-IDF basically has the advantage of handling stop words, which we're also already going to deal with, since it applies a larger penalty to them, but it'll also penalize the types of words like movie or film, which we'd expect to be used quite frequently across all of our documents.

What we're essentially going to do is tokenize our data, so we're going to break it into smaller parts, in this case words, and then we're going to vectorize it, or basically turn it into a numeric representation. We'll get a look at this after we complete this code cell. The first thing we're going to do is specify a pattern to break apart our text, and we're going to use a regular expression. For those of you not familiar with regular expressions, they're basically patterns that you can use to work with text. They allow you to filter text, extract important parts of text, and search text. A nice use case might be if you have an enormous document and you just want to extract all of the phone numbers from it. Using regular expressions, you can define a pattern that would be, perhaps, three digits followed by a dash, followed by four digits, followed by a dash, followed by four digits, and then you could filter that document and extract every bit of text that meets the criteria you specified in your regular expression. They're a very powerful tool for working with text, and for any kind of natural language processing, sentiment analysis, anything like that, you will probably want to learn regular
expressions. We have a course dedicated to them on DataCamp, but I think they appear in a large number of the courses that we have on the platform. We're going to specify a pretty simple pattern today, just to extract things that meet the criteria of what a word would be. It's not going to be absolutely perfect, in the interest of simplicity, but it will work pretty well in breaking apart the text. I'm first adding an r to tell Python this is a raw string. In double quotation marks, I'm going to add square brackets, and this means find any characters that are within the square brackets. So we're going to do lowercase a-z and then uppercase A-Z, which says any characters that are lowercase a to z or uppercase A to Z. And we're going to add a plus sign to our pattern, and basically what this will say is find any groups of characters that are alphabetic, one or more of them, and break apart our text by that pattern, so just extract bits of text that meet that pattern.

The next thing we'll do is initialize the TfidfVectorizer that we imported earlier, and this will help process our text. We're going to specify our token pattern, which is the pattern we just created. Next we're going to specify the stop words, which again will be the English stop words we imported earlier. We're going to specify an n-gram range. What n-grams are is the number of tokens, or combinations of tokens: a unigram would just be one token, so basically an individual word in our case, and a bigram would be a combination of two tokens, or in this case two words. What this vectorizer will do, if we set this to one and two, is also return combinations of two words, so it's going to give the option of unigrams and bigrams. I think with the max features we're setting, we're not going to have any bigrams in our features, but I'm just going to allow for the
possibility of it. And then finally we're going to set the max features to 500. You can set it to more, but I think it might take a bit longer to run, so it's probably safer just to set it to 500 for now. With our vectorizer initialized, we're then going to fit it to our text data, so our data frame and then, with bracket notation, text. Then we're going to create a matrix from the vectorizer, and we're going to call this tokenized features. After having fit the vectorizer to the text, we're then going to transform the text, so we can use the vect.transform method and then pass in our text again. The final thing we want to do is create a data frame from it, so we can inspect it but also pass it to a machine learning model. The data is the tokenized features that we created, and we're going to use the toarray method to turn it into an array. For the columns, we're going to use the get_feature_names_out method. Okay, so what we're doing here is creating a data frame: the data is going to be the array that we created when we transformed our text, and the columns are going to be the feature names that were stored in the vectorizer. If we print this out we should see our vectorized data. What you'll see here is that we've now transformed each review into 500 features, each corresponding to a separate word, and should the review have that word, there's a TF-IDF score assigned to it. So we've now transformed our text data into a numeric format that we can put into a machine learning model.

Before we pop something into a machine learning model, I thought it would be interesting to add a few more features. We have now vectorized all of our data, but we can also extract some information from the actual reviews themselves, so not just the words that are in them but something like the length. As you can see, not all reviews are the same: different people write differently, so some people might use longer words than others.
So I thought it'd be interesting to actually use some information about the length of the review and the words that are used in it to add some additional features. I was inspired by this really cool article where the author created a bunch of different features, even more complicated and more advanced than what I'm going to do here; they even went as far as calculating sentence length. We're going to keep ours a bit simple. We're going to create a feature called character count, so basically how long the review is, and we're going to do that by taking our text column, using the string accessor, so .str, and then count, and then again we're going to specify a regular expression. We pass in r to tell Python to treat it as a raw string, so it doesn't interfere with the other ways that Python works. I'm going to use a backslash and a capital S, \S. This is a special regular expression character that matches anything that's not whitespace, so anything that's not a space or a newline character. So basically this is going to find all of the characters in a review, not including spaces, and we're going to count all of them and store that as a new column called character count. We're going to create another one called word count, and we're going to process the text in the same way, except this time we're just going to pass in the word pattern that we used earlier. We might as well just reuse it and count the number of words using the pattern we defined to extract words for our vectorizer. The final feature we'll create is average word length, and that will just be the character count divided by the word count. If we preview our data frame now, and I'll use sample again, we'll see that we've added three more features: how many characters there are in the review, the number of words, and the average word length. Because perhaps when someone's disappointed with the film they're less inclined to use long, extravagant words, or maybe happy people have less to
say so they actually have a shorter review. I'm not sure, but it doesn't hurt to include in our model.

Okay, before I keep going, I think I'll quickly take a few questions if there are any. So Richie, if you have any questions, I can...

Sure, yeah, we do have a few questions from the audience. The first one is a technical question. There are a few people struggling with errors about the wordcloud package not being installed, so could you just go back and review what the installation procedure was?

You should be able to just run this cell and it should work. If I take away the capture and run it, it probably says it's already installed, but if I take away capture you can at least see the log, and it should have installed. After having run it, you should then be able to import it using this cell, which was already pre-written.

Okay, so it's just a case of making sure that you run the first cell that includes the install command. For anyone who is now a bit behind because they couldn't get that working, there is a link to the solution notebook in the chat. It might be worth opening up the solution notebook, and you can at least copy and paste the first few cells so you can catch up to where everyone is now, and then you can join in for the second half. Okay, so the next question is from Lalit, saying: where does this label data come from? How would you go about getting those ones and zeros for the label column?

Yeah, that's a good question. I believe it's probably been manually labeled. I got this off of Kaggle, it was commercially available. I believe somebody has probably gone through and manually labeled it. I haven't inspected each one to know whether that's exactly what's been done, but I assume it's somebody who's read it, or maybe they inferred it from the numeric rating. I'm not sure exactly how it was labeled, so I'm afraid I can't answer that. Having recently done
one of my own classification projects, I had to go in and manually label a lot of data, so it's certainly a possibility, but with four thousand rows I'm not sure if someone went through and did that many.

Okay. One thing I'll say is you'll be able to find lots of projects on Kaggle with pre-labeled data, so if you want to practice on different data sets, you will often find it in this structure where something's been labeled, and hopefully there will be instructions on how it was labeled, so you'll be able to get that information.

Absolutely. It sounds like some pretty grueling labor, going through and reading all these reviews and deciding whether each is positive or negative. I'm glad that wasn't me having to do that. All right, so the next question comes from Jonathan, saying: if you have imbalanced text data, so lots of one kind of class and not very much of another kind of class, what resampling techniques do you recommend?

Yeah, so I think, unless there are advanced techniques, the synthetic ones I know about probably aren't that possible, and they have their own problems. If you have a lot of data you can do downsampling, so you can basically randomly drop rows from your majority class to try and get the balance in check a little bit more. You can also just randomly resample your minority class. That of course introduces problems, in that you're going to be introducing duplicate data. For the more advanced techniques, I guess after you vectorize it you may also be able to perform something like SMOTE, which is synthetic resampling. I think you can do that after you've vectorized the data, so that might also be an option: after vectorizing it, you can use a synthetic technique which uses neighbors to create synthetic examples of reviews. So I think it would be using, in our case, the vectorized data here,
and could create synthetic examples. So I think there are options even with text, not just standard numeric data.

And Justin, I know you recently created a Workspace template on resampling data sets, and I see Ade has posted that link in the chat. So if anyone is interested in resampling, it's the second-to-last link, called template python resample data set, so that's one to follow up on if you're interested in that sort of thing.

Yeah, that goes over a few of the different techniques that I just described.

Absolutely. And Jonathan had a second part to that, asking: how do you classify whether a social media post, like on Reddit or Twitter, is related to a particular topic? The example he gave was, is it related to mental health or not, though I think topics go more generally.

Okay, I think that's topic identification. I think that's a related topic, and it's one I'm not as familiar with. I think we actually do cover it in one of our courses, so I'd encourage you to follow up. You can correct me if I'm wrong, Richie, but I think that's covered in some depth in Introduction to Natural Language Processing.

Yes, certainly, we do have courses on topic modeling or topic identification in Python, and again there's a Workspace template as well. Adi has also posted a link to that in the chat, so that's the final link in the chat. It says topic identification with TF-IDF. All right, and we've got one final question before we move on. Lalit again asks: what exactly do you mean by stop words?

Ah yes, sorry, I mentioned it earlier but I should have elaborated. Stop words are just words that are incredibly common, that you would expect to find in basically any piece of text, but that aren't going to be that informative. If we go through maybe some of the initial ones: a, this, that, there, be. Those kinds of words are incredibly common, but they
will be common across positive and negative reviews, most often, I would expect. So when performing text analysis and processing, we often try to strip those out so that we're not loading unnecessary features into our model. You'll also notice we did it with the word cloud. We're not seeing words that are actually probably more frequent in our reviews, like "the" and "be" and "there", just because they're not that interesting to our analysis. We want to know what the most frequent words are that aren't just everyday speech. And sure enough, when we strip those out, we do see evidence that this is a movie review data set, but if we didn't, movie would probably be relatively far less frequent and would show up small, so that works out.

Yeah, could we just print out the English stop words variable in the notebook somewhere so we can see the words?

So if you look, they're very common words. That was a good idea, I appreciate that, it saved me verbalizing it. But as you see, they're words that probably won't be very indicative of sentiment and are just going to be incredibly common. Well, I don't know how common "hereafter" is, but most of them are common words, and they're not going to be very helpful for our sentiment analysis.

All right, super, that's the last of the questions, so I think we're ready to move on to the modelling section of this.

Okay, so the next thing we're going to do is fit a model to our data and evaluate its performance. We're going to create our features and call them X. We're going to use the pandas concat function. We're going to pass in a list, and we want to add the vectorized features that we created earlier, and we also want to add in the new features we created. So I'm going to take my data frame and use the loc accessor, so df.loc with square brackets. I'm going to pass in a colon followed by a comma,
and this says grab all of the rows and then i'm going to say from character count onwards so if i write the name of the column that i want to select character count and another colon it basically says select all of the columns from character count onwards so all rows and these three columns and we're going to concatenate or combine them along the first or rather the second axis since python is zero indexed so along the columns we're basically combining them so that we have our three new features and all of the vectorized features that we created our target variable is going to be a lowercase y and it's simply our label column then we're going to split apart our data so we're going to use train test split so x train x test y train y test i'm going to use train_test_split and we are going to pass in x y and test size we'll keep 25 percent of our data for evaluation or testing we'll set a random state of 42 just so if we rerun this we get the same split we're going to create a random forest classifier which is just an ensemble method which uses a number of decision trees to classify samples we're going to again set the random state to 42 we're going to fit it to our training data so using the dot fit method x train y train we're going to create our predictions using the variable y pred and we're going to use the predict method with our testing features that we saved and then we're going to print the classification report so basically the metrics of our predictions you'll need to enclose the classification report in a print statement or otherwise it won't render correctly now one thing to note is i'm using a very simple modeling process i'm not going to be trying out and comparing different models i'm not going to be doing any hyperparameter tuning that's beyond the scope of this code along what we're really just doing is we're just fitting sort of a 
simple model we're not playing with it too much and we're just going to see how it performs so if i run this right away we see that we have an accuracy of almost 80 percent so 80 percent of the time we're correctly predicting the sentiment of the review which isn't bad i mean we used sort of an out of the box vectorizer from scikit-learn we didn't do incredibly advanced text processing we added a few simple extra features and we're going to see how important they were but already 80 percent of the time we're getting it correct if we look at the precision and recall scores we can also see that it's not like we're having any issues with predicting one class or another so for instance the recall of positive reviews or one is 81 percent which means of all the positive reviews out there we're getting 81 percent of them so it seems to be performing pretty well we can also check this out with a confusion matrix not validate it but just check it out so if we use ConfusionMatrixDisplay and then use the from_estimator method we can pass in our classifier which is the random forest we initialized earlier we can pass in the x test data the y test data if i spell it correctly and set normalize to all what this will do is it will give us the proportions rather than just the raw counts of false positives and false negatives and true positives and true negatives we'll give this a title confusion matrix and we'll show the plot oh i had a typo okay so if i run this we get our confusion matrix so for those of you not familiar with it what a confusion matrix has is our predicted labels along the x-axis and our true labels along the y-axis so what we can see is that of all predictions 40 percent were correctly predicting negative reviews as negative 39 percent of our predictions were correctly categorizing positive reviews as positive so one and one we can also see the number 
of incorrect or rather the percentage of incorrect predictions we made so 12 percent of our predictions incorrectly predicted a negative review as positive so a false positive about nine percent of our predictions incorrectly predicted a positive review as negative but as i said before overall a pretty decent accuracy and not bad considering that in maybe what 30 minutes we loaded in the data inspected it processed it vectorized it and threw it into a machine learning model one final thing we may also want to do is visualize the feature importances so with a random forest we can also print out how much each feature contributed to the classification so we can see which features were essentially most important in predicting whether a review was positive or negative so i'm going to create a data frame just to help visualize it and in the data frame i'm going to pass in a dictionary the first column we want is the feature so we'll just use x dot columns so that is basically getting the names of the features in our feature variable x and then the other column we're going to want is the importance and i'm going to use rf.feature_importances_ so this is extracting the feature importances from our trained model be sure to follow that up with an underscore and that's how you access them and then so that's going to initialize our data frame but we also want to sort the values so if we sort them we're going to want to sort them by the importance so that'll be the numerical score reflecting the feature importances and we're going to want to put it in descending order we want the highest values first so i'll set ascending to false if i run this we get our feature importances so you can see the top ones are basically exactly what you'd expect to be indicative of a bad review i mean of course there's going to be instances where people say it wasn't bad or it wasn't the worst but overall when people say bad worst great and awful they're 
generally going to reflect what you'd expect it was a great movie it was an awful movie et cetera you'll see there's a t in there that's simply i believe a relic of the way we extracted tokens so we didn't allow for apostrophes we used a pretty simple regular expression token pattern but if you think of a word like don't or can't or won't it's going to split that apart at the apostrophe because we don't allow for those in our pattern so that's why t is going to be a token that we've extracted it may actually be quite common in perhaps negative reviews that's maybe why it's up there i won't see this again or something like that comes up but overall besides that one the rest i think are tokens that you would expect to see in positive and negative reviews so sort of confirmation that our model is using the features we'd expect if i scroll through i don't see the features we created so they may actually not have been that useful in the predictions so in this case maybe they weren't necessary but it didn't hurt to add them into it okay so that's basically it we went through we loaded in some review data we inspected it we visualized it a little bit with a word cloud to get an idea of what the contents of the reviews were like then we tokenized and vectorized the data to turn it into a numeric format so that a machine learning model could interpret the data we added a few more features just to see if perhaps they might improve the accuracy and then we fit a model and evaluated its performance and i mean there's probably room for improvement and there's lots of ways that you could look at improving this the method of tokenization could probably be improved you can try out different models you can try some hyperparameter tuning which is basically adjusting some of the default settings of the model to see if you can get 
a better performance as i said before you can also perform more advanced text processing techniques so like i said there's ways to handle words like don't or can't or won't you can reduce them down to their root words there's lots of different ways that we can probably squeeze more out of this model but i think 80 percent for a first quick run through isn't too bad okay so i will jump back in to the end as richie said there is a solution workspace which you can see afterwards and i think that'll be shared in the chat but for now i think i'm ready to take questions richie if there are any all right super thank you very much justin yes there are some more questions so one from medi asking can we apply the same process to multi-class classification so where we've got more than just zeros and ones if there are three or more different categories yes you can i think going over it would be a little bit beyond the scope of this today but yes certainly you can do multi-class prediction yeah so it's gonna depend on the kind of model you're using but for random forest absolutely it's gonna work isn't it all right you
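
The modeling step described in the transcript can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is randomly generated, and the names `vectorized`, `char_count`, `word_count` and `avg_word_length` are hypothetical stand-ins for the vectorized token counts and the three extra features mentioned in the session. Only the sequence of calls (`pd.concat`, `df.loc`, `train_test_split`, `RandomForestClassifier`, `classification_report`, `feature_importances_`) follows what the speaker describes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 "reviews" with a 0/1 sentiment label.
rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)

# Made-up token counts standing in for the vectorized review text.
vectorized = pd.DataFrame({
    "great": rng.poisson(0.3 + 1.5 * label),
    "awful": rng.poisson(0.3 + 1.5 * (1 - label)),
    "movie": rng.poisson(2.0, size=n),
})

# Data frame with the label and three assumed extra feature columns.
df = pd.DataFrame({
    "label": label,
    "char_count": rng.integers(100, 2000, size=n),
    "word_count": rng.integers(20, 400, size=n),
    "avg_word_length": rng.uniform(3.5, 5.5, size=n),
})

# All rows, every column from char_count onwards, joined column-wise (axis=1).
X = pd.concat([vectorized, df.loc[:, "char_count":]], axis=1)
y = df["label"]

# Hold out 25% of the rows for testing; fix the seed for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Wrap the report in print() so the text table renders with line breaks.
print(classification_report(y_test, y_pred))

# Feature importances, highest first.
importances = pd.DataFrame(
    {"feature": X.columns, "importance": rf.feature_importances_}
).sort_values("importance", ascending=False)
print(importances)
```

Because the data here is synthetic, the printed scores say nothing about the movie review data set; the sketch only shows how the pieces fit together.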

Original Description

In this live training, you will build a machine learning model to predict the sentiment of a review using the contents of the review. We will walk through all steps of the machine learning process, from importing the text data, tokenizing and vectorizing the text samples up to training a classifier and evaluating its performance.

To code along with the video in a pre-prepared workspace, go to bit.ly/sentiment_analysis_webinar

Chapter Timestamps
00:00 - Intro
03:27 - Load libraries and data
09:42 - Inspect and explore data
17:11 - Pre-processing the review text
36:15 - Fit and evaluate a model

Interesting Reads:
Should BI Analysts Learn to Code? | DataCamp https://www.datacamp.com/blog/should-bi-analysts-learn-to-code
Python 2 vs 3: Everything You Need to Know | DataCamp https://www.datacamp.com/blog/python-2-vs-3-everything-you-need-to-know
The Best SQL Jobs in 2022: Unlock new career paths with SQL | DataCamp https://www.datacamp.com/blog/the-best-sql-jobs-in-2022-unlock-new-career-paths-with-sql


This video teaches how to build a machine learning model for sentiment analysis in Python using pandas, scikit-learn, and WordCloud. The model is trained on movie review data and achieves an accuracy of almost 80%. The video covers text processing, feature engineering, and model evaluation, providing an introduction to sentiment analysis.
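
As a quick consistency check, the accuracy and recall figures quoted in the session can be recomputed from the normalized confusion matrix the speaker reads out (40% true negatives, 12% false positives, 9% false negatives, 39% true positives). The sketch below uses only those four rounded proportions, so the results are approximate; the variable names are illustrative.

```python
# Proportions of all test predictions, as read out in the session (rounded).
true_neg = 0.40   # negative reviews predicted negative
false_pos = 0.12  # negative reviews predicted positive
false_neg = 0.09  # positive reviews predicted negative
true_pos = 0.39   # positive reviews predicted positive

# Accuracy: share of predictions that were correct.
accuracy = true_neg + true_pos

# Recall for the positive class: share of actual positives that were found.
recall_pos = true_pos / (true_pos + false_neg)

# Precision for the positive class: share of positive predictions that were right.
precision_pos = true_pos / (true_pos + false_pos)

print(f"accuracy      ~ {accuracy:.2f}")
print(f"recall (1)    ~ {recall_pos:.2f}")
print(f"precision (1) ~ {precision_pos:.2f}")
```

The accuracy of about 0.79 and positive-class recall of about 0.81 agree with the "almost 80 percent" and "81" figures mentioned in the transcript.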

Key Takeaways
  1. Load and inspect the movie reviews data
  2. Tokenize and vectorize the text data
  3. Create features and split data into training and testing sets
  4. Train a random forest classifier and evaluate its performance
  5. Use feature importances to visualize feature contributions to classification
💡 The tokenization method and hyperparameter settings can affect the performance of a sentiment analysis model; the session used a simple token pattern and default model settings.
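
To illustrate how the tokenization choice matters, here is a small sketch of why a stray `t` token shows up among the important features in the session. Both regular expressions are assumptions made for illustration; the exact pattern used in the notebook is not shown in the transcript, only that it did not allow apostrophes.

```python
import re

review = "I don't think it was bad, but I won't see this again."
text = review.lower()

# Assumed simple pattern: runs of letters only, so apostrophes split words.
simple_tokens = re.findall(r"[a-z]+", text)

# Assumed alternative: allow one internal apostrophe, keeping contractions whole.
apostrophe_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)

print(simple_tokens)
print(apostrophe_tokens)
```

With the letters-only pattern, "don't" and "won't" become "don", "won" and two separate "t" tokens, which is how `t` can end up as a frequent feature; the second pattern keeps the contractions intact.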

