Exploring Modern Sentiment Analysis Approaches in Python | Real Python Podcast #232

Real Python · Beginner ·🧠 Large Language Models ·1y ago

Key Takeaways

The video discusses modern sentiment analysis approaches in Python, covering traditional lexicon-based and machine learning approaches, as well as using specific types of LLMs for the task. It also explores various tools and Python packages for sentiment analysis, including VADER, TextBlob, and NLTK.
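
As a rough illustration of the lexicon-based approach mentioned above, the sketch below scores a sentence by looking each word up in a tiny hand-made dictionary and applying a single negation rule. The lexicon values, the `NEGATIONS` set, and the `NEGATION_FACTOR` are assumptions invented for this example. Packages such as VADER and TextBlob rely on much larger human-annotated lexicons and more elaborate rule sets, so treat this as a toy model of the idea, not as how either package is implemented.

```python
# Toy lexicon: word -> polarity score. All values are invented for illustration.
TOY_LEXICON = {"happy": 5, "great": 4, "love": 4, "sad": -5, "noisy": -3, "hate": -4}

# Words treated as negations by this sketch (an assumption, not a real rule set).
NEGATIONS = {"not", "no", "never"}

# A negated word is flipped and dampened rather than fully inverted.
NEGATION_FACTOR = -0.6


def score_sentence(sentence: str) -> float:
    """Sum the lexicon scores of the words, flipping any word that directly follows a negation."""
    words = [token.strip(".,!?").lower() for token in sentence.split()]
    total = 0.0
    for index, word in enumerate(words):
        if word not in TOY_LEXICON:
            continue  # words outside the lexicon contribute nothing
        score = float(TOY_LEXICON[word])
        if index > 0 and words[index - 1] in NEGATIONS:
            score *= NEGATION_FACTOR
        total += score
    return total


if __name__ == "__main__":
    for text in ("I am happy", "I am not happy", "The machine is noisy"):
        print(text, "->", score_sentence(text))
```

Because the output is a continuous score instead of a positive or negative label, results from a scorer like this can be averaged or tracked over time.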

Full Transcript

welcome to the Real Python Podcast this is episode 232 what are current approaches for analyzing the emotions within a piece of text what tools and Python packages should you use for sentiment analysis this week on the show Jodie Burchell developer advocate for data science at JetBrains returns to discuss sentiment analysis in Python Jodie has a PhD in clinical psychology we discuss how her interest in studying emotions has continued across her career Jodie covers three ways to approach sentiment analysis we start by discussing traditional lexicon-based and machine learning approaches then we dive into how specific types of LLMs can be used for the task we also share multiple resources so you can continue to explore sentiment analysis yourself this episode is sponsored by Sentry at Sentry we don't just build tools we use them to debug our trickiest slowdowns hear the whole story later in the episode and find others like it at blog.sentry.io all right let's get started [Music] the Real Python Podcast is a weekly conversation about using Python in the real world my name is Christopher Bailey your host each week we feature interviews with experts in the community and discussions about the topics articles and courses found at realpython.com after the podcast join us and learn real world Python skills with the community of experts at realpython.com
hey Jodie welcome back hi it's been it's been a while it's been almost a year actually yeah I know I've missed you aw same same so I'm excited to have you back on and we're kind of diving back into an area with the natural language processing you were on episode 119 the title of it was natural language processing and how ML models understand text today we're going to take that kind of a little bit further and so if there's some deeper research you need to know about the idea of you know how does NLP work you can kind of go there it's been a busy year yes you've been at a lot of conferences I don't know if you want to mention any conference talks or things that you've done recently or we can always list them at the end and share some but any shoutouts yeah so um again this year I've been mostly focusing on LLMs I decided to kind of go a bit more into maybe like specific pockets of of things that interest me so um kind of shout out my PyCon US talk this year was on hallucinations so if you want to know more about those that recording's already up I keynoted at PyCon Italia my first keynote I was so stoked to get asked awesome and I decided to go back to my psychology background which I believe we'll be talking about a bit today and I was talking about things like are LLMs sentient are they intelligent so just debunking a few of those myths yeah and then recently I did another keynote which is unfortunately not out yet but that one is about how we actually measure LLM performance and investigating how fragile that is so yeah you can see I've sort of gone back to my academic roots a bit and kind of going away from you know more of the applications and and talking more about the more kind of academic side of things so it's been it's been quite fun hard talks to write but yeah a lot of interesting stuff did you celebrate your keynote in Italy with a nice pasta dinner there I did I did and I will also say I I celebrated the keynote the second one in Porto with maybe a few too many
port and tonics and karaoke I I sang Britney Spears oh wow what's your go-to song Toxic oh yes there you go I love that song that's great yeah yeah all right well cool so uh we wanted to dig into a topic that is still really useful in the ML world of sentiment analysis and how we can kind of use that to derive information from written text and and and so forth and I guess with lots of the translation tools and stuff like that we can go beyond written text and use it in other ways so yeah I guess we could start maybe with a little bit of your background you said you wanted to talk about well you have a PhD so so yeah it it's kind of funny so sentiment analysis was let's say my gateway into NLP because okay so quick background on what I studied so I studied psychology as we've already mentioned that's what I did my PhD in and I was always super fascinated by kind of personality psychology clinical psychology and emotions relationships so I actually did my PhD in hurt feelings that's great I don't know if you know the song oh yes yeah no that was the theme song for my oh there you go okay Flight of the Conchords yeah Flight of the Conchords yeah got hurt feelings I actually did I did a radio interview during my PhD and they they played that as part of the interview oh nice there we go but yeah also did like another project on romantic jealousy so I like really still like the emotion stuff and when I kind of discovered that there were ways that you could automatically detect emotions with you know programming I just thought that was super cool so yeah yeah yeah I started off with sort of simpler stuff and we'll talk about those sort of approaches but as I've gone through learning more about LLMs obviously they have applications for this as well I promise this will not be an AI episode okay there will be a lot of other stuff as well but obviously stuff you can play with as we go as yeah the LLM stuff is is pretty accessible as well if you want to play with this um you can get models out
of the box that just work so nice yeah okay you wrote this thing about sort of measuring ekman's basic emotion types and um I had to look it up sorry um that's okay I love that they use the uh gosh what's the Pixar movie oh um inside out inside out yeah they have a lot of let's give you an animated way of explaining this which is kind of funny but this idea of not only what these sort of I guess core emotions are but one of the things I thought was interesting is well first off like you mentioned it a little bit what led you to studying emotions like yeah I I think it's because the thing that always fascinated me the most in Psychology is like how people relate to each other in really close interpersonal settings so romantic relationships is obviously kind of the core one sure but any sort of relationship where you have what's called an attachment Bond so it's basically like you have this sort of sense of being able to rely on that other person as a sense of your kind of psychological stability and I was always kind of fascinated like what happens when things go wrong with those relationships and you have emotional fallouts so you have jealousy and you have hurt right but I also just really like when I was a clinical psychologist there's this kind of like very complex relationships between your body and your emotions and your thoughts and there's different kind of schools of thought about how they relate to each other yeah but say in particular we were actually talking about mindfulness before um we started recording yeah we were part of kind of the idea of mindfulness is being able to sort of recognize the physiological changes that come with an emotion and emotions do have like kind of a signature on your body yeah yeah and recognize that before you react to it and then you know have the thought Cascade and the behavior Cascade so I always just found it really interesting because it's such a primitive part of who we are but they give us such useful information 
and they still very like an important part of our psychology so yeah I just that's cool I just find them fascinating yeah still still so many years later yeah yeah I know we can go down a massive Rabbit Hole here yeah one of the things I feel like might come up in this is this idea of they have this idea of a dimensional emotion classification and I'm like what does that mean like is that like how deep the emotion is or how strong the emotion is or like what what does that classification refer to yeah so it's more that instead of saying like you feel stronger or weaker fear or whatever you sort of break down aspects of emotion into different dimensions so okay the most common one is that you have polarity so polarity is sort of whether an emotion is positive or negative right um You have a arousal so it's sort of like how physiologically kind of worked up you get and then you have dominance which is the expression so it's sort of the idea that it's a combination of all three so you can have like okay a strongly negative emotion but you suppress it and then that would be sort of a negative polarity High arousal and high dominance so yeah it's just you get this sort of directionality it's very Vector if you will yeah yeah yeah yeah it's a 3D threedimensional Vector space it always come back that's cool that I guess that kind of is good then that kind of gives a way to kind of relate that information so yeah and I should just also say like the dominance model you can imagine how difficult it is to extract that from text so this is really not used as part of sentiment analysis but okay one aspect of it but you can you can feel those vibes in a in a one-on-one meeting with someone oh yeah yeah yeah yeah interesting I guess where do we start like where do we where do you want to dig in here you kind of have a few things listed we have a document we're kind of sharing here which I think is great and I feel like some of it is techniques that have been around for a while 
and then kind of moving into where more modern techniques have kind of developed yes so let's maybe first talk about it's probably something we talked about on other NLP episodes or you know other people have talked about it when working with text but something to be aware of is obviously a body of text is a complex document sometimes so it could be as short as a tweet sure or it could be as long as a novel right and so if you think about what kind of complexity of sentiment or topic or anything you expect to be expressed it's obviously going to be more variable the longer the document is and the more complex it is right so it's also really important to sort of think about okay at what level is it reasonable to try and analyze the sentiment of this piece of text like should I be doing it sentence by sentence is it is it reasonable to analyze the full text like maybe a review maybe you want to analyze the full review just to work out if it's positive or negative but there's also things like maybe you want to know how people feel about specific aspects this is called aspect-based sentiment analysis unsurprisingly yeah so it could be let's go back to the review example let's say someone feels very positively about I don't know say it's a coffee machine they feel very positively about the design but they feel negatively about the amount of noise it makes so you can see you're getting kind of more information here rather than they just didn't like it or they did like it and then yeah you can also kind of combine it with other like techniques so you might do something called topic modeling where you just classify what that sentence or what that document's about so then you know okay they're talking about the design of the coffee machine and they're talking about I don't know the type of coffee beans it takes I'm not really good at improvising that's okay yeah and so you could sort of say okay they're focused on functionality that would be the the topic so right how many
features does this machine have yeah I'm a big coffee person so I have this fancy Breville oh I've had it for five years now so it's actually paid for itself easily and I make espresso in it has its own grinder so I kind of I get what you're talking about there and all the things if I wanted to do like lattes and all that sort of stuff like those kinds of features so you'd be looking at sort of I guess associations then as far as like where the the words are kind of connected to like you said like like the overall performance but you know but noise might be another kind of category thing and then also like other features you're looking for and then I guess price would be another one that you could kind of say okay these words around price or whatever is allowing you to kind of divide up the sentiment yeah yeah and we'll kind of go through different ways of doing sentiment analysis so this will sort of overlap with what we're talking about but there are more and less sophisticated ways of extracting okay this information so yeah working with text is complicated that's what I thought when you mentioned like you could look at an entire novel and like oh my gosh that that would be really intense and you'd really have to be searching in there yes I I probably wouldn't recommend doing that I would probably say if you want to do that put it in a vector database but we can come back to that yeah we're much more likely looking at at our case of experimenting is like reviews or yeah yeah stuff like that yeah okay the first kind of technique I want to talk about was the first one I actually came across when I was looking at sentiment analysis and these are techniques that are called lexicon-based so they're based on a dictionary or a lexicon and initially when you come across it it seems like you know I always feel like the documentation on these packages is maybe like not that clear because I feel like it's almost written for an audience that already understands how
they work and so when I first came across them I thought wow like they they look so sophisticated and complicated but all they are under the hood I shouldn't say all they are so you know it's a lot of work that goes into it but yeah yeah yeah it's literally someone gets a dictionary of words a lexicon and then they give it to someone who will manually annotate each word with the sentiment associated with it so when we talk about sentiment with these lexicon-based approaches we're really just talking about polarity which I mentioned earlier also called valence which is how positive and how negative it is we're not talking about like specific emotions and yeah basically it's like okay the word sad has a sentiment of negative five or whatever the word happy has a sentiment of positive five whatever whatever the scale is yeah and then generally you will have someone who has some sort of understanding of linguistics involved in creating a rule set which will identify each of the words in the sentence match them to a lexicon and then combine them in some sort of generally sophisticated way so you know to take into account a negation if you say I'm not happy it will be like okay so we don't say happy is positive five here we say maybe it's negative three okay so this is the general idea and these are a super common way of doing sentiment analysis they're sort of the oldest way of doing it and they're still very successful there's a lot of Python packages not a lot but like a few commonly used ones and still work pretty well yeah I wonder about the process of keeping them up to date with the way language sort of changes or words being added to it I I have a tangent there's a book called Power versus Force have you ever heard of that book no it's a I don't know if it's a religious book or like a psychological book or whatever but it's basically has to do with the power that words have and it has a rating scale similarly to what you're talking about and like how like negative sort of stuff
and like one of the dividing lines is uh is pride pride is like this dividing line it's still sort of negative but entire military is run on pride and other things like that but it's you know and then like you know you get to like love and other kinds of things it's interesting book I don't know how valid things are in it entirely but it's interesting because it's one of the few times I've looked at something that did what you're talking about of like enumerating you know or like coming up with a scale for words and how they are in your life and how you use them and and relate to things so I always think about the process of the annotators as well so one of the packages one of the most well-known ones is called VADER okay and I did look up the the acronym because I know you love knowing what the acronyms stand for yes please this one stands for Valence Aware Dictionary and sEntiment Reasoner but in sentiment the E is stressed so it's the capital yeah I saw that it's so [Laughter] funny yeah it's a very much a bit of a backronym it's a very back yeah but calling it VADER is certainly convenient so all all applause to the authors for doing that yeah yeah but I remember like I read the paper that they put out with it and they actually use Mechanical Turk to get all the ratings so they had like multiple people doing the ratings and you know try to get more reliability that way yeah I don't know if they're actually actively updating that dictionary though but they did take a lot of care to try and include things like abbreviations like SMH for shaking my head and they also include emojis and stuff like that so yeah yeah I still have a hard time with some of them oh my God I'm too old to be like that up to date on texting sometimes like what I I remember reading this thing which was like um my grandma thinks that LOL means lots of love and I got a text from her saying your grandpa's in hospital LOL grandma oh no Grandma no with a with the heart sign next to it like whoa thought
things are a bit better between them but anyway yeah exactly what are you gonna do we gotta laugh yeah that's right so yeah the other package in Python that does this sort of lexicon stuff in terms of like sort of mainstream packages is TextBlob so okay does the same thing you input a sentence or a couple sentences and it will output a sentiment score the polarity okay interesting thing about TextBlob is it also includes a subjectivity score so they did the same thing like they got people to annotate how subjective or objective particular words are and then they combine them and so okay yeah it's kind of interesting what would be a use case kind of thing like where like where would that be better I'm trying to think yeah so okay think about if you got a bunch of reviews sure and you want to know if the review that you're given is like someone stating facts like I checked into the hotel and this actually happened to me yes there was a stain on the bed and the shower was leaking that would be very objective okay versus someone who is like this is the worst hotel I've ever stayed in right right I hated it because they're both useful you want to kind of you know right potentially capture both the I feel and the kind of like sort of up here kind of stuff as opposed to like down here looking at the thing specifically exactly and the objective stuff is more actionable as well right because again if we go back to the kind of topic modeling yeah we can fix that shower you can work out what's going on so yeah but I I find this really interesting because I've never seen this anywhere else that's cool yeah it's a quite cool package yeah it would make sense that you would want that in your tool belt maybe as a second pass on something to kind of give it an idea of like okay but this is or at least like okay I'm ready for all the subjective reviews um I'm going to look at the objective ones first and then because I can do something then I'll work out how much people hate or love me yes yeah
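
The polarity and subjectivity idea discussed above can be sketched with a toy lexicon in which every word carries two hand-assigned numbers. This is a minimal sketch, not TextBlob's actual implementation: the words, the scores, and the simple averaging rule are all assumptions chosen to mirror the hotel review example from the conversation.

```python
from statistics import mean

# Toy lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
# Every value here is invented for illustration.
TOY_LEXICON = {
    "worst": (-1.0, 1.0),
    "hated": (-0.9, 0.9),
    "stain": (-0.3, 0.2),
    "leaking": (-0.4, 0.1),
}


def polarity_and_subjectivity(text: str) -> tuple[float, float]:
    """Average the polarity and subjectivity of the known words in the text."""
    words = [token.strip(".,!?").lower() for token in text.split()]
    hits = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    if not hits:
        return 0.0, 0.0  # no known words: treat the text as neutral and objective
    polarity = mean(p for p, _ in hits)
    subjectivity = mean(s for _, s in hits)
    return polarity, subjectivity


if __name__ == "__main__":
    factual = "There was a stain on the bed and the shower was leaking."
    opinion = "This is the worst hotel I have ever stayed in. I hated it."
    print("factual:", polarity_and_subjectivity(factual))
    print("opinion:", polarity_and_subjectivity(opinion))
```

Both reviews come out negative, but the second number separates the factual complaint from the opinionated one, which is what makes it possible to sort the actionable reviews to the top.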
exactly hurt feelings feelings [Music] yeah this week's sponsor is Sentry at Sentry we hit a two-second slowdown in our API response times here's how we dug deep and fixed it fast while monitoring Sentry's backend we noticed our P99 response time meaning the slowest 1% of requests had spiked to over 2 seconds starting from our trace view we discovered our rate limit middleware was the culprit with our profiling tools we were able to trace the issue to a classic cold start problem luckily the fix was relatively straightforward we adjusted our configuration to warm up our servers ahead of time and our P99 latency dropped to 176 milliseconds a win in our book at Sentry we don't just build tools we use them to debug our trickiest slowdowns read the whole story and others like it at blog.sentry.io [Music] do you have an example of using VADER that like a a project you put it in place I do actually so I will be honest I have never used sentiment analysis for like a production project it's more just been for little toy projects experimenting and okay yeah but back when I was better about maintaining my blog I used to try and do a Christmas blog post and a New Year's blog post and I think I've only ever succeeded at doing this for three years so and I've been maintaining this blog for like eight years it'll tell you how okay how good are yeah but I did early on when I was like still getting into sentiment analysis one which was about how people felt about their New Year's resolutions so basically like okay you know I tried to do some very like it was very simple kind of topic modeling like it was literally just based on keywords that I thought would be associated with different themes and then tried to see like okay do people feel more positive about weight loss resolutions or travel resolutions blah blah blah so I think I used VADER for that VADER was like I was a big fan of VADER okay back in the day so what were your results do you remember I think I didn't think there was a big
difference between the different topics but also like my topic modeling was quite crap so that's probably part of it okay but um I would think that in the construction of them and and and statement of them they would generally be positive I want to do this thing as opposed to like the follow-up that rarely happens to like how did you accomplish them but it was also very interesting because I think it was probably the weight loss ones were quite like self-punitive a lot of the time like oh yeah okay like oh I feel so disgusting after Christmas I should really get back into the gym for New Year's okay and then you're like oh yeah okay it's sad yeah yeah I think that one was the one with the the worst sentiment and then I think travel was the one with the the best sentiment yeah yeah kind of high expectations yeah but maybe I misquoted myself because it's been like three years since I looked at these blog posts that's okay that's cool it's good to hear I mean that it's an interesting uh way to use that I would wonder it makes some sense where TextBlob would be a little bit handier you know in sort of different situations there with the sort of the reviewing things and and kind of trying to get almost like a bit of a temperature check with it so yeah yeah yeah and like you can like like there's a lot of sort of uh social media monitoring software and a lot of them will have sentiment analysis built into them and I'm sure it's probably based on these sort of packages let's say you can get charts which will show you the average sentiment of tweets or whatever talking about your company over the last month so yeah it it is actually this is probably one of my favorite applications of sentiment analysis because it's just so useful but then obviously like I think it's super cool being able to use it as a company to be like hey like we genuinely want to fix things with our product or right the way people are seeing our company so we can extract this information yeah I wonder
about the the numeric scales that a lot of people use there's this NPS thing that I've been a part of because I've been in retail for a long time yeah off and on this idea of a net promoter score and I I generally feel people you should just give them a good or a bad I don't think having five star is is is worth doing and this is totally an opinion because I feel like people are like well what do I do with that and they pretty much just go one way or the other even if it's going to be like a a minor thing that they found wrong they're like one yeah yeah because they want it to have attention you know to it which is interesting so I feel like maybe this information you could then take the words with that and kind of balance it out yeah so and and the actually the interesting thing about these lexicon based approaches is they will generally give you continuous scores so because they because you know you can see Trends yes exactly so it's not just we predict that it's positive we predict that it's negative it's that it's more or less positive or negative um compared to these other techniques we're going to talk about they generally don't it's generally it's okay it's either positive or it's negative or you know very positive very negative whatever so okay yeah it's a useful thing about them nice so that kind of gets through most of the Lexicon stuff which is kind of like the first approaches that people might do for this kind of thing probably good to practice these just to kind of understand what's how they kind of work and how to set them up would you suggest people still play around with them oh yeah yeah yeah for sure and like you can easily get I think like think there tutorials like very easily available I'm actually about to publish a blog post it probably in a couple of weeks it might be in time depending when it comes out so yeah yeah all right we'll make sure we got it a tag there at least or at least link to your blog so people can check it out yes oh this 
will be PyCharm so this will be on the official blog so but yeah basically in those blog posts I've got some snippets of code on how to apply all of the techniques or almost all of them that I talk about cool but you know if you don't want to wait for the blog post or you don't want to read it there's also plenty of places you can look up and you'll see like these are literally like you import a method you apply that method to the text and it just spits out the results they're super easy to use can I ask you a question about when these packages say that they support multiple languages are there varying results with that like I feel like English is I'm privileged you know I live in an English country and most programming is in English and and so it's like very centric and so I have this myopic view of like well what what happens when you try to look at reviews in in in other countries and and Real Python is an international site so we have a lot of people visiting from other countries and occasionally things come in in different languages but it's rare but I wonder about like I guess there's a couple questions there like okay well do they support you know how many languages you know and and then how well is that done and is you know are there issues with that yeah this is such a great question so basically because the lexicon-based approaches all the dictionaries are handcrafted right okay there are two ways of doing this so as I understand it TextBlob has at least one project in another language which is textblob-de which is TextBlob in German okay from what I understand they have actually created a new dictionary and a new rule set specific to German because obviously the relationship between words is quite different so I can actually give you an example from German because okay supposedly I speak some German yeah yeah you need to but say we were talking about the negation example right so in English we would say I am not happy right so okay but you could
say uh in German like I am not happy you could literally translate that but you could also put it in a verb form so God can I actually is it FY I think I'm just making something up but you can put in a noun form sorry so in that case the negation is not nicht it's kein so I have no happiness would be the translation yeah so you can sort of see that the verb form and the sorry not the verb form the adjective form and the noun form have quite different grammar rules and so doing a direct translation may not actually work so that makes me think of the topic of stemming we talked about way back then exactly yeah yeah yeah and and and even in English like feel goes to felt which is not easy to think of buy bought buying buyer it's like okay well there's things there but like bought is like so out of the loop there you know versus like a borrow borrowing borrower kind of thing like that like you can kind of think of the stem and so I wonder about that in other languages where like the construction might be completely different like Romance languages follow some nice rules generally like you can get around Spanish to French to Italian and structurally there's a lot of similarity there but like you go to German and it's like boom you blew it up and English is like so random yeah but but even with Spanish like I am not happy would be no estoy feliz so it's not even like the no is not next to feliz again so oh it's it's yeah the beginning of yeah so it's huh okay yeah so what I do know is that there is a project called vader-multi to support multiple languages okay it does use the Google Translate API and to be honest the Google Translate API is good it's based on you know large language models I'm sorry they're already coming back in right so it can probably handle a bit of the the vagaries and weirdness here but then how that works with the rule set which I assume would be designed for English and then whether you know this the translation of the word has the same implications like yeah
sentiment wise in different languages like I just feel like it it wouldn't work perfectly I feel like you need to handcraft this and yeah it's expensive and you need linguists and you need people who know what they're talking about to do this sort of stuff yeah I mean it'd be good to have people to check it out you know run it through its paces and see yeah yeah actually and you know I think about like sarcasm and other things that are going to be like maybe just fly right past that sort of stuff really doesn't work um but if we have any listeners who natively speak another language and you try vader-multi and you compare it with English that would be very cool yeah yeah that'd be interesting yeah yeah I just wondered about that so have you have you practiced these tools with other languages other than English yeah so mostly done so one of my NLP jobs we worked across multiple languages right okay so again it wasn't sentiment analysis but it was more um and this is actually very good lead-in kind of text preparation like tools things to do things like stemming or lemmatization or removing stop words and things like that yeah so there are definitely tools that exist across multiple languages for those and the really good thing is is you generally don't need to handcraft them yourself right like you don't need to come up with a list of stop words for German or for Spanish or you're going to get 90 some odd percent of it through automatic means or no no no it's it's that people have taken the time to compose this for you so you'll have some sort of expert in the language and they will have say as part of NLTK that'll be part of the package yeah created a stop word list that is appropriate for uh German say okay so great yeah and and to be honest they work quite well like a lot of the normalization tools like especially some of the lemmatization tools which just to remind anyone who hasn't listened to that episode yeah yeah it's basically taking all of the different grammatical
variations of a word and reducing it down to its its base form um and it does it by applying proper grammar rules and so obviously it's extremely language dependent and yeah like a lot of lemmatization tools are super sophisticated and they do this very well for the three languages I can sort of understand it for yeah okay but yeah that is a good segue to talk about our next way of doing sentiment analysis which is using just machine learning models this is very broad but the general idea is that you have a bunch of text and you have an emotion label associated with it again is usually just going to be positive or negative or it could be okay very positive positive neutral negative very negative or whatever some combination thereof I haven't really seen any of these models using emotions like fear anger joy but there's no reason you couldn't do that if you've got the training set you can do that okay and then you process the text like we talked about you could just base it on the words and this is called a bag of words approach so if you want to learn more about that I think Real Python has an article about it and then we also talked about that in an earlier podcast yeah I have a couple links to a couple tutorials where we kind of go into specific sentiment analysis stuff uh one is called first steps with Python's NLTK library and then the other one is uh about classifying movie reviews and so I'll include links to both of those so people could go and then I did have an episode really early on episode 36 where I had uh we talked about sentiment analysis Fourier transforms and kind of diving a little bit further into you know some of my first data science episodes so so yeah generally the idea is you use the information in the text in some way so it could be the words in the text in which case you do those normalization steps we talked about or it could be something a bit more sophisticated like you convert the document into what's called a document embedding this is just a way of like you
just take, basically, a large language model and you convert the document into some sort of vector representation, and the model will do all the magic for you. Then you pass that into your model, and it will predict whether the text is positive or negative or whatever.

So the advantage of using this technique is that all of those problems we talked about, you know, there's no language support for your particular model, or maybe the thing doesn't work for your specific domain, can generally be solved with this approach, because you're handcrafting a model that's specific to your use case. Now, the disadvantage is you have to come up with a training set, which means you're probably going to need to make the training set manually. And training machine learning models is not always super easy.

So do you end up doing a divide kind of thing, where you're like, well, I want to analyze this group of stuff, and to be able to train it you would maybe do a split situation? Of course, you're losing some of the sentiment that you're looking for in that data there. Is that the best approach?

Yeah, absolutely. It's just your classic machine learning training pipeline. What that involves is you prepare the whole data set: say you take 50,000 reviews or something and you hand-label all of them. Then you split that into three. You'll have the set that you use for training, you'll have a set for testing intermediate models, so that's your validation set, and then you'll keep one set aside, which is your test set. That's basically the one that you test your final model on, and you just want to know: is the performance on this final set going to reflect how it's going to perform in the real world, when I actually want to apply this to real reviews? So it's standard machine learning practice with
all of its, yeah, work and complexity.

So in terms of tooling for this, it's more your classic Python tooling for natural language processing. You've got packages like NLTK, the Natural Language Toolkit. You've got scikit-learn, of course, for building models and also preparing the data. And you've got spaCy, which is a much more sophisticated way of approaching this. The tooling pipeline is really well developed and allows you to do a lot of this lemmatization, stemming, and so on in a much more, say, production-ready way. It's just packaged up a bit more nicely, more clearly.

I'm intrigued by the subtitle on their website, "Industrial-Strength Natural Language Processing." What do you think they mean by that? Like, it's much more ready to take on these larger tasks, or what have you?

It's partially about the size of the data, but it's more about this: when you build any sort of data processing, you have a pipeline. So they've taken more of a data engineering approach, I think, to how they've built the code. Everything is built on the basis of having end-to-end pipelines, where it will take in a raw piece of text and spit out the processed text that you need. You can tack on sentiment analyzers like TextBlob as part of your pipeline, you can tack on models, you can tack on embeddings. It just tries to take everything that you might need to do to a piece of text and make it part of one coherent pipeline. So it's more stable, it's more reliable, and it's more powerful as well. The possibilities of what you can do are very wide.

That's cool. It sounds like, I forget the term, but in my world of music production it's like you can plug in all these different things and insert them in different areas, you know, choose your own adventure, as far as, okay, I want to have
these things go. I mean, I guess they might have a specific order that you might want to approach it in, but it's designed, pipeline-wise, to be something where you can grow and expand or add the tools that you need to it.

Yeah. I should also mention, the two founders of spaCy, they're friends of mine and live in Berlin. They're actually coming over for dinner tomorrow.

Oh wow, all right. Well, say hi. That's cool. It sounds like a really interesting package that's built upon a lot of these other background tools that we've talked about up to now, and it definitely looks like it has support for lots of other languages. I mean, literally, one of the first choices is what language you want to work with.

Yeah, exactly.

Which is nice. Have you been using spaCy for a while?

Yeah. So in the job I was talking about, where I was having to do multilingual text processing, we all used spaCy. I started with scikit-learn, and I actually still love scikit-learn, but when you have to do more complex stuff, spaCy is definitely better, and I think the code's a bit more readable when you come back to it.

Nice. Yeah, I think that idea of it being modular would make it rather readable. You can see where you're doing each of the steps when looking at it.

[Music]

This week I want to shine a spotlight on another Real Python video course. It's based on a topic related to our conversation this week and may help you get started in the world of natural language processing. It's titled "Learn Text Classification With Python and Keras." The course is based on a tutorial by Nikolai Janakiev, and instructor Douglas Starnes takes you through getting started with scikit-learn and the Keras package, defining a baseline model, how to use pre-trained word embeddings, determining the mood of a piece of text through sentiment analysis, what convolutional neural networks are, and learning how to tune hyperparameters. If you're interested in
exploring natural language processing and sentiment analysis, I think this is a worthy investment of your time. And like all the video courses on Real Python, the course is broken into easily consumable sections. Plus, you get additional resources and code examples for the techniques shown. All of our course lessons have a transcript, including closed captions. Check out the video course. You can find a link in the show notes, or you can find it using our search tool on realpython.com.

[Music]

So this approach, what would you say are the immediate advantages to it over the older systems?

Yeah, so the main advantage, as I said, is you can tailor the sentiment analysis exactly to the data that you have. And it also means that if you do want to create an emotion classifier model, you can do that. It's basically whatever data you can prepare.

But it may be more work than some of the other ones? Or it's a different form of work?

No, it's definitely more work, because you've basically got to train a machine learning model and prepare the data, unless you already have it. The downside as well is that it's probably going to be slower. Maybe, maybe not. But the lexicon-based approaches are incredibly fast, whereas depending on the model you build, it might be a tad slower.

Okay. I'm trying to think, what's making it faster or slower? I mean, obviously building the model is a task. It's going to take time and a lot of preparation. Once that's done, is it fast?

It depends on the model. Generally, models that require more calculations can be a bit slower, but unless you're talking about, like, a neural net, it's probably not going to be a huge difference in terms of performance. Which is probably a good way to start talking about LLMs.

Yeah, where we're
headed. We always come back.

Yeah. So I think I've probably talked about this on the show before: I'm an AI skeptic. I want to put this up front, so I'm not going to be trying to sell you on LLMs being the best solution for this. They're just the latest solution, and they're a possible approach, but they're not necessarily going to be the best approach. I'm not going to get into what LLMs are. We've covered that on previous episodes.

Yeah, we've kind of built up to them in our descriptions, you know, NLP to LLM, if you will. They've had an interesting couple of years here, as far as changes and what's happening with them and places that people are trying to put them. But this is an interesting one. I don't hear about them being used as much for sentiment analysis, so I'm intrigued: how do you turn it in that direction, as opposed to chatting with it?

Yeah, so I'll probably just give a little bit of context about the different implementations of LLMs. We've talked about LLMs, as I've already said, but basically the LLMs that we're looking at at the moment, like a lot of the chat ones, are what are called decoder models. They're based on a part of the LLM architecture, and they're designed to predict next words. So you create these models, and then you do a process called instruction tuning. Instruction tuning is where you take an LLM where all it does is spit words out until it comes to a stop token, and then it stops, and you train it to expect questions and to output answers. So you take a model that's already been trained, you input questions, and you make it predict these gold-standard answers. You try to get it to output a sequence that's as close as possible. These chatbots that you're talking to, ChatGPT or whatever, are instruction-tuned models. But you don't have to
instruction-tune models. You can do whatever you want, and something else you can do is this tuning, it's called fine-tuning, so that the model can predict emotions. Like I said for the machine learning model, you create a data set which has some sort of text and some sort of emotion attached to it, and then you feed the text in and you say to the model, you need to be able to predict this emotion.

Now, to go back a step as well, I should explain that these don't have to be decoder models. They're actually more often what are called encoder models, which are trained to do different things. They're not necessarily trained to spit out the next word.

They're designed to analyze by, like, encoding the information, pulling the information in and looking at it?

Their design depends on the model. A long, long time ago we talked about BERT, which was actually the first ever encoder model. Basically, it was designed to do two things. It was presented with two sentences, and it was asked to first predict a missing word in the sentences, and it was also asked to predict what's called entailment: which sentence preceded the other, are they in the correct order? So an example could be "I live in France. I speak [blank] and English." The entailment is correct, because "I live in France" should go before the second sentence, and the missing word would be "French." There's a little bit of historic reasons why they're trained on these particular types of data, but the general idea is you're just trying to force these models to learn how to deal with natural language. Learning those two tasks really forces the model to learn, you know, what is the order of sentences, what's the logical sequence, what words belong in different contexts. So there's nothing special about this task. It's just a way of training these models. Next-word prediction is just another way of training these models, because it scales super well and it's really
easy to create training data compared to this entailment task I talked about.

Okay. So in our case we're going to do that with sentiment, in the sense that we're going to let it know what we think the sentiment is?

Yes. So once we've trained up this encoder model that's really good at understanding language, and it doesn't have to be an encoder model, but it usually is, basically what we can do is say: okay, I have, say, this book review, and I think that the sentiment of this book review is very negative. So I want you to take in that text, take everything that you understand about how language functions, and see if you can understand why. Then you pass in the next one: I think this is positive, see if you understand why. And you can also do that with specific emotions. So you can be like, I feel like the description of this book is mostly evoking the emotion of joy, see if you can understand why. Through being exposed to maybe a few tens of thousands of examples, hundreds of thousands maybe, the model can then understand: okay, if I see this text, I can understand the specific sentiment associated with it, based on what I understand about how language works. So this is the general idea, and the good news is you don't have to fine-tune it yourself, because people have nicely done this for you.

Thank you, yes. I wonder if this is somewhat different than RAG, which I've talked about on the show, where you're taking an existing model and you're augmenting it with a domain space. I had somebody talk about working with a company that had this set of almanacs that were all about, you know, weather and when you should plant what food where and all that sort of stuff. It's very, very domain-specific information that they wanted to be able to have in here, and then if the LLM can find that information within that data set, it could provide it. But also you
could say, if the question is not within this domain, don't try to, you know, hallucinate something out of it, which I thought was interesting. So I wonder, is it a similar process, would you say, or is there a unique kind of tuning that you're doing here? It sounds a little more piecemeal, as opposed to vectorizing a database of stuff.

Yeah, so it's more that they have different purposes. I'll be a little technical here, but not too technical.

That's okay.

So basically, LLMs contain what's come to be called parametric knowledge. I think we talked about it on the hallucination episode. Parametric knowledge is just information the model has seen enough times during training, and the model is big enough to be able to represent it. It means that if it comes across something that looks similar, it can just, you know, vomit up this sequence that it's memorized, or something close to it. The problem, of course, is that models don't learn everything, because some things will be rarer, but they also will not be exposed to certain things, like maybe 17th-century farmers' almanacs, as part of their training. So this is the limitation you have.

Now, we're also talking about two different tasks. Text classification, which is what we're talking about with sentiment analysis in this context, is like, we don't want to be able to generate anything from the whole world of things that the model could tell us. We just want to predict these five labels or these six labels or whatever.

Right, we're almost doing analysis versus more of a generative type of thing.

Yeah, it's more of a closed task, as opposed to something completely open. And so pretty much the easiest way to squeeze more performance out of the model is you
are introducing new information by doing fine-tuning, in the same way that you are when doing RAG, but you're focusing the model so that it can only produce these five possible labels, and so it's got less potential for error.

Now, the interesting thing is you can actually get these models that just do text generation to do this prediction as well. You can say to them, hey, here are five possible labels and here's a piece of text, I want you to classify this text, and certain models can actually do this relatively well. This is called zero-shot classification, in that you basically haven't given it any examples. You've just said, here is the task, please do it. Now, RAG, coming back
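
The zero-shot idea described in the transcript (hand a text-generation model a closed list of labels plus a piece of text, with no worked examples) could be sketched roughly as below. This is an illustration under assumptions, not anything shown in the episode: `generate` is a hypothetical callable standing in for whichever model you would actually call, the prompt wording and the five label names are made up for the example, and `fake_generate` only exists so the sketch runs without a model.

```python
from typing import Callable, Optional, Sequence

# Illustrative label set; the transcript only says "five possible labels".
LABELS = ["very positive", "positive", "neutral", "negative", "very negative"]


def build_zero_shot_prompt(text: str, labels: Sequence[str]) -> str:
    """Describe the task and list the allowed labels, giving no examples."""
    label_list = ", ".join(labels)
    return (
        "Classify the sentiment of the text below.\n"
        f"Answer with exactly one of these labels: {label_list}.\n\n"
        f"Text: {text}\n"
        "Label:"
    )


def parse_label(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Map a free-text model reply back onto the closed label set."""
    cleaned = reply.strip().lower()
    # Check longer labels first so "very negative" is not read as "negative".
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in cleaned:
            return label
    return None  # the reply did not contain any allowed label


def classify(
    text: str, labels: Sequence[str], generate: Callable[[str], str]
) -> Optional[str]:
    """Run one zero-shot classification through a hypothetical model callable."""
    return parse_label(generate(build_zero_shot_prompt(text, labels)), labels)


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned reply."""
    return " Very negative."


if __name__ == "__main__":
    review = "The plot dragged and the ending made no sense."
    print(classify(review, LABELS, fake_generate))
```

Restricting the reply to a fixed label list is what keeps this a closed task: anything the model says outside the list is treated as no answer instead of being passed along.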

Original Description

What are current approaches for analyzing the emotions within a piece of text? What tools and Python packages should you use for sentiment analysis? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to discuss modern sentiment analysis in Python.

👉 Links from the show: https://realpython.com/podcasts/rpp/232/

Jodie has a PhD in clinical psychology. We discuss how her interest in studying emotions has continued across her career. Jodie covers three ways to approach sentiment analysis. We start by discussing traditional lexicon-based and machine-learning approaches. We then dive into how specific types of LLMs can be used for the task. We also share multiple resources so you can continue to explore sentiment analysis yourself.

This week's episode is brought to you by Sentry.

Topics:
- 00:00:00 -- Introduction
- 00:02:31 -- Conference talks in 2024
- 00:04:23 -- Background on sentiment analysis and studying feelings
- 00:07:09 -- What led you to study emotions?
- 00:08:57 -- Dimensional emotion classification
- 00:10:42 -- Different types of sentiment analysis
- 00:14:28 -- Lexicon-based approaches
- 00:17:50 -- VADER - Valence Aware Dictionary and sEntiment Reasoner
- 00:19:41 -- TextBlob and subjectivity scoring
- 00:21:48 -- Sponsor: Sentry
- 00:22:52 -- Measuring sentiment of New Year resolutions
- 00:27:28 -- Lexicon-based approaches links for experimenting
- 00:28:35 -- Multiple language support in lexicon-based packages
- 00:35:23 -- Machine learning techniques
- 00:39:20 -- Tools for this approach
- 00:42:54 -- Video Course Spotlight
- 00:44:15 -- Advantages to the machine learning models approach
- 00:45:55 -- Large language model approach
- 00:48:44 -- Encoder vs decoder models
- 00:52:09 -- Comparing the concept of fine tuning
- 00:56:49 -- Is this a recent development?
- 00:58:08 -- Ways to practice with these techniques
- 01:00:10 -- Do you find this to be a promising approach?
- 01:07:45 -- Resourc

This video teaches you how to approach sentiment analysis in Python, including traditional lexicon-based and machine learning approaches, as well as using LLMs. You'll learn about various tools and packages, such as VADER, TextBlob, and NLTK, and how to fine-tune models for specific tasks.

Key Takeaways
  1. Train a machine learning model to classify emotions
  2. Create a dataset with text and emotions attached
  3. Fine-tune an LLM to predict emotions
  4. Distinguish instruction tuning (used for chat models) from fine-tuning an LLM to predict emotion labels
  5. Prepare data for sentiment analysis
💡 Fine-tuning LLMs can be an effective approach for sentiment analysis, but may require large amounts of data and computational resources.
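
The data-preparation steps listed above follow the three-way split described in the transcript: hand-label the reviews, then divide them into training, validation, and test sets. A minimal sketch of that split using only the standard library might look like this. The 80/10/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration; the episode does not specify any of them.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (text, hand-assigned sentiment label)


def split_dataset(
    examples: Sequence[Example],
    train_frac: float = 0.8,  # assumed proportion, not from the episode
    val_frac: float = 0.1,    # assumed proportion, not from the episode
    seed: int = 0,
) -> Tuple[List[Example], List[Example], List[Example]]:
    """Shuffle labeled examples and split them into train, validation, and test."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]                 # used to fit the model
    val = shuffled[n_train:n_train + n_val]    # used to compare intermediate models
    test = shuffled[n_train + n_val:]          # held back for the final model only
    return train, val, test


if __name__ == "__main__":
    reviews = [
        (f"review {i}", "positive" if i % 2 else "negative") for i in range(10)
    ]
    train, val, test = split_dataset(reviews)
    print(len(train), len(val), len(test))
```

Keeping the test set untouched until the final model is chosen is what lets its score stand in for performance on real, unseen reviews.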
