Natural Language Processing with spaCy & Python - Course for Beginners

freeCodeCamp.org · Beginner ·🧠 Large Language Models ·4y ago

Key Takeaways

This video course covers Natural Language Processing with spaCy and Python, including topics such as linguistic annotations, named entity recognition, word vectors, and pipelines.

Full Transcript

In this course you will learn all about natural language processing and how to apply it to real-world problems using the spaCy library. Dr. Mattingly is extremely knowledgeable in this area and he's an excellent teacher.

Hi, and welcome to this video. My name is Dr. William Mattingly and I specialize in multilingual natural language processing. I come to NLP from a humanities perspective. I have my PhD in medieval history, but I use spaCy on a regular basis to do all of my NLP needs. So what you're going to get out of this video over the next few hours is a basic understanding of what natural language processing, or NLP, is and also how to apply it to domain-specific problems, or problems that exist within your own area of expertise. I happen to use this all the time to analyze historical documents or financial documents for my own personal investments.

Over the next few hours you're going to learn a lot about NLP, language as a whole, and most importantly the spaCy library. I like the spaCy library because it's easy to use and it's also easy to implement general solutions to general problems with the off-the-shelf models that are already available to you. In part one of this video series I'm going to walk you through how to get the most out of spaCy with these off-the-shelf features. In part two we're going to start tackling some of the features that don't exist in off-the-shelf models, and I'm going to show you how to use rules-based pipes, or components, in spaCy to actually solve domain-specific problems in your own area: from the EntityRuler to the Matcher to actually injecting robust, complex regular expression, or regex, patterns in a custom spaCy component that doesn't actually exist at the moment. I'm going to be showing you all that in part two so that in part three we can take the lessons that we learned in part one and part two and actually apply them to solve a very common problem that exists in NLP, and that is information extraction from financial documents, so
finding things that are of relevance, such as stocks, markets, indexes, and stock exchanges. If you join me over the next few hours, you will leave this lesson with a good understanding of spaCy, a good understanding of the off-the-shelf components that are there, and a way to take the off-the-shelf components and apply them to your own domain.

If you join me in this video and you like it, please let me know in the comments down below, because I am interested in making a second part to this video that will explore not only the rules-based aspects of spaCy but the machine learning-based aspects of spaCy: teaching you how to train your own models to do your own things, such as training a dependency parser or training a named entity recognizer, things like this which are not covered in this video. Nevertheless, if you join me for this one and you like it, you will find part two much easier to understand. So sit back, relax, and let's jump into what NLP is, what kind of things you can do with NLP, such as information extraction, what the spaCy library is, and how this course will be laid out.

If you like this video, also consider subscribing to my channel, Python Tutorials for Digital Humanities, which is linked in the description down below. Even if you're not a digital humanist like me, you will find these Python tutorials useful because they take Python and make it accessible to students of all levels, specifically those who are beginners. I walk you through not only the basics of Python but also, step by step, some of the more common libraries that you need. A lot of the channel deals with texts or text-based problems, but other content deals with things like machine learning, image classification, and OCR, all in Python.

So before we begin with spaCy, I think we should spend a little bit of time talking about what NLP, or natural language processing, actually is. Natural language processing is the process by which we try to get a computer
system to understand, parse, and extract human language, oftentimes with raw text. There are a couple of different areas of natural language processing: there is named entity recognition, part-of-speech tagging, syntactic parsing, text categorization (also known as text classification), coreference resolution, and machine translation. Adjacent to NLP is another kind of computational linguistics field called natural language understanding, or NLU. This is where we train computer systems to do things like relation extraction, semantic parsing, question answering (this is where bots really come into play), summarization, sentiment analysis, and paraphrasing. NLP and NLU are used by a wide array of industries, from the finance industry all the way through to law and academia, with researchers trying to do information extraction from texts.

Within NLP there are a couple of different applications. The first, and probably the most important, is information extraction. This is the process by which we try to get a computer system to extract information that we find relevant to our own research or needs. So for example, as we're going to see in part three of this video when we apply spaCy to the financial sector, a person interested in finances might need NLP to go through and extract things like company names, stocks, and indexes, things that are referenced within news articles from Reuters to the New York Times to the Wall Street Journal. This is an example of using NLP to extract information. A good way to think about NLP's application in this area is that it takes in some unstructured data, in this case raw text, and extracts structured data, or metadata, from it. So it finds the things that you want it to find and extracts them for you. Now, while there are ways to do this with gazetteers and list matching, using an NLP framework like spaCy, which I'll talk about in just a second, has certain advantages, the main one being that you can use and leverage things that have been parsed syntactically or semantically, so things like
the part of speech of a word, things like its dependencies, things like its coreference. These are things that the spaCy framework allows you to do off the shelf, and also train into machine learning models and work into pipelines with rules. So that's one aspect of NLP and one way it's used.

Another way it's used is to read in data and classify it. This is known as text categorization, and we see that on the left-hand side of this image. Text categorization, or text classification (and we include sentiment analysis in this for the most part as well), is a way we take information into a computer system, again unstructured data, a raw text, and we classify it in some way. You've actually seen this at work for many decades now with spam detection. Spam detection is nearly perfect. It needs to be continually updated, but for the most part it is a solved problem. The reason why you have emails that automatically go to your spam folder is because there's a machine learning model that sits on the back end of your email server, and what it does is look at the emails, see if they fit the pattern for what it has seen as spam before, and assign them a spam label. This is known as classification.

This is also used by researchers, especially in the legal industry. Lawyers oftentimes receive hundreds of thousands of documents, if not millions of documents. They don't necessarily have the human time to go through and analyze every single document verbatim. It is important to get a quick umbrella sense of the documents without actually having to go through and read them page by page. And so what lawyers will oftentimes do is use NLP to do classification and information extraction. They will find keywords that are relevant to their case, or they will find documents that are classified according to the relevant fields of their case, and that way they can take a million documents and reduce them down to maybe only a handful, maybe a thousand, that they have to
read verbatim. This is a real-world application of NLP, or natural language processing, and both of these tasks can be achieved through the spaCy framework.

spaCy is a framework for doing NLP. Right now, as of 2021, it's only available, I believe, in Python. I think there is a community that's working on an application with R, but I don't know that for certain. spaCy is one of many NLP frameworks that Python has available. If you're interested in looking at all of them, you can explore things like NLTK, the Natural Language Toolkit, and Stanza, which I believe is coming out of the same program at Stanford. There are many out there, but I find spaCy to be the best of all of them for a couple of different reasons.

Reason one is that they provide for you off-the-shelf models that benchmark very well, meaning they perform very quickly and they also have very good accuracy metrics, such as precision, recall, and F-score. I'm not going to talk too much about the way we measure machine learning accuracy right now, but know that they are quite good. Second, spaCy has the ability to leverage current natural language processing methods, specifically transformer models, also known, usually kind of collectively, as BERT models, even though that's not entirely accurate; it allows you to use an off-the-shelf transformer model. And third, it provides the framework for doing custom training relatively easily compared to the other NLP frameworks that are out there. Finally, the fourth reason why I pick spaCy over other NLP frameworks is because it scales well. spaCy was designed by Explosion AI, and the entire purpose of spaCy is to work at scale. By at scale we mean working with large quantities of documents efficiently, effectively, and accurately. spaCy scales well because it can process hundreds of thousands of documents with relative ease in a relatively short period of time, especially if you stick with more rules-based pipes, which we're going to talk about in part two of this video. So those are the two
things you really need to know about NLP and spaCy in general. We're going to talk about spaCy in depth as we explore it, both through this video and in the free textbook I provide to go along with this video, which is located at spacy.pythonhumanities.com and should be linked in the description down below. This video and the textbook are meant to work in tandem. Some stuff that I cover in the video might not necessarily be in the textbook because it doesn't lend itself well to text representation, and the same goes for the opposite: some stuff that I don't have the time to cover verbatim in this video I cover in a little bit more depth in the book.

I think that you should try to use both of these. What I would recommend is doing one pass through this whole video: watch it in its entirety and get an umbrella sense of everything that spaCy can do and everything that we're going to cover. I would then go back and try to replicate each stage of this process in a separate window or on a separate screen and try to follow along in code. And then I would go back through a third time, watch the first part, where I talk about what we're going to be doing, and try to do it on your own without looking at the textbook or the video. If you can do that by your third pass, you'll be in very good shape to start using spaCy to solve your own domain-specific problems.

NLP is a complex field and applying NLP is really complex, but fortunately frameworks like spaCy make this process a lot easier. I encourage you to spend a few hours in this video and get to know spaCy, and I think you're going to find that you can do things that you didn't think possible in relatively short order. So sit back, relax, and enjoy this video series on spaCy.

In order to use spaCy, you're first going to have to install spaCy. Now there are a few different ways to do this depending on your environment and your operating system. I recommend going to spacy.io/usage and
entering the correct configuration that you're working with. So if you're using macOS versus Windows versus Linux, you can go through this very handy user interface and select the different features that matter most to you. I'm working with Windows, I'm going to be using pip in this case, I'm going to be doing everything on the CPU, and I'm going to be working with English. So I've established all of those different parameters, and it tells me exactly how to install it using pip in the terminal. So I encourage you to pause the video right now and go ahead and install spaCy however you want to. I'm going to be walking through how to install it within the Jupyter notebook that we're going to be moving to in just a second.

I want you to not work with the GPU at all. Working with spaCy on the GPU requires a lot more understanding about what the GPU is used for, specifically in training machine learning models. It requires you to have CUDA installed correctly, and it requires a couple of other things that I don't really have the time to get into in this video but that we'll be addressing in a more advanced spaCy tutorial video. So for right now I recommend selecting your OS, selecting either pip or conda, and then selecting CPU. And since you're going to be working through this video with English texts, I encourage you to select English right now and go ahead and just install, or download, the en_core_web_sm model. This is the small model; I'll talk about that in just a second.

So the first thing we're going to do in our Jupyter notebook is use the exclamation mark to delineate in the cell that this is a terminal command. We're going to say pip install spacy. Your output when you execute this cell is going to look a little different than mine. I already have spaCy installed in this environment, and so mine goes through and looks like this. Yours will actually go through and, instead
of saying "Requirement already satisfied", it'll be printing out the different things that it's actually installing to install spaCy and all of its dependencies.

The next thing that you're going to do, again following the instructions, is python -m spacy download and then the model that you want to download. So let's go ahead and do that right now. Let's say python -m spacy download, so this is a spaCy terminal command, and we're going to download en_core_web_sm. Again, I already have this model downloaded, so on my end the output is going to look a little different than it's going to look on your end as it prints off in the Jupyter notebook. If we give it just a second, everything will go through, and it says that it's collected it, it's downloading it, and we are all very happy now.

So now that we've got spaCy installed correctly and we've got the small model downloaded correctly, we can go ahead and start actually using spaCy and make sure everything's correct. The first thing we're going to do is import the spaCy library, as you would with any other Python library. If you're not familiar with this, a library is simply a set of classes and functions that you can import into a Python script so that you don't have to write a whole bunch of extra code. Libraries are massive collections of classes and functions that you can call. So when we import spacy, we're importing the whole spaCy library, and when we see something like this, we know that spaCy has imported correctly. As long as you're not getting an error message, everything was imported fine.

The next thing that we need to do is make sure that our en_core_web_sm, our small English model, was downloaded correctly. To do that we need to create an nlp object. I'm going to be talking a lot more about this as we move forward; right now this is just
troubleshooting to make sure that we've installed spaCy correctly and we've downloaded our model correctly. So we're going to use the spacy.load command. This is going to take one argument: a string that corresponds to the model that you've installed, in this case en_core_web_sm. If you execute this cell and you have no errors, you have installed spaCy correctly and you've downloaded the en_core_web_sm model correctly. So go ahead, take time and get all this stuff set up, pause the video if you need to, and then pop back and we're going to start actually working through the basics of spaCy.

I'm now going to move into an overview of what's within spaCy, why it's useful, and some of the basic features of it that you need to be familiar with, and I'm going to be working from the Jupyter notebook that I talked about in the introduction to this video. If we scroll down to the bottom of chapter one, the basics of spaCy, and you get past the install section, you get to this section on containers. So what are containers? Well, containers within spaCy are objects that contain a large quantity of data about a text. There are several different containers that you can work with in spaCy: there's the Doc, the DocBin, Example, Language, Lexeme, Span, SpanGroup, and Token. We're going to be dealing with the Lexeme a little bit in this video series, and we're going to be dealing with the Language container a little bit in this video series, but really the three big things that we're going to be talking about again and again are the Doc, the Span, and the Token.

I think when you first come to spaCy there's a little bit of a learning curve about what these things are, what they do, and how they are structured hierarchically, and for that reason I've created this, in my opinion, easy-to-understand image of what the different containers are. So if you think about what spaCy is as a pyramid, so a hierarchical system, we've got all these
different containers structured around really the Doc object. Your Doc container, or your Doc object, contains a whole bunch of metadata about the text that you pass to the spaCy pipeline, which we're going to see in practice in just a few minutes. The Doc object contains a bunch of different things. It contains attributes. These attributes can be things like sentences, so if you iterate over doc.sents you can actually access all the different sentences found within that Doc object. If you iterate over each individual item, or index, in your Doc object, you can get individual tokens. Tokens are going to be things like words or punctuation marks, something within your sentence or text that has a self-contained important value, either syntactically or semantically. So this is going to be things like words, a comma, a period, a semicolon, a quotation mark, things like this. These are all going to be your tokens, and we're going to see how tokens are a little different than just splitting words up with traditional string methods in Python.

The next thing that you should be familiar with are spans. Spans are important because they kind of exist within and without of the Doc object. So unlike the token, which is an index of the Doc object, a span can be a token itself, but it can also be a sequence of multiple tokens. We're going to see that at play. So imagine if you had spans in categories. Maybe span group one is places, so a single token might be a city like Berlin. But span group two could be something like full proper names of people, for example. So this could be, as we're going to see, Martin Luther King. This would be a sequence of tokens, a sequence of three different items in the sentence, that make up one span, or one self-contained item. So Martin Luther King would be a person who is a sequence of individual tokens. If that doesn't make sense right now, this image will be reinforced as we go through and learn more about spaCy in practice.
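
The Doc, Token, and Span containers described above can be sketched in a few lines. This is only a minimal illustration, not the notebook code from the video: it assumes spaCy v3, and it uses a blank English pipeline with the rule-based sentencizer (an assumption made here so the sketch runs without downloading en_core_web_sm). The example sentence is also made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based "sentencizer"
# stands in for en_core_web_sm so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical example text, not the Wikipedia file used in the video.
doc = nlp("Martin Luther King visited Berlin. He spoke to a large crowd.")

# Iterating over a Doc yields Token objects (words and punctuation marks).
tokens = [token.text for token in doc]

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in doc.sents]

# Slicing a Doc returns a Span: here three tokens that form one name.
name = doc[0:3]
# A Span can also cover just a single token.
city = doc[4:5]

print(tokens)
print(sentences)
print(name.text, "|", city.text)
```

Slicing with doc[start:end] is used here only to show that a span is a run of tokens inside the doc; with a trained model such as en_core_web_sm, spans like these would normally come from doc.ents instead.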
For right now I want you to just understand that the Doc object is the thing around which all of spaCy sits. This is going to be the object that you create, this is going to be the object that contains all the metadata that you need to access, and this is going to be the object that you try to essentially improve with different custom components, factories, and pipelines as you go through and do more advanced things with spaCy. We're going to see in just a few seconds how that Doc object is similar to the text itself but also how it's very, very different and much more powerful.

We're now going to be moving on to chapter two of this textbook, which is going to deal with getting used to the in-depth features of spaCy. If you want, pause the video, or keep this notebook or this book open separately from this video, and follow along as we go through and explore it in live coding. We're going to be talking about a few different things as we explore chapter two, and this will be a lot longer than chapter one. We're going to be not only importing spaCy but actually going through and loading up a model and creating a Doc object around that model, so that we can work with the Doc container in practice. Then we're going to see how that Doc container stores a lot of different features, or metadata, or attributes about the text, and while they look the same on the surface, they're actually quite different.

So let's go ahead and work within our same Jupyter notebook, where we've imported spaCy and have already created the nlp object. The first thing that I want to do is open up a text to start working with. Within this repo we've got a data folder. Within this data subfolder I've got a couple of different Wikipedia openings. I've got one on MLK that we're going to be using a little later in this video, and I have one on the United States; this is wiki_us, and that's going to be what we work with right now. So let's use our with operator and open up data/wiki_us.txt. We're going to just read that in as f, and then we're going to create this text object, which is going to be equal to f.read(). Now that we've got our text object created, let's go ahead and see what this looks like. So let's print text, and we see that it's a standard Wikipedia article, it follows that same introductory format, and it's about four or five paragraphs long, with a lot of the features left in, such as the brackets that delineate some kind of footnote. We're not going to worry too much about cleaning this up right now, because we're interested not so much in cleaning our data as in just starting to work with the Doc object in spaCy.

So the first thing that you're going to want to do is create a Doc object. It is oftentimes good practice, if you're only ever working with one Doc object in your script, to just call it doc. If you're working with multiple objects, sometimes you'll say doc1, doc2, doc3, or give each some kind of specific name so that your variables are unique and easily identifiable later in your script. Since we're just working with one Doc object right now, we're going to say doc is equal to nlp, so this is going to call our nlp model that we loaded earlier, in this case the en_core_web_sm model, and that's going to, for right now, just take one argument, and that's going to be the text itself, so the text object. If you execute that cell, you should have a Doc object now created.

Let's print off that doc object and see what it looks like. If you scroll down, you might be thinking to yourself, this looks very, very similar, if not identical, to what I just saw a second ago. And in fact, on the surface it is very similar to that text object that we gave to the nlp model, our pipeline. But let's see how they're different. Let's print off the length of text and let's print off the length of the doc object, and what we have here are two different numbers: our text is 3525 and our doc object is 652. What is
going on here? Well, let's get a sense by trying to iterate over the text object and over the doc object with a simple for loop. So we're going to say for token in text[:10], so we can iterate first over the first 10 indices of that text object, and we're going to print off the token. We get individual letters, as one might expect. But when we do the same thing with the doc object (let's go ahead and write this out: we're going to say for token in doc[:10], so we iterate over the first 10, and we print off the token), we see something very different. What we see here are tokens. This is why the Doc object is so much more valuable, and this is why the doc object has a different length than the text object. The text object is just counting up every instance of a character, a white space, a punctuation mark, etc. The doc object is counting individual tokens, so any word, any punctuation mark, etc. That's why they're of different lengths, and that's why when we print them off we see something different.

So you might now already be seeing the power of spaCy. On the surface, with nothing else being done, it allows you to split up your text into individual tokens without any effort at all. Now, those of you familiar with Python and different string methods might be thinking to yourself, I've got the split method, I can just use this to split up the text, I don't need anything fancy from spaCy. Well, you'd be wrong. Let me demonstrate this right now. So if I were to say for token in text.split(), I'm splitting up that text into, in theory, individual words; essentially the split method is splitting by individual white spaces. If I were to do that and iterate over the first 10 again, and I would just say print token, it looks good until you get down here, so until you get to "(U.S.A.". Well, why is it a problem? The problem is quite simple: there is a parenthesis mark right here, and this is where we have a huge
advantage with spaCy. spaCy automatically separates out these kinds of punctuation marks and removes them from individual tokens when they're not relevant to the token itself. Notice that U.S.A. has got periods within the middle of it. spaCy is not looking at that and thinking that each piece is some kind of unique token: a U, a period, an S, a period, an A, and a period. It's not seeing these as separate tokens. Rather, it's automatically identifying them as one thing, one tied-together single token that's a string of characters and punctuation. This is where the power of spaCy really lies, just on the surface level. Go ahead, spend a few minutes and play around with this, and then we're going to jump back here and start talking about how the Doc object has a lot more than just tokens within it. It's got sentences, and each token has attributes. We're going to start exploring these when you pop back.

If you're following along with the textbook, we're now going to be moving on to the next section, which is sentence boundary detection. In NLP, sentence boundary detection is the identification of sentences within a text. On the surface this might look simple. You might be thinking to yourself, I could simply use the split function and split up a text with a simple period, and that's going to give me all my sentences. Those of you who have tried to do this might already be shaking your heads and saying no. If you think about it, there's a really easy explanation for why this doesn't work. Were you to try to split up a text by period and presume that anything that occurs between periods is going to be an individual sentence, you would make a serious mistake when you get to things like U.S.A., especially in Western languages, where the period is used not only to delineate the change of a sentence but also to delineate abbreviations. So in U.S.A., for United States of America, each period represents an abbreviated word. You could write in rules to
account for this. You could write in rules that also include the other ways that sentences end, such as question marks and exclamation marks. But why do that? That's a lot of effort when the Doc object in spaCy does this for you. Let's go ahead and demonstrate exactly how that works. So let's say for sent in doc.sents. Notice that we're saying doc.sents, so we're grabbing the sents attribute of the doc object. Let's print off sent, and if you do that you are now able to print off every individual sentence. So the entire text has been tokenized at the sentence level. In other words, spaCy has used its sentence boundary detection, done all that for you, and given you all the sentences.

If you work with different models of different sizes, you're going to notice that certain models, the larger they get, tend to do better at sentence detection, and that's because machine learning models tend to do a little bit better than heuristic approaches. The en_core_web_sm model, while having some machine learning components in it, does not save word vectors, and so the larger you go with the models, typically the better results you're going to have with regard to sentence detection.

Let's go ahead and try to access one of these sentences. So let's create an object called sentence1, and we're going to make that equal to doc.sents[0], so we're going to try to grab that zero index, and let's print off sentence1. If we do this, we get an error. Why have we gotten an error? Well, it tells you why right here: it's a TypeError, and this means that this is not a type that can be indexed; it's not subscriptable, and that's because it is a generator. Now, if you're familiar with generators in Python, you might be thinking to yourself there's a solution for this, and in fact there is. If you want to index into a generator object, you need to convert it into a list. So let's say sentence1 is equal to list, so using the list function to convert doc.sents
into a list, and then outside of that we're going to grab the zero index, and then we're going to print off sentence1, and we grab the first sentence of that text. As we go deeper and deeper into spaCy, one by one, you're going to see the immense, incredible things you can use spaCy for with very, very minimal code. The Doc object does a lot of things for you that would take hours to actually write out in code with heuristic approaches. This is now a great way to segment an entire text up by sentence, and if you work with text a lot you'll already know that this has a lot of applications.

As we move forward we're going to not just talk about sentences. We're also going to be talking about token attributes, because within the Doc object are individual tokens. I encourage you to pause here and go ahead and play around with doc.sents a little bit, get familiar with how it works and what it contains, and try to convert it into a list. Once you've done that, pop back here and we'll continue talking about tokens.

This is where I really encourage you to spend a little bit of time with the textbook. Under token attributes in chapter two I have all the different major things that you're going to be using with regard to token attributes. I've provided for you the most important ones that you should probably be familiar with. We're going to see this in code in just a second, and I'm going to explain, with a little bit more detail than what's in the spaCy documentation, what these different things are, why they're useful, and how they're used. So let's go ahead and jump back into our Jupyter notebook and start talking about token attributes.

If you remember, the Doc object had a sequence of tokens, so for token in doc you could print off token. Let's just do this with the first 10, and we've got each individual token. What you don't see here
is that each individual token has a bunch of metadata buried within it these metadata are things that we call attributes or different things about that token that you can access through the spacey framework so let's go ahead and try to do that right now let's just work for right now with token number two which we're going to call token2 and we're going to grab from sentence one the second index let's print off that word and it should be states and in fact it is fantastic so now that we've got the word states accessed we can start kind of going through and playing around with some of the attributes that that word actually has now when you print it off it looks like a regular piece of text looks like just a string but it's got so much more buried within it now because it's been passed through our nlp model our pipeline from spacey so let's go ahead and say token2.text and i'm going to be saying token2.text if you're working within an ide like atom you're going to need to say print token2.text when we do this we see we get a string that just is states this is telling us that the dot text of the object the pure text corresponds to the word states this is really important if you need to extract the text itself from the token and not work with the token object which has behind it a whole bunch of different metadata that we're going to go through now and start accessing let's use the token left edge so we can say token2.left_edge and we can print that off well what's that telling us it's telling us that this token is part of a larger span made up of multiple tokens and that this is the leftmost token of that span more precisely the leftmost token of this token's syntactic subtree so this is going to be the word the as in the united states let's take a look at the right edge we can say token2.right_edge print that off and we get the word america so we're able to see where this token fits within a larger span in this case a noun chunk which
we're going to explore in just a few minutes but we also learn a lot about it kind of the different components so we know where to grab it from the beginning and from the very end so that's how the left edge and the right edge work we also have within this token2.ent_type this is going to be the type of entity now what you're seeing here is an integer so this is 384 in order to actually know what 384 means i encourage you to not really use that so much as ent_type with an underscore after it this is going to give you the string corresponding to number 384 in this case it is gpe or geopolitical entity we're going to be working with named entities a little bit in this video but i have a whole other book on named entity recognition it's at ner.pythonhumanities.com in which i explore all of ner both machine learning and rules based in a lot more depth let's go ahead and keep on moving on though and looking at different entity types here as well or not entity types attribute types so we're going to say token2.ent_iob_ all lowercase and again an underscore at the end and we get the string here i now iob is a specific kind of named entity code a b would mean that it's the beginning of an entity and i means that it's inside of an entity and o means that it's outside of an entity the fact that we're seeing i here tells us that this word states is inside of a larger entity and in fact we know that because we've seen the left edge and we've seen the right edge it's inside of the united states of america so it's part of a larger entity at hand we can also say token2.lemma_ with an underscore again after that and we get the word states this is the lemma form or the root form of the word this means that this is what the word looks like with no inflection if we were working with a verb in fact let's go ahead and do that right now let's grab sentence uh we're gonna grab sentence 1 index 12 which should be the word known and we're going to print off the
lemma for the word sorry it's a verb and we see the verb lemma as know so if we were to print off sentence one specifically index 12 we see that its original form is known so the lemma form uninflected is the verb know k-n-o-w another thing that we can access and we're going to see the power of this later on this might not seem important right now but i promise you it will be let's print off token what did i call this again token2 we're going to print that off but we're going to print off specifically the morph no underscore here just morph what you get is what looks like a really weird output a string with nountype equal to prop which means proper noun and number equal to sing which means singular we're going to talk a lot more about morphological analysis later on when we try to find and extract information from our texts but for right now understand that what you're looking at is the output of kind of what that word is morphologically so in this case it's a proper noun and it's singular if we were to take sentence one index 12 again and do morph we'd find out what kind of verb it is so it's a perfect past participle known perfect past participle remember being good at nlp is also being good with language so i encourage you to spend time and start getting familiar with those things that you might have forgotten about from like 5th grade grammar such as perfect participles and things like that because when you need to start creating rules to extract information you're going to find those pieces of information very important for writing rules we'll talk about that in a little bit though let's go back to our other attributes from the token so again let's go to token2 and we're going to grab the part of speech it's not what you might be thinking so not part of speech underscore but pos_ and we get propn this means that it is a proper noun it's more of a simpler kind of grammatical extraction as opposed to this morphological detailed extraction what
kind of noun it might be with regards to in this case singular so that's going to be how you extract the part of speech another thing you can do is you can extract the dependency relation with dep_ so in this case we can figure out what role it plays in the sentence in this case the noun subject and then finally the last thing i really want to talk about before we move into a more detailed analysis of part of speech is going to be token2.lang_ and what this grabs for you is the language of the doc object in this case we're working with something from the english language so en every language is going to have two letters that correspond to it these are universally recognized so that's going to be how you access different kinds of attributes that each token has and there's about 20 more of these or maybe not 20 maybe about 15 more of these that i haven't covered i gave you the ones that are the most important that i find to be used on a regular basis to solve different problems with regards to information extraction from the text so that's going to be where we stop here with token attributes and we're going to be moving on to part 2.5 of the book which is part of speech tagging i now want to move into kind of a more detailed analysis of part of speech within spacey and the dependency parser and how to actually analyze it really nicely either in a notebook or outside of a notebook so let's work with a different text for just a few minutes we're going to see why this is important it's because i'm working on a zoomed in screen and to make this sentence a little easier to understand we're going to just use mike enjoys playing football a very simple sentence and we're going to create a new doc object and we're going to call this doc2 that's going to be equal to nlp of text let's print off doc2 just to make sure that it was created and in fact we see that it was now that we've got it created let's iterate over the tokens within this and say for token in text we
want to print off token.text we want to see what the text actually is we want to see the token.pos_ and the token.dep_ oh it helps if you actually iterate over the correct object over the doc2 object and we see that we've got mike proper noun noun subject and enjoys verb it's the root playing in this case it's a verb and then we've got football the noun the direct object and a period which is the punctuation so we can see the basic grammatical structure of the sentence at play what's really nice from spacey is we have a way to really visualize this information and how these words relate to one another so we can say from spacy import displacy and we're going to do displacy.render and this is going to take two arguments it's going to be the text and then actually it's going to be doc2 and then it's going to be style in this case we're going to be working with dep and we're going to print that off and we actually see how the sentence is structured now in the textbook i use a more complicated sentence but for the reasons of this video i've kept it a little shorter just because i think it displays better on the screen because you can see that this becomes a little bit more difficult to understand when you're zoomed in but this is one sentence from that wikipedia article so go ahead and look at the textbook and see how elaborate this is you can see how it's part of a compound how it's a preposition you can see the more fine-grained aspects of the dependency parser and the part of speech tagger really at play with more complicated sentences so that's going to be how you really access part of speech and how you can start to visualize how words in a sentence are connected to other words in the sentence with regards to their part of speech and their dependencies that's going to be where we stop with that in the next section we're going to be talking about named entity recognition and how to visualize that information so
named entity recognition is a very common nlp task it's part of kind of data extraction or information extraction from texts it's oftentimes just called ner named entity recognition i have a whole book on how to do ner with python and with spacey but we're not going to be talking about all the ins and outs right now we're just going to be talking about how to access the pieces of information throughout our text and then we're going to be dealing with a lot of ner as we try to create elaborate systems to do named entity extraction for things like financial analysis let's go ahead and figure out how to iterate over a doc object so we're going to say for ent in doc.ents so we're going to go back to that original doc the one that's got the text from wikipedia on the united states we're going to say print off ent.text so the text from it and ent.label_ with the underscore here that's going to tell us what label corresponds to that text and we print this off we've got a lot of gpes which are geopolitical entities north america this isn't a geopolitical entity it's just a general location 50 a cardinal number five a cardinal number norp indian in this case which is a nationality or religious or political group quantity the number of miles canada gpe as you would expect paleo indians norp once again siberia loc and we have date being extracted so at least 12 000 years ago this is a small model and it's extracting for us a lot of very important structured data but we can see that the small model makes mistakes so the revolutionary war is being considered an organization were i to use a large model right now which i can download separately from spacey we're going to be seeing this later in this video or were i to use the much larger transformer model this would be correctly identified most likely as an event not as an organization but because this is a small model that doesn't contain word vectors which we're going to talk about in
just a little bit it does not generalize or make predictions well on this particular data nevertheless we do see really good extraction here we have the american civil war being extracted as an event we have the spanish-american war even with this encoding typographical error here and world war being extracted as an event world war ii event cold war event all of this is looking good i only saw a couple basic mistakes but for the most part this is what you'd expect to see we even see percentages extracted correctly here so this is how you access really vital information about your tokens but more importantly about the entities found within your text and also displacy offers a really nice way to visualize this in a jupyter notebook we can say displacy.render we can say doc and for style we can say ent and we get this really nice visualization where each entity has its own particular color so you can see where these entities appear within the text as you kind of just naturally read it you can do this with a text as long as you want you can even change the max length to be more than a million characters long and again we can see right here the american revolutionary war is incorrectly identified as org but nevertheless we see really really good results with a small english model without a lot of custom fine-tuned training and there's a reason for this a lot of wikipedia data gets included into machine learning models so machine learning models on text typically make good predictions on wikipedia data because it was included in their training process nevertheless these are still good results if i'm right or wrong on that i'm not entirely certain but that's going to be how you kind of extract important entities from your text and most importantly visualize it this is where chapter 2 of my book kind of ends after this chapter you have a good understanding hopefully of kind of what the doc container is what tokens are and how the
doc object contains the attributes such as sents and ents which allow you to find sentences and entities within a text hopefully you also have a good understanding of how to access the linguistic features of each token through token attributes i encourage you to spend a lot of time becoming familiar with these basics as these basics are the building blocks for really robust things that we're going to be getting into in the next few lessons we're now moving into chapter 3 of our textbook on spacey and python now in chapter three we're going to be continuing our theme of part one where we're trying to understand the larger building blocks of spacey even though this video is not going to deal with spacey machine learning approaches or custom ones that is it's still important to be familiar with what machine learning is and how it works specifically with regards to language because a lot of the spacey models such as the medium large and transformer models all are machine learning models that have word vectors stored within them this means that they're going to be larger more accurate and do things a bit more slowly depending upon their size so we're going to be working through not only what machine learning is generally but specifically how it works with regards to texts and i think that this is where you're going to find this textbook to be somewhat helpful so what i want to do is in our new jupyter notebook we're going to import spacy just as we did before but this time we're going to be installing a new model so we're going to do exclamation mark python -m spacy download and then we're going to download the en_core_web_md model so this is the medium english model this is going to take a little longer to download and the reason why i'm having you download the medium model and the reason why we're going to be using the medium model is because the medium model has stored within it word vectors while that's downloading let's go ahead
and talk a little bit about what word vectors are and how they're useful so word vectors are word embeddings so these are numerical representations of words in multi-dimensional space through matrices that's a very compacted sentence so let's break it down what are word vectors used for well they're used for a computer system to understand what a word actually means so computers can't really parse text all that efficiently in fact they can't parse it at all every word needs to be converted into some kind of a number now for some old approaches you would use something like a bag of words approach where each individual word would have a corresponding number to it this would be a unique number that corresponds just to that word for a lot of tasks that can work but for something like text understanding or trying to get a computer system to be able to understand how a word functions within a sentence in general so in other words how it works in the language how it relates to all other words that doesn't really work for us so what a word vector is is it's a multi-dimensional representation so instead of a word having just a single integer that corresponds to it it instead has what looks like to an unsuspecting eye a very complex sequence of floating point numbers that are stored as an array which is a computationally less expensive form of a list in python or just computing in general and this is what it looks like a long sequence in this case i believe it's a 300 dimensional vector that corresponds to a specific word so this is what an array or a word vector or a word embedding looks like what this means to a computer system is it means syntactical and semantical meaning so the way word vectors are typically trained is oh there's a few different approaches but kind of the old school word2vec approach is you give a computer system a whole bunch of texts and different smaller and larger collections of texts and what it does is it reads through
all of them and figures out how words are used in relation to other words and so what it's able to essentially do through this training process is figure out meaning and what that meaning allows for a computer system to do is understand how a word might relate to other words within a sentence or within a language as a whole in order to understand this i think it's best if we move away from this textbook and actually try to explore what word vectors look like in spacey so you can have a better sense of specifically what they do why they're useful and how you as an nlp practitioner can go ahead and start leveraging them so just like before we're going to create an nlp object this time however instead of loading in our en_core_web_sm we're going to load in our en_core_web_md so the one that actually has these word vectors these static vectors saved and it's going to be a larger model let's go ahead and execute that cell and while that's executing we're going to start opening up our text so we're going to say with open data/wiki_us.txt r as f we're going to say text is equal to f.read so we're going to successfully load in that text file and open it up then we're going to create our doc object which will be equal to nlp of text all the syntax is staying the exact same and just like before let's grab the first sentence so we're going to convert our doc.sents generator into a list and we're going to grab index zero and let's go ahead and print off sentence one just so you can kind of see it and there it is so now that we've got that kind of in memory we can start kind of working with it a little bit so let's go ahead and just start tackling how we can actually use word vectors with spacey so let's kind of think about a general question right now let's say i wanted to know how the word let's say country is similar to other words within our model's word embeddings so let's create a little way we can do this we're going to say your_word and this is going
to be equal to the word country there we go and what we can do is we can say ms is equal to nlp so we're going to go into the nlp object we're going to grab the vocab.vectors and we're going to say most_similar and this is a little complicated way of doing it in fact i'm going to go ahead and just kind of copy and paste this in you have the code already in your textbook that you can follow along with and i'm going to go ahead and just copy and paste it in right here and print off this what this is going to do is it is going to go ahead and just do this entirely there we go and we have to import numpy as np this lets us actually work with the data as a numpy array and when we execute this cell what we get is an output that tells us all the words that are most similar to the word country so in this scenario the word country has all these different similar words to it from the word country to the word country capitalized nation nation now it's important to understand what you're seeing here what you're seeing is not necessarily a synonym for the word country rather what you're seeing are the words that are the most similar now this can be anything from a synonym to a variant spelling of that word to something that occurs frequently alongside of it so for example world we would never consider world to be the synonym of country but what happens is syntactically they're used in very similar situations so the way you describe a country is sometimes the way you would describe your world or maybe it's something to do with the hierarchy so a country is found within the world this is a good way to understand it so it's always good to describe these words as most similar and not as synonyms so when you're talking about word vector similarity you're not talking about synonym similarity keep that in mind but this is a way you can kind of quickly get a sense so what does this do for you why did i go
through and explain all these things about word vectors if i'm not going to be talking about machine learning a whole bunch throughout this video well i did it so that you can do one thing that's really important and that's calculate document similarity in spacey so we've already got our nlp model loaded up let's create one doc object so we're going to make doc1 we're going to make that equal to nlp and we're going to create the text right here in this object so let's say this is coming straight from the spacey documentation i like salty fries and hamburgers and we're going to say doc2 is equal to nlp and this is going to be the text fast food tastes very good and now let's go ahead and load those into memory what we can do is we can actually make a calculation using spacey to find out how similar these two different sentences actually are so we can say print off doc1 and again this is coming straight from the spacey documentation doc2 so you're going to be able to see what both documents are and then we're going to do doc1.similarity so we can go into the doc1.similarity method and we can compare it to doc2 we can print that off so what we're seeing here on the left is document one then this little divider thing that we printed off and here on the right we have document two and then we can see the degree of similarity between document 1 and document 2.
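A rough sketch of what a score like this represents, assuming spaCy's documented default for doc.similarity, which is the cosine similarity of the averaged word vectors. The three-dimensional vectors below are invented toy values, not numbers from en_core_web_md, and the names TOY_VECTORS, doc_vector and cosine are hypothetical helpers written only for this illustration.

```python
import math

# Toy 3-dimensional "word vectors". These numbers are invented for
# illustration; the real medium-model vectors have 300 dimensions.
TOY_VECTORS = {
    "salty": [0.9, 0.1, 0.0],
    "fries": [0.8, 0.3, 0.1],
    "hamburgers": [0.7, 0.4, 0.1],
    "fast": [0.6, 0.2, 0.3],
    "food": [0.8, 0.4, 0.2],
    "building": [0.0, 0.2, 0.9],
    "york": [0.1, 0.1, 0.8],
}


def doc_vector(words):
    """Average the vectors of the words we have toy vectors for."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        raise ValueError("none of the words has a vector")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


food_1 = doc_vector(["salty", "fries", "hamburgers"])
food_2 = doc_vector(["fast", "food"])
landmark = doc_vector(["building", "york"])

# The two food "documents" point in nearly the same direction,
# while the landmark "document" points somewhere else entirely.
print(round(cosine(food_1, food_2), 2))
print(round(cosine(food_1, landmark), 2))
```

With real vectors the same arithmetic runs over every token in each doc, which is why two sentences can score as similar even when they share no words.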
let's create another doc object we're going to call this doc3 and we're going to make this nlp let's come up with a sentence that's completely different the empire state building is in new york so this is one i'm just making up off the top of my head right now i'm going to copy and paste this down and we're going to compare doc1 to doc3 and we get a score of 0.51 so this pair is less similar than those first two so this is a way that you can take a whole bunch of documents you can create a simple for loop and you can find and start clustering the documents that have a lot of overlap or similarity how is this similarity being calculated well it's being calculated because what spacey is doing is it's going into its word embeddings and even though in these two situations we're not using the word fast food ever in this document it's going in and it knows that salty fries and hamburgers are probably in a close cluster with the bigram or a token that's made up of two words a bigram of fast food so what it's doing is it's assigning a prediction that these two are still somewhat similar more similar than these two because of these overlapping words so let's try one more example see if we get something that's really really close so let's take doc4 and this is going to be equal to nlp i enjoy oranges and then we're going to have doc5 is going to be equal to nlp i enjoy apples so two i would argue very very syntactically similar sentences and we're going to do doc4 here doc5 here and we're going to look and see a similarity between doc4 and doc5.
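The simple for loop idea mentioned above can be sketched like this: score every pair of documents and sort the pairs so the closest ones come first, which is a first step toward grouping them. This is only an illustration under stated assumptions. The document vectors in DOC_VECTORS are invented toy values (with spaCy they might come from doc.vector on a model that ships vectors), and rank_pairs is a hypothetical helper, not part of spaCy.

```python
import math
from itertools import combinations

# Hypothetical document vectors, invented for illustration only.
DOC_VECTORS = {
    "i like salty fries and hamburgers": [0.8, 0.3, 0.1],
    "fast food tastes very good": [0.6, 0.4, 0.3],
    "the empire state building is in new york": [0.1, 0.2, 0.9],
    "i enjoy oranges": [0.3, 0.9, 0.1],
    "i enjoy apples": [0.3, 0.8, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_pairs(doc_vectors):
    """Return (score, text_a, text_b) for every pair, most similar first."""
    scored = [
        (cosine(doc_vectors[a], doc_vectors[b]), a, b)
        for a, b in combinations(doc_vectors, 2)
    ]
    return sorted(scored, reverse=True)


ranked = rank_pairs(DOC_VECTORS)

# Show the three closest pairs; a threshold on the score could then
# decide which documents belong in the same group.
for score, text_a, text_b in ranked[:3]:
    print(f"{score:.2f}  {text_a!r} <-> {text_b!r}")
```

Comparing every pair grows quadratically with the number of documents, so for a large collection a dedicated clustering or nearest-neighbour approach is likely a better fit than this loop.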
if we execute this we get a similarity of 0.96 so this is really high this is telling me that these two sentences are very similar and it's not just that they're similar because of the similar syntax here that's definitely pushing the number up it's that what the individual is liking in the scenario between these two texts they're both fruits let's try something different let's just make doc6 here and do something like this nlp i enjoy what's another word we could say something that's different let's say burgers something different from a fruit so we're gonna make doc6 like that and we're gonna again copy and paste this down we're gonna put doc6 here and we see this drop so what this demonstrates i'm really glad this worked because i improvised this what this demonstrates is that the similarity number that's given is not dependent on the contextual words rather it's dependent upon the semantic similarity of the words so apples and oranges are in a similar cluster around fruit because of their word embeddings the word burgers while still being food and still being plural is different from apples and oranges so in other words this similarity is being calculated based on the way we humans would judge a difference in meaning based on a large understanding of a language as a whole that's where word vectors really come into play this allows you to calculate other things as well so you could even calculate the similarity between salty fries and hamburgers for example i've got this example ready to go in the textbook let's go ahead and try this as well so we're going to grab doc1 and print off these few things right here so we're gonna try to calculate the similarity between french fries and burgers and what we get is a similarity of 0.73 so if we were to maybe change this up a little bit and try to calculate the similarity between maybe just the word burgers rather than hamburgers and
burgers and hamburgers we would have a much higher similarity so my point is play around with the similarity calculator play around with the structure of the code i provided here and get familiar with how spacey can help you find similarity not just between documents but between words as well and we're going to be seeing how this is useful later on but again it's good to be familiar with kind of generally how machine learning functions here in this context and why these medium and large models are so much bigger they're so much bigger because they have more word vectors that are much deeper and the transformer model is much larger because it was trained in a completely different method than the way the medium and large models were trained but again that's out of the scope for this video i now want to turn to really the last subject of this introduction to spacey part one which is when we're taking this large umbrella view of spacey and in the textbook it's going to correspond to chapter four so what we go over in this textbook is kind of a large view of not just the doc container and the word vectors and the linguistic annotations but really kind of the structure of the spacey framework which revolves around the pipeline so a pipeline is a very common expression in computer science and in data science think of it as a traditional pipeline that you would see in a house now think of a pipeline being a sequence of different pipes each pipe in a computer system is going to perform some kind of transformation or some action on a piece of data as it goes through the pipeline and as each pipe has a chance to act and make changes and additions to that data the later pipes get to benefit from those changes so this is very common when you're thinking about logic of code i provided like a little image here that i think maybe might help you so if we imagine some input sentence right so some input text is entering a spacey pipeline it's going
to go through a bunch of things if you're working with the medium model or the small model it'll tokenize it and give it word vectors for different words it'll also you know find the pos the part of speech the dependency parser will act on it but it might eventually get to an entity ruler which we're going to see in just a few minutes the entity ruler will be a series of rules-based ner named entity recognition so it'll maybe assign a token to an entity it might be the beginning of an entity might be the end of an entity might just be an individual token entity and then what will happen is that doc object as it kind of goes through this pipeline will now receive a bunch of doc.ents so this pipe will actually add the entities to the doc object as it goes through the pipeline and then the next pipe the entity linker might take all those entities and try to find out which ones they are so it'll oftentimes be connected to some kind of wikidata some kind of standardized number that corresponds to a specific person so for example if you were seeing a bunch of things like paul something paul something maybe that one paul something might be paul hollywood from the great british bake off and it might have to make a connection to a specific person so if it's the word paul being used generally this entity linker would assign it to paul hollywood depending on the context that's out of the scope of this video series but keep in mind that that pipe would do something else that would modify the ents that would give them greater specificity and then what you'd be left with is the doc object on the output that not only has entities annotated but it's also got entities linked to some specific data so that's going to be how a pipeline works and this is really what spacey is it's a sequence of pipes that act on your data and that's important to understand because it means that as you add things to a
spacey pipeline you need to be very conscientious about where they're added and in what order as we're going to see as we move over to kind of rules-based spacey when we start talking about these different pipes the entity ruler the matcher custom components regex components you're going to need to know which order to put them in it's going to be very important so do please keep that in mind now spacey has a bunch of different pipes you can kind of add into it you've got the attribute ruler you've got dependency parsers that are going to come standard with all of your models you've got the entity linker entity recognizer entity ruler you're going to have to make these yourself and add them in oftentimes you've got a lemmatizer this is going to be on most of your standard models your morphologizer that's going to be on there as well sentence recognizer sentencizer this is what allows you to have the doc.sents right here span categorizer this will help categorize different spans be they single token spans or sequences of tokens your tagger this will tag the different things in your text which will help with part of speech your text categorizer this is when you train a machine learning model to recognize different categories of a text so text classification which is a very important machine learning task tok2vec this is going to be what assigns word embeddings to the different words in your doc object tokenizer is what breaks all your text up into individual tokens and you've got things like transformer and trainable pipes then within this you've also got some other things called matchers so you can do some dependency matching we're not going to get into that in this video you've also got the ability to use matcher and phrase matcher these a lot of the time can do some similar things but they're executed a little differently to make things less confusing i'm really only talking about the matcher of these two and if there's a need
for it i'll add the phrase matcher into the textbook at a later date but i'm not going to cover it in this video and if i do add in the phrase matcher it's going to be after this matcher section here i have it in the github repo i just haven't included it in the textbook to keep things a little bit simpler at least if you're just starting out so a good question is well how do you add pipes to a spaCy pipeline so let's go ahead and do that we're going to make a blank spaCy pipeline right now we'll just work with the same live coding notebook that we have open right now so what we're going to do is we're going to make a blank model and we're going to actually add in our own sentencizer so let's go ahead and do that so i'm going to say nlp is equal to spacy.blank this is going to allow for me to make a blank spaCy pipeline and i'm going to say en so that it knows that the tokenizer that i need to use is the english tokenizer and now if i want to add a pipe to that i can use one of the built-in spaCy features so i can say nlp.add_pipe and i can say sentencizer so i can add in a sentencizer this is going to allow for me to create a pipeline now that has a sequence of two different pipes and i demonstrate in the textbook why this is important sometimes what you need to do is just break down a text into individual sentences so i grabbed a massive corpus from the internet which is on mit.edu and it's the entire shakespeare corpus and i just tried to calculate the quantity of sentences found within it there are 94,133 sentences and it took me only 7.54 seconds to actually go through and count those sentences with the blank model and the sentencizer using the small model however it took a total of 47 minutes to actually break down all those sentences and extract them why is there a difference in time between 7 seconds and 47 minutes it's because this spaCy small model has a
bunch of other pipes in it that are trying to do a bunch of other things if you just need to do one task it's always a good idea to just activate one pipe or maybe make a blank model and just add that single pipe or the only pipes that you need to it a great example of this is needing to tokenize a whole bunch of sentences in a relatively short time so i don't know about you but i'd be much happier with 7 seconds versus 47 minutes that however comes at a trade-off the small model is going to be more accurate in how it finds sentence boundaries and so we have a difference in quantity here this difference in quantity indicates that this one messed up and made some mistakes because it was just the sentencizer the sentencizer didn't have extra data being fed to it in fact if i used larger models i might even have better results but always think about that if time is of the essence and you don't care so much about accuracy a great way to get the quantity of sentences or at least a ballpark is to use this method where you simply add in a sentencizer to a blank model so that's how you actually add in different pipes to a spaCy pipeline and we're going to be reinforcing that skill as we go through especially in part two where we really work with this in a lot of detail right now i'm just interested in giving you the general understanding of how this might work so let's go ahead and try to analyze our pipeline so we can do nlp.analyze_pipes and we can actually analyze our pipeline if we look at the nlp object which is our blank model with the sentencizer we see our nlp pipeline ignore the summary bit here but what you're actually able to go through and see right away is that we've really just got the sentencizer sitting in it if we were to analyze a much more robust pipeline so let's create nlp2 as equal to spacy.load en_core_web_sm we're going
to create that nlp2 object around the small spaCy english model we can analyze the pipes again and we see a much more elaborate pipeline so what are we looking at well what we're looking at is a sequence of things in the pipeline after the tok2vec we've got a tagger a parser we keep on going down we've got an attribute ruler we've got a lemmatizer we've got the ner that's what assigns the doc.ents and we keep on going down we can see the lemmatizer but we can also see a whole bunch of other things we can see what these different things actually assign so the ner assigns doc.ents and we can also see what each pipe might actually require so if we look up here we see that the ner pipe so the named entity recognition pipe is responsible for assigning the doc.ents so that attribute of the doc object and it's also responsible at the token level for assigning the token.ent_iob which is if you remember from a few minutes ago when we talked about the iob being the inside outside beginning marking for an entity it also assigns the token.ent_type for each token so you can see a lot of different things about your pipeline by using nlp.analyze_pipes if you've gotten to this point in the video then i think you should by now have a good umbrella view of what spaCy is how it works why it's useful and some of the basic features that it has and how it can solve some pretty complex problems with some pretty simple lines of code what we're going to see now moving forward is how you as a practitioner of nlp can not just take what's given to you with spaCy but start working with it and start leveraging it for your own uses so taking what is already available so like these models like the english model and adding to them contributing to them maybe you want to make an entity ruler where you can find more entities in a text based on some gazetteer or list that you have
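As a recap of the pipeline steps walked through above (a blank English pipeline, adding a sentencizer, then inspecting the result with analyze_pipes), here is a minimal sketch. The sample text is a hypothetical stand-in for the Shakespeare corpus, and the blank pipeline is used so that no trained model needs to be downloaded.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Add the built-in rule-based sentence splitter by its registered name.
nlp.add_pipe("sentencizer")

# Hypothetical sample text, standing in for a much larger corpus.
text = "This is a sentence. This is another one. And here is a third."
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)   # names of the components added after the tokenizer
print(len(sentences))   # number of sentences the sentencizer found

# analyze_pipes returns a dictionary describing what each pipe assigns
# and requires; pass pretty=True to print the table seen in the notebook.
analysis = nlp.analyze_pipes()
print(analysis["summary"]["sentencizer"]["assigns"])
```

Swapping `spacy.blank("en")` for `spacy.load("en_core_web_sm")` (which needs the model installed) would show the longer pipeline described above, at the cost of running every pipe on the text.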
maybe you want to make a matcher so you can find specific sequences within a text maybe that's important for information extraction maybe you need to add custom functions or components into a spaCy pipeline i'm going to be going through rules-based spaCy in part two and giving you all the basics of how to do some really robust custom things relatively quickly within the spaCy framework all of that's going to lay the groundwork so that in part three we can start applying all these skills and start solving some real world problems in this case we're going to look at financial analysis so that's going to be where we move to next is part two we are now moving into part two of this jupyter book on spaCy and we're going to be working with rules-based spaCy now this is really the bread and butter of this video you've gotten a sense of the umbrella structure of spaCy as a framework you've gotten a sense of what the doc container can contain you've gotten a sense of the token attributes and the linguistic annotations from part one of this book in the earlier part of this video now we're going to move into taking those skills and really developing them into custom components and modified pipes that exist within spaCy in other words i'm going to show you how to take what we've learned now and start really doing more robust and sophisticated things with that knowledge so we're going to be working first with the entity ruler then with the matcher in the next chapter then with custom components in spaCy so a custom component is a custom function that you can put into a pipeline then we're going to talk about regex or regular expressions and then we're going to talk about some advanced regex with spaCy if you don't know what regex is i'm going to cover this in chapter 8.
so let's go over to our jupyter notebook that we're going to be using for our entity ruler lesson so let's go ahead and execute some of these cells and then i'm going to be talking about it in just a second first i want to take some time to explain what the entity ruler is as a pipe in spaCy what it's used for why you'd find it useful and when to actually implement it so there are two different ways in which you can add custom features to a spaCy language pipeline there is a rules-based approach and a machine learning-based approach rules-based approaches should be used when you can think about how to generate a set of rules based on either a list of known things or a set of rules that can be generated through regex code or linguistic features machine learning is for when you don't know how to actually write out the rules or the rules that you would need to write out would be exceptionally complicated a great example of a rules-based approach versus a machine learning-based approach and when to use them is with entity types for named entity recognition imagine if you wanted to extract dates from a text there are a finite very finite number of ways that a date can appear in a text you could have something like january 1 2005 you could have 1 january 2005 you could have 1 jan 2005 you could have 1/5/2005.
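To make the point about a finite set of date formats concrete, here is a minimal sketch of one regular expression that covers the four formats just listed. The names MONTH and DATE_RE and the sample sentence are illustrative assumptions, and this is not the pattern spaCy itself uses.

```python
import re

# Month names, full or abbreviated, with an optional trailing period.
MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
    r"jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
    r"nov(?:ember)?|dec(?:ember)?)\.?"
)

DATE_RE = re.compile(
    rf"""
    \b(?:
        {MONTH}\s+\d{{1,2}},?\s+\d{{4}}    # January 1, 2005
      | \d{{1,2}}\s+{MONTH}\s+\d{{4}}      # 1 January 2005 or 1 Jan 2005
      | \d{{1,2}}/\d{{1,2}}/\d{{4}}        # 1/5/2005
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical sample sentence containing one date in each format.
sample = (
    "He wrote on January 1, 2005, again on 1 January 2005, "
    "on 1 Jan 2005 and on 1/5/2005."
)
found = DATE_RE.findall(sample)
print(found)
```

The sketch only checks the shape of the string, so it would also accept impossible dates such as 99/99/2005; a stricter rule set would validate the day and month ranges.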
there's different ways that you can do this and there's a lot of them but there really is a finite number so you could easily write a regular expression to capture all of those and in fact those regular expressions already exist that's why spaCy is already really good at identifying dates so dates are something that you would probably use a rules-based approach for something that's a good fit for a machine learning approach is something like names if you wanted to capture the names of people you would have to generate an entity ruler with a whole bunch of robust features so you would have to have a list of all known possible first names all known possible last names all known possible prefixes like doctor mr mrs miss ms master etc and you'd have to have a list of all known suffixes so junior senior the third the fourth etc on the list this would be very very difficult to write because first of all the quantity of first names that exist in the world is massive the quantity of last names that exist in the world is massive there's not a set gazetteer or set list out there of these anywhere so for this reason oftentimes things like people names will be worked into machine learning components i'm going to address machine learning in another video at a later date for right now we're going to focus on a rules-based approach so using the rules-based features that spaCy offers a good nlp practitioner will be excellent at both rules-based approaches and machine learning-based approaches and at knowing when to use which approach and when maybe a task is not appropriate for machine learning when it can be worked in with rules relatively well if you're taking a rules-based approach you should have a high degree of confidence that the rules will always return true positives and you need to think about that if you are okay with your rules maybe catching a few false positives or missing a few true positives then
maybe think about how you write the rules and allowing for those and making it known in your documentation so that's generally what a rules-based approach is and an entity ruler is a way that we can use a list or a series of language features to add tokens into the doc.ents container within the doc container so let's go ahead and try to do this right now the text we're going to be working with is a kind of fun one i think so if you've already gotten the reference congratulations it's kind of obscure but we're going to have a sentence right here that i just wrote out west chesterton fieldville was referenced in mr deeds so in this context we are going to have a few different entities we want our model or our pipeline to extract west chesterton fieldville as a gpe it's a fake place that doesn't really exist it was made up in the movie mr deeds and what we want is for mr deeds to be grabbed as an entity as well and this would ideally be labeled as a film but in this case that's probably not going to happen let's go ahead and see what does happen so we're going to say for ent in doc.ents print off ent.text and ent.label_ like we learned from our ner lesson a few moments ago and we see that the output looks like this it's gotten almost all the entities that we wanted mister was left off of deeds and it's grabbed west chesterton fieldville and labeled it as a person so what's gone wrong here well there's a few different things that have gone wrong the en_core_web_sm model is a machine learning model for ner the word vectors are not saved so the static vectors are not in it so it's making the best prediction that it can but even with a very robust machine learning model unless it has seen west chesterton fieldville there is not really a good way for the model to actually know that that's a place unless it's seen a structure like west chesterton and maybe it can make a guess a transformer model might actually get this right but for
the most part this is a very challenging thing this would be challenging for a human there's not a lot of context here to tell you what kind of entity this is unless you knew a lot about how maybe northeastern villages and towns in north america would be named also mr deeds is not extracted as a whole entity just deeds is now ideally we would have an ner model that would label west chesterton fieldville as a gpe and mr deeds as a film but we've got two problems one the machine learning model doesn't have film as an entity type and on top of that west chesterton fieldville is not coming out correctly as a gpe so our goal right now is to fix both of these problems with an entity ruler this would be useful if i were maybe doing some text analysis on fictional places referenced in films so things like narnia maybe middle earth west chesterton fieldville these would all be classified as fictional places so let's go ahead and make a ruler to correct this problem so what we're going to do is first we're going to make a ruler by saying ruler is equal to nlp.add_pipe and this is going to take one argument here you're going to find out when we start working with custom components that you can have a few different arguments here especially if you create your own custom components but for right now we're working with the components that come standard with spaCy there's about 18 of them one of them is the entity_ruler all lower case we're going to add that ruler into our nlp model and if we do nlp.analyze_pipes and execute that we can now look at our pipeline and see as we go down that the ner pipe is here and the entity ruler is now the final pipe in our pipeline so we see that it has been successfully added let's go ahead now and try to add patterns into that pipeline patterns are the things that the spaCy model is going to look for and the label that it's going to assign when it finds something that meets that pattern this
will always be a list of dictionaries so let's go ahead and do this right now so the first pattern that we're really looking for here is going to be a dictionary it's going to have one key of label which is going to be equal to gpe and another key of pattern which is going to be equal to in this case west chesterton fieldville let me go ahead and just copy and paste it so i don't make a mistake here and what we want to do is we want our entity ruler to see west chesterton fieldville and when it sees it assign the label of gpe so it's a geopolitical entity so it's a place so let's go ahead and execute that great we've got the patterns now comes time to load them into the ruler so we can say ruler.add_patterns this is going to take one argument it's going to be our list of patterns add it in cool now let's create a new doc object we're going to call this doc2 that's going to be equal to nlp we're going to pass in that same text we're going to say for ent in doc2.ents print off ent.text and ent.label_ you're going to notice that nothing has changed so why has nothing changed we're still getting the same results and we've added the correct pattern in the answer lies in one key thing if we look back up here we see that our entity ruler comes after our ner what does that mean well imagine how the pipeline works that i talked about a little while ago in this video a pipeline works by different components adding things to an object and making changes to it in this case adding ents to it and then those things are isolated from later pipes so later pipes are not able to overwrite them unless specified what this means is that when west chesterton fieldville goes through and is identified by the ner pipe as a person it can no longer be identified as anything else what this means is that you need to do one of two things give your ruler the ability to overwrite the ner or this is my personal preference put it before the ner
in the pipeline so let's go through and solve this common problem right now we're going to create a new nlp object called nlp2 which is going to be equal to spacy.load and again we're going to load in the english en_core_web_sm model great and again we're going to do ruler is equal to nlp2.add_pipe entity_ruler and we're going to make that an object too now what we can do is we can say ruler.add_patterns again we're going to go through all of these steps that we just went through we're going to add in those patterns that we created up above and now what we're going to do is we're going to actually do one thing a little different than what we did we're going to load this up again and we're going to add an extra keyword argument now we can say either after or before here we're going to say before ner what this is going to do is it's going to place our entity ruler before the ner component and now when we add our patterns in we can create a new doc object doc it's going to be equal to nlp2 text and we're going to say for ent in doc.ents print off ent.text and ent.label_ and now we notice that it is correctly labeled as a gpe why is this well let's take a look at our nlp2 object i'm going to analyze pipes and if we scroll down we will notice that our entity ruler now sits before the ner model in the pipeline in other words we've given primacy to our custom entity ruler so that it's going to have the first shot at actually correctly identifying these things but we've got another problem here deeds is coming out as a person it should be mr deeds as the entire multi-word token and that should be a new entity we can use the entity ruler to add in custom types of labels here so let's go ahead and do this same thing let's go ahead and just copy and paste our patterns and we're going to create one more nlp object we're going to call this nlp3 is equal to spacy.load en_core_web_sm
great we've got that loaded up we're going to do the same thing we did last time ruler is equal to nlp3.add_pipe entity_ruler and remember we're going to place it before the ner pipe there we go and what we need to do now is we need to copy in these patterns and we're going to add in one more pattern remember this can be a list here so for this pattern we're going to have a new label called film and we're going to look for the sequence mr deeds and that's going to be the pattern that we want to add in to our ruler so we can do ruler.add_patterns we're going to add in patterns remember that the one argument is going to be the list itself and now we can create a new doc object which is going to be equal to nlp3 i think i called it yep text and we can say for ent in doc.ents print off ent.text and ent.label_ and if we execute this we see now that not only have we gotten the entity ruler to correctly identify west chesterton fieldville we've also gotten the entity ruler to identify mr deeds correctly as a film now some of you might be realizing the problem here this is actually a problem for machine learning models and the reason for this is because mr deeds in some instances could be the person and mr deeds in other instances could be the movie itself this is what we would call a toponym it's spelled like this this is a common problem in natural language processing and it's actually one of many problems really that remain a little bit unsolved toponym resolution or tr is the resolution of toponyms so things that can have multiple labels that are dependent upon context another example of toponym resolution is something like this if you were to look at this word and let's ignore paris hilton let's ignore paris from greek mythology let's say it's only going to ever be a gpe the word paris could refer to paris france paris
kentucky or paris texas toponym resolution is also the ability to resolve problems like this when in context is paris talking about paris france when in context is it talking about kentucky and when in context is it talking about texas so that's something that you really want to think about when you're generating your rules for an entity ruler is this ever going to be a false positive and if the answer is that it's going to be a false positive half the time or it's a 50 50 shot then really consider incorporating that kind of an entity into a machine learning model by giving it examples of both mr deeds in this case as a film and mr deeds as a person so it can learn with word embeddings when that context means it's a film and when that context means it's a person that's just a little toy example what we're going to see moving forward though and we're going to do this with the matcher not with the entity ruler is that spaCy can do a lot of things you might be thinking to yourself now i could easily just come up with a list and just check and see whenever mr deeds pops up and just inject that into the doc.ents i could do the same thing with west chesterton fieldville why do i need an nlp framework to do this and the answer is going to come up in just a few minutes when we start realizing that spaCy can do a lot more than things like regex or things like just a basic gazetteer check or a list check what you can do with spaCy is you can have the pattern not just take a sequence of characters and look for a match but a sequence of linguistic features as well that earlier pipes have identified and i think it's best if we save that for just a second when we start talking about the matcher which is in my opinion one of the more robust things that you can do with spaCy and what sets spaCy apart from things like regex or other fancy string matching approaches okay we're now moving into chapter six of this book and this is really
in my opinion one of the most important areas in this entire video if you can master the techniques i'm going to show you for the next maybe 20 minutes or so maybe 30 minutes you're going to be able to do a lot with spaCy and you're really going to see its true power a lot of the stuff that we talk about here in the matcher can also be implemented in the entity ruler as well with a pattern the key difference between the entity ruler and the matcher is in how the data is extracted so the matcher is going to store information a little differently it's going to store the label within the vocab of the nlp model as a unique identifier or a lexeme spelled l e x e m e i'll talk about that more in just a second and it's not going to store matches in the doc.ents so matchers don't put things in your doc.ents so when do you want to use a matcher over an entity ruler you want to use the entity ruler when the thing that you're trying to extract is something that is important to have a label that corresponds to it within the entities that are coming out so in my research i use this for anything from let's say stocks if i'm working with finances if i'm working with holocaust data at the ushmm where i am a postdoc i'll try to add in camps and ghettos because those are all important to have annotated alongside other entities i'll also work in things like ships so the names of ships streets things like that when i use the matcher it's when i'm looking for something that is not necessarily an entity type but something that is a structure within the text that'll help me extract information and i think that'll make more sense as we go through and i show you some examples going through it we're going to use the matcher as you would in the real world but remember all the patterns that i show you can also be implemented in the entity ruler and i'm also going to talk about when we get
to chapter 8 how regex can actually be used to do similar things but in a different way essentially when you want to use the matcher or the entity ruler over regex is when linguistic components matter so the lemma of a word or identifying if the word is a specific type of an entity that's when you're going to want to use the matcher over regex and when you want to use regex is when you have a really complicated pattern that you need to extract and that pattern is not dependent upon specific parts of speech you're going to see how that works as we go through the rest of part two but keep that in the back of your mind so let's go ahead and take our work over to our blank jupyter notebook again so what we're going to do is we're going to just set up with a basic example we need to import spacy and since we're working with the matcher we also need to say from spacy.matcher import Matcher with a capital m very important capital m once we have this loaded up we can start actually working with the matcher and we're going to be putting the matcher in just a small english model and we're going to say nlp is equal to spacy.load and you should be getting familiar with this en_core_web_sm the small english model once we've got that loaded we can start actually working with the matcher so how do you create the matcher well the pythonic way to do this and the way it's in the documentation is to call the object matcher that's going to be equal to Matcher with a capital m so we're calling this class right here and now what we need to do is we need to pass in one argument this is going to be nlp.vocab we're going to see that we can add in some extra features here in just a little bit i'm going to show you why you'd want to add in extra features at this stage but we're going to ignore that for right now what we're going to try to do is we're going to try to find email addresses within a text a very simple task that's really not that difficult to do
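The email-matching task just described can be sketched in a few lines. This is a minimal illustration rather than the notebook's exact cell: it assumes a blank English pipeline instead of en_core_web_sm (LIKE_EMAIL is a lexical attribute, so no trained model is needed), and the label EMAIL_ADDRESS and the sample sentence are illustrative choices.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough here, because LIKE_EMAIL is
# computed from the token text alone and needs no trained components.
nlp = spacy.blank("en")

# The matcher shares the pipeline's vocab, where the label string is stored.
matcher = Matcher(nlp.vocab)

# One pattern = a list of token descriptions; here a single email-like token.
pattern = [{"LIKE_EMAIL": True}]

# add() takes a label and a list of patterns (a list of lists).
matcher.add("EMAIL_ADDRESS", [pattern])

doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)

# Each match is a (match_id, start_token, end_token) tuple.
for match_id, start, end in matches:
    label = nlp.vocab[match_id].text  # look the integer id up in the vocab
    print(label, start, end, doc[start:end].text)
```

Looking the integer id up through `nlp.vocab[match_id].text` recovers the label string, and slicing `doc[start:end]` recovers the matched span.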
we can do it with a very simple pattern because spaCy has given us that ability so let's create a pattern and that's going to be equal to a list which is going to contain a dictionary the first key in the dictionary is going to be the thing that you're looking for so in this case we have a bunch of different things that the matcher can look for and i'm going to be talking about all those in just a second but one of them is very handily this attribute of like_email so if the token is looking like an email and that's true then that is what we want to extract we want to extract everything that looks like an email and to make sure that this occurs we're going to say matcher.add and then here we're going to pass in two arguments argument one is going to be think of it as a label that we want to assign to it and this is what's going to be added into the nlp.vocab as a lexeme which we'll see in just a second and the next thing is a pattern and it's important here to note that this is a list the argument here takes a list of lists and because this is just one list right now i'm making it into a list so each one of these different patterns would be a list within a list essentially let's go ahead and execute that and now we're going to say doc is equal to nlp and i'm going to add in a text that i have in the textbook and this is my email address wmattingly@aol.com that might be a real email address i don't believe it is it's definitely not mine so don't try and email it and then we're going to say matches is equal to matcher doc and this is going to be how we find our matches we pass that doc object into our matcher class and now what we have is the ability to print off our matches and what we get is a list and this list is a set of tuples that will always have three indices so index zero is going to be this very long number what this is is this is a
lexeme spelled like this l e x e m e it's in the textbook and the next things are the start token and the end token so you might be seeing the importance here already what we can do with this is we can actually go into the nlp.vocab where this integer lies and find what it corresponds to so this is pretty cool check this out so you print off nlp.vocab so we're going into that vocab object we're going to index it with matches zero so this is going to be the first index so this tuple at this point and then we're going to grab index 0 so now we've gone into this list we've gone to index 0 this first tuple and now we're grabbing that first item there now what we need to do is we need to say dot text right here if we print this off we get email address that label that we gave it up there was added into the nlp.vocab with this unique lexeme that allows for us to understand what that number corresponds to within the nlp framework so this is a very simple example of how a matcher works and how you can use it to do some pretty cool things but let's take a moment let's pause and let's see what we can do with this matcher so if we go up into spaCy's documentation on the matcher we'll see that you've got a couple different attributes you can work with now we're going to be seeing this a little bit the orth this is the exact verbatim text of a token and we're also going to see text the exact verbatim text of a token what we also have is lower so what you can do here is you can use lower to say when the item is lowercased it looks like this and then give some lowercase pattern this is going to be very useful for capturing things that might be at the start of a sentence for example if you were to look for the penguin in the text anywhere you saw the penguin if you used a pattern that was just lowercase you wouldn't catch the penguin being at the start of a sentence it would miss it because the t would be capitalized by using lower you can
ensure that the pattern that you're giving it is going to match when the text is lowercased length is going to be the length of your token text is alpha is ascii is digit this is when your characters are either going to be alphabetical ascii characters so the american standard code for information interchange it's that 128-character set that america came up with when they started encoding text it's now largely replaced with utf-8 and is digit is going to look for something if it is a digit so think of each of these as a token so if the token is a digit then that counts in the pattern is lower is upper is title these should all be self-explanatory if it's lowercase if it's uppercase if it's a title so capitalized and if you don't understand what all these do right now i'm going to be going through and showing you in just a second i'm just giving you an overview of different things that can be included within the matcher or the entity ruler here so what we can also do is find if the token is actually the start of a sentence if it's like a number like a url like an email you can extract it and here is the main part i want to talk about because this is where you're really going to find spaCy outshines any other string matching system out there so what you can do is you can use the token's part of speech tag morphological analysis dependency label lemma and shape to actually make matches so not just matching a sequence of characters but matching a sequence of linguistic features so think about this if you wanted to capture all instances of a proper noun followed by a verb you would not be able to do that with regex there's not a way to do it you can't tell regex this is a verb regex is just a string matching framework it's not a framework for actually identifying linguistic features using them and extracting them so this is where we can leverage all the power of spaCy's
earlier pipes the tagger the morphological analysis the dep the lemma etc so the lemmatizer we can actually use all those things that have gone through the pipeline and the matcher can leverage those linguistic features and allow us to make really cool patterns that can match really robust and complicated things and the final thing i'm going to talk about is right here the op this is the operator or quantifier it determines how often to match a token so there's a few different things you can use here there's the exclamation mark negate the pattern requiring it to match zero times so in this scenario the token would never occur there's the question mark make the pattern optional allowing it to match zero or one times require the pattern to match one or more times with the plus and the asterisk the thing on the shift eight allow the pattern to match zero or more times there's other things as well that you can do to make this matcher a bit more robust but for right now let's jump into the basics and see how we can really kind of take these and apply them in a real world question so what i'm going to do is i'm going to work with another data set or another piece of data that i've grabbed off of wikipedia and this is the wikipedia article entry on martin luther king jr it's the opening few paragraphs let's print it off and just take a quick look and this is what it looks like you can go through and read it we're not too concerned about what it says right now we're concerned about trying to extract a very specific set of patterns what we're interested in grabbing are all proper nouns that's the task ahead of us somebody has asked us to take this text in extract all the proper nouns for me but we're going to do a lot more not just the proper nouns but we want to get multi-word tokens so we want to have martin luther king jr extracted as one token so the other things that we want to have are these kind of
structured in sequential order so find out where they appear and extract them based on their start token so let's go ahead and start trying to do some of these things right now scroll down here great so we need to create really a new nlp object now at this point so let's create a new one again we're just going to start working with the en_core_web_sm model if you're working with a different model like the large or the transformer you're gonna have more accurate results but for right now we're just trying to do this quickly for demonstration purposes so again just like before we're creating the matcher with nlp.vocab and then we're going to create a pattern so this is the pattern that we're going to work with we want to find any occurrence of a pos part of speech that corresponds to proper noun the way in which pos labels proper nouns is propn and we should be able to with that extract all proper nouns so we can say matcher.add and we're going to say proper noun and that's going to be our pattern and then what we can do just like before we're going to create the doc object this is going to be nlp text and then we're going to say matches is equal to matcher doc so we're going to create the matches by passing that doc object into our matcher and then we're gonna print off the length of the matches so how many matches were found and we're gonna say for match in matches and we're just gonna grab the first 10 because i've done this and there's a lot and you'll see why let's print off in this case match and then we're going to print off specifically what that text is remember the output is the lexeme followed by the start token and the end token which means we can go into the doc object and we can set up something like this we can say match 1 so index 1 which is the start token and match 2 which is the end token and that'll allow us to actually index what these words are and when we do this we can see all these printed out so this
is the match the lexeme here which is going to be proper noun all the way down we've got the zero here which corresponds to the start token the end token and this is the token that we extracted martin luther king jr michael king jr we've got a problem here right so the problem should be pretty obvious right now and the problem is that we have grabbed all proper nouns but these proper nouns are just individual tokens we haven't grabbed the multi-word tokens so how do we go about doing that well we can solve this problem by let's go ahead and just copy and paste all this from here and we're going to make one small adjustment here we're going to change this to op with a plus so what does that mean well let's pop back into the matcher documentation under spacy and check it out so op remember is the operator or quantifier we're going to use the plus symbol so it's going to look for a proper noun that occurs one or more times so in theory right this should allow us to grab multi-word tokens it's going to look for a proper noun and grab as many as there are so anything that occurs one or more times if we run this though we see a problem we've gotten martin we got martin luther what we got luther what we got martin luther king luther king king martin luther king jr what what is going on here well you might already have figured it out it has done exactly what we told it to do it's grabbed every sequence of tokens that were proper nouns that occurred one or more times it just so happens some of these overlap so that's doc 0 to 1 0 to 2 so you can see the problem here is it's grabbing all of these in any combination of them what we can do though is we can add an extra layer to this so let's again copy what we've just done because it was almost there it was good but it wasn't great we're going to do one new thing here when we add in the patterns we're going to pass in the keyword argument greedy we're going to say longest all capital letters here and if we
execute that it's going to look for the longest match out of that mix and it's going to make that one the only match that it extracts we notice that our length has changed from what was it up here 175 to 61 so this is much better however we should have recognized right now another problem what have we done wrong well what we've done wrong is these are all out of order in fact what happens is when you do this i don't have evidence to support this but i believe it's right what will always happen is the greedy longest will result in all of your tokens or all your matches being organized from longest to shortest so if we were to scroll down the list and look at maybe negative one let's do negative 10 on you'll see single word tokens and again this is me just guessing but i think based on what you've just seen that's a fairly good guess so let's go ahead and just cut this so we can see what the output is here so how would you go about organizing these sequentially well this is where really kind of a sort comes in handy when you can pass a lambda to it let's go back and copy all this again because again we almost had this right here we're going to sort our matches though we can say matches dot sort this is going to take a keyword argument of key which is going to be equal to lambda and the lambda is going to be called on every item x in the list and we're going to say to sort by x 1 so what this is it's a list of tuples and what we're using lambda for is we're going to say sort this whole list of tuples but sort it by index 1 in other words sort it by the start token when we execute that we've got everything now coming out as we would expect we've got 0 to 4 6 to nine so we actually are extracting these things in sequential order as they appear in our text so that's how you can actually go through and sort the output of the matcher
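The proper-noun steps described above (a PROPN pattern, the plus operator, greedy set to LONGEST, then a sort on the start token) can be sketched in a few lines. This is a minimal sketch and not the code from the video: so that it runs without downloading a model, it builds the Doc by hand with assumed part-of-speech tags instead of running en_core_web_sm, and the sentence and the rule name PROPER_NOUNS are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built Doc (assumption): the POS tags below are typed in by hand and
# stand in for what a trained tagger would normally assign.
words = ["Martin", "Luther", "King", "Jr.", "was", "a", "minister", "in", "Atlanta"]
pos = ["PROPN", "PROPN", "PROPN", "PROPN", "AUX", "DET", "NOUN", "ADP", "PROPN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# One or more consecutive proper nouns; LONGEST drops the overlapping
# shorter matches such as "Martin" or "Luther King".
matcher.add("PROPER_NOUNS", [[{"POS": "PROPN", "OP": "+"}]], greedy="LONGEST")

matches = matcher(doc)
# Each match is (match_id, start_token, end_token); sort by start token.
matches.sort(key=lambda m: m[1])

spans = [doc[start:end].text for _, start, end in matches]
label = nlp.vocab.strings[matches[0][0]]  # integer match id back to its name
print(label, spans)
```

Sorting after matching keeps the two concerns separate: the greedy filter decides which matches survive, and the sort decides the order they are reported in.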
but what if the person who kind of gave us this job they were happy with this but they came back and said okay that's cool but what we're really interested in what we really want to know is every instance where a proper noun of any length so grab the multi-word token still but we want to know anytime that occurs before a verb so anytime this proper noun is followed by a verb so what we can do is we can add in okay we can do this we're going to have a comma here so the same pattern is going to be a sequence now it's not just going to be one thing we're going to say token one needs to be a proper noun and grab as many of those tokens as you can so one or more times and then after those are done comma this is where the next thing has to occur pos so the part of speech needs to be a verb so the next thing that comes after needs to be a verb and we want that to be the case when we do this we can kind of go through and see the results so the first instance of this where a proper noun is followed by a verb comes in token 50 to 52 king advanced 258 director j edgar hoover considered now we're able to use those linguistic features that make spacy amazing and actually extract some vital information so we've been able to figure out where in this text a proper noun is followed by a verb so you can already start to probably see the implications here and we can create very elaborate things with this we can use any of these in as long of a sequence as you can imagine we're going to work with a different text and kind of demonstrate that it's a fun toy example i've got a halfway cleaned copy of alice in wonderland stored as a json file i'm going to load it in right now and then i'm going to just grab the first sentence from the first chapter and what we have here is the first sentence so here's our scenario somebody has asked us to grab all the quotation marks and try to identify the person who's doing
the speaking or the thinking in other words we want to be able to grab alice thought now i picked alice in wonderland because of the complexity of the text not complexity in the sense of the language used it's a children's book but complexity in the syntax the syntax is highly inconsistent lewis carroll not c s lewis was highly inconsistent in how he structured these kind of sequences of quotes and the other thing i chose to do is i left in one mistake here and that is this non-standardized quotation mark so remember when you need to do this things need to match perfectly so we're going to replace this first things first is to create a cleaner text where we do text equals text dot replace and we're going to replace the instance of i believe it's that mark but let's just copy and paste it in to make sure we're going to replace that with a single quotation mark now we can print off text just to make sure that that was done correctly cool great it was it's now looking good remember whenever you're doing information extraction standardize the texts as much as possible things like quotation marks will always throw off your data now that we've got that let's go ahead and start trying to create a fairly robust pattern to try to grab all instances where there is a quotation mark thought something like this and then followed by another quotation mark so the first thing i'm going to try and do is i'm going to try to just capture all quotation marks in a text so let's go through and try to figure out how to do that right now so we're going to copy in a lot of the same things that we used up above but we're going to make some modifications to it let's go ahead and copy and paste all that we're going to completely change our pattern so let's get rid of this so what are we looking for well first of all the first thing that's going to occur in this pattern is this quotation mark so that's going to be a full text match which is an orth if you remember and we're
going to have to use double quotation marks to add in that single quotation mark so that's what we grab first we're going to look for anything that is an orth and the next thing that's going to occur after that i think this is good to probably do this now on a line by line basis so we can keep this straight so the next thing that's going to occur is we're looking for anything in between so anything that is an alpha character we're gonna just grab it all so is alpha and then we need to say true but within this we need to specify how many times it occurs because if we just say is alpha true it's just going to look at the next token in this case and and then say that's the end that's it that's the pattern we've got to extract it but we want to grab not just and but and what is the use of a everything so we need to grab not only that but we say op so our operator again and if you said plus you would be right here we need to make sure that it's a plus sign so it's grabbing everything now in this scenario this is a common construct when you have an interjection here in the middle of the sentence so thought or said and it's the character doing it it's oftentimes got a comma right here so we need to add in that kind of a feature so there could be is punct there could be a punct here and we're going to say that that is equal to true but that might not always be the case there might not always be one there so we're going to say op is equal to a star if we go back we'll see why if we go back to our op the star allows the pattern to match zero or more times so in this scenario the punctuation may or may not be there so that's the next thing that occurs once we've got that the last thing that we need to match is the exact same thing that we had at the start is this orth up here and that's our sequence so this is going to look for anything that starts with a quotation mark has a series of alpha characters has a punctuation like a comma possibly and then closes the quotation marks if
we execute this we succeeded we got it we extracted both matches from that first sentence there are no other quotation marks in there but our task was not just to extract this information our task was also to match who is the speaker now we can do this in a few different ways and you're going to see why this is such a complicated problem in just a second so let's go ahead and do this how can we make this better well we're going to have this occur twice but in the middle we need to figure out when somebody is speaking so one of the things that we can do is we can make a list so let's make a list of lemmatized forms of our verbs so we're going to say let's call this speak underscore lemmas it's going to be equal to a list and the first thing we're going to say is think because we know that think is in there and say these are the lemmatized forms of thought and said so what we can do now is after that occurs let's add in a new thing we're going to be able to now add in a new pattern that we're looking for it's not just the start of a quotation mark not just the end of a quotation mark but also a sequence that'll be something like this so it's going to be a part of speech so it's going to be a verb that occurs first right and that's going to be a verb but more importantly it's going to be a lemma that is in what did i call these speak lemmas so let's break this down the next token needs to be a verb and it needs to have a lemmatized form that is contained within the speak lemmas list so if it's got that fantastic let's execute this and see what happens we should only have one hit cool we do so we've got that first hit and the second one hasn't appeared anymore because that second quotation wasn't followed by a verb let's go ahead and make some modifications so that we can improve this a little bit because we want to know not just what that person's doing we also need to know who the speaker is so let's grab it let's figure out who that speaker is so we
can use part of speech again another feature here we know that it's going to be a proper noun because oftentimes proper nouns are doing the speaking sometimes it might not be sometimes it might be like the girl or the boy lower case but we're going to ignore those situations for just right now so we're looking for a proper noun remember proper nouns as we saw just a second ago could be multiple tokens so we're going to say op plus so it could be a sequence of tokens let's execute this now we've captured alice here as well so and what is the use of a book thought alice now we know who the speaker is but this is a partial quotation this is not the whole thing we need to grab the other quote how will we ever do that well we've already solved that we can copy and paste all of this that we already have done right down here and now we've successfully extracted that entire quote so you might be thinking to yourself yeah we did it we can now extract quotation marks and we can even extract any instance where there's a quote and somebody speaking not so fast let's try to iterate over this data so we're going to say for text in data 0 2 so we're going to iterate over the first chapter and we're going to go ahead and let's do all of this doc is going to be equal to that sort that out and then again we're going to be printing out this information the same stuff i did before just now it's going to be iterating over the whole chapter and if we let this run we've got a serious serious problem and it doesn't actually grab us anything nothing has been grabbed successfully what is going on we've got a problem and that problem stems from the fact that we don't have our text cleaned correctly we're not replacing that quotation mark that was the problem up above so we're gonna add this bit of code in and we're gonna be able to fix it so now when we execute this we see that we've only grabbed one
match now you might be thinking to yourself there's an issue here and there is let's go ahead and print off the length of matches and we see that we've only grabbed one match and then we haven't grabbed anything else well what's the problem here are there no other instances of quotation marks in the rest of the first chapter and the answer is no there absolutely are other quotation marks in other paragraphs from the first chapter the problem is that our pattern is singular it's not multi-varied we need to add in additional ways in which a text might be structured so let's go ahead and try and do this with some more patterns i'm going to go ahead and copy and paste these in from the textbook so you'll be able to actually see them at work and so what i've done is i've added in more patterns pattern two and pattern three allow for instances like this well thought alice so an instance where there's a punctuation but there's no quotation following after this and then which certainly said before an instance where there's a comma followed by that so we've been able to capture more variants and more ways in which quotation marks might exist followed by the speaker now this is where being a domain expert comes into play you'd have to kind of look through and see the different ways that lewis carroll structures quotation marks and write out patterns for capturing them i'm not going to go through and try to capture everything from alice in wonderland because that would take a good deal of time and it's not really in the best interest because it doesn't matter to me at all what i encourage you to do if this is something interesting to you is try to apply it to your own texts different authors structure quotation marks a little differently the patterns that i've got written here are a good starting point but i would encourage you to start playing around with them a little bit more and what you can do is when you actually
have this match extracted you know that the instance of a proper noun that occurs between these quotation marks or after one is probably going to be the person or thing that is doing the speaking or the thinking so that's kind of how the matcher works it allows for you to do these things these robust type data extractions without relying on the entity ruler and remember you can use a lot of these same things with an entity ruler as well but in this case we don't want things like this to be labeled as entities we want them to just be separate things that we can extract outside of the doc dot ents that's going to be where we conclude our chapter on the matcher in the next section of this video we're going to be talking about custom components in spacy which allow for us to do some pretty cool things such as add in special functions that allow for us to kind of do different custom permutations on our data with components that don't exist so an entity ruler would be a component that does exist these are components that don't exist within the spacy framework so add in custom things like an entity ruler that do very specific things to your data hello we're now moving into a more advanced aspect of the textbook specifically chapter 7 and that's working with custom components a good way to think about a custom component is something that you need to do to the doc object or the doc container that spacy can't do off the shelf you want to modify it at some point in the pipeline so i'm going to use a basic toy example that demonstrates the power of this let's look at this basic example that i've already loaded into memory it's two sentences that are in the doc object now and that's britain is a place mary is a doctor so let's do for ent in doc dot ents print off ent dot text and ent dot label and we see what we'd expect britain is gpe a geopolitical entity mary is a person that's fantastic but i've just been told by somebody higher up that
they want the model to never ever give anything as gpe or maybe they want any instance of gpe to be flagged as loc so all the different locations all have loc as a label or we just want to remove them entirely so i'm going to work with that latter example we need to create a custom pipe that removes all instances of gpe from the doc.ents container so how do we do that well we need to use a custom component we can do this very easily in spacy by saying from spacy.language import language capital l very important there capital l now that we've got that class loaded up let's start working with this what we need to do first is we need to use a decorator so the at symbol and we need to say at language dot component and we need to give that component a name we're going to say in this case let's say remove gpe and now we need to create a function to do this so we're going to call this remove gpe i always kind of keep these as the same that's my personal preference and this is going to take one thing that's going to be the doc object so the doc object think about how it moves through the pipeline this component is another pipe in that pipeline it needs to receive the doc object and send off the doc object you could do a lot of other things it could print off entity found it could do really any number of things it could add stuff to the data coming out of the pipeline all we're concerned with right now is modifying the doc.ents so we can do something like this we can say original ents is equal to a list of the doc.ents so now remember we have to convert the ents from a tuple into a list now what we can do is we can say for ent in doc.ents if the ent dot label so if that label is equal to gpe then what we want to do is we just want to remove it so let's say original ents dot remove and we're going to remove the ent remember it's now a list oop sorry i executed that too soon remember it's now a list so what we can do is we can go ahead now and convert
those original ents back into doc dot ents by saying doc dot ents equals original ents and if we've done things correctly we can return the doc object and it will have all of those things removed so this is what we would call a custom component something that changes the doc object along the way in the pipeline but we need to add it to nlp so we can do nlp dot add pipe we want to make sure that it comes after the ner so we're just going to say add the pipe remove gpe which corresponds to the component name and now let's go ahead and do nlp.analyze_pipes and you'll be able to see that it sits at the end of our pipeline right there remove gpe now comes time to see if it actually works so we're going to copy and paste our code from earlier up here let's go ahead and copy this and now we're going to say for ent in doc dot ents print off ent.text and ent dot label and we should see as we would expect just mary coming out our pipeline has successfully worked now as we're going to see when we move into regex you can do a lot of really really cool things with custom components i'm going to kind of save the advanced features for i think i've got it scheduled for chapter 9 in our textbook this is just a very very basic example of how you can introduce a custom component to your spacy pipeline if you can do this you can do a lot more you can maybe change different entities so they have different labels you can make it where gpes and locs all agree you can remove certain things you can have it print off place found person found you can do a lot so really the sky's the limit here but a lot of the times you're going to need to modify that doc object and this is how you do it with a custom pipe so that you don't have to write a bunch of code for a user outside of that nlp object that nlp object once you save it to disk by doing something like nlp.to_disk data new en_core_web_sm it's going to actually be able to go to the disk and be saved with everything but one thing
that you should note is that the component that you have here is not automatically saved with your data so in order for your component to actually be saved with your data you need to store that outside of this entire script you need to save it as a library that can be given to the model when you go to package it that's beyond the scope of this video for right now in order for this to work in a different jupyter notebook if you were to try to use this this component has to actually be in the script when it comes time to package your model your pipeline and distribute it that's a different scenario in that scenario you're going to make sure that you've got a special my component.py file with this bit of code in there so that spacy knows how to handle your particular data it's now time to move on to chapter eight of this textbook and this is where spacy gets really interesting you can start applying regular expressions into a spacy component like an entity ruler or a custom component as we're going to see in just a moment with chapter 9.
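The remove gpe component from chapter 7 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the exact code from the video: it uses a blank pipeline with no ner, so the two entities are assigned by hand to stand in for what a trained ner pipe would produce, and it filters with a list comprehension instead of the copy-then-remove loop described above.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span


@Language.component("remove_gpe")
def remove_gpe(doc):
    # Keep every entity except those labelled GPE, then write the result back.
    doc.ents = [ent for ent in doc.ents if ent.label_ != "GPE"]
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("remove_gpe")  # added by its registered name, at the end of the pipeline

# Hand-built Doc with hand-assigned entities (assumption: these stand in for
# the output of an ner pipe, which a blank pipeline does not have).
words = ["Britain", "is", "a", "place", ".", "Mary", "is", "a", "doctor", "."]
doc = Doc(nlp.vocab, words=words)
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 5, 6, label="PERSON")]

doc = remove_gpe(doc)  # call the component directly on the prepared Doc
remaining = [(ent.text, ent.label_) for ent in doc.ents]
print(remaining)
print(nlp.pipe_names)
```

In a pipeline loaded from a trained model the same add pipe call would place the component after ner, so the filtering happens automatically every time text is processed.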
i am not going to spend a good deal of time talking about regular expressions i could spend five hours talking about regex and what all it can do in the textbook i go over what you really need to know which is what regular expressions are which is a way to do really robust string pattern matching i talk about the strengths of it the weaknesses of it its drawbacks how to implement it in python and how to really work with regex but this is a video series on spacy what i want to talk about is how to use regex with spacy and so let's move over to a jupyter notebook where we actually have this code to execute and play around with if we look here we have the same example that we saw before what my goal is is not to extract the whole phone number rather try to grab this sequence here and we do this with a regular expression pattern what this says is it tells it to look for a sequence of characters like this it's going to be three digits followed by a dash followed by four digits if i were to execute this whole code nothing is printed out does that mean that i failed to write good regex no it does not at all it's failed for one very important reason and this is the whole reason why i have this chapter in here is that regex when it comes to pattern matching in the matcher or the entity ruler only really works for single tokens you can't use regex across multi-word tokens at least as of spacy 3.1 so what does that mean well it means that that dash right there in our phone number is causing all kinds of problems if we move down to our second example it's gonna be the exact same pattern a little different let me go ahead and move this over so you can see it a bit better it's going to be regex that looks like this where we just look for a sequence of five digits we execute that we find it just fine and the reason for that is because this does not have a dash so regex if you're familiar with it if you've worked with it it's very powerful
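The single-token limitation described above can be illustrated with a short sketch. The tokens are supplied by hand here, which is an assumption made so the example does not depend on how any particular tokenizer splits the text: the hyphenated number is given as three tokens, which is the situation the phone number example runs into. The helper name find and the rule name NUMBER are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-tokenized on purpose (assumption): the hyphenated number is split
# into three tokens, while the five-digit number stays a single token.
words = ["call", "555", "-", "5555", "or", "send", "it", "to", "55555"]
doc = Doc(nlp.vocab, words=words)


def find(token_regex):
    """Return the text of every single token whose text matches token_regex."""
    matcher = Matcher(nlp.vocab)
    matcher.add("NUMBER", [[{"TEXT": {"REGEX": token_regex}}]])
    return [doc[start:end].text for _, start, end in matcher(doc)]


# The REGEX operator is tested against one token at a time, so a pattern
# that needs the dash cannot match when the dash is its own token.
across_tokens = find(r"^\d{3}-\d{4}$")
single_token = find(r"^\d{5}$")
print(across_tokens)
print(single_token)
```

The first pattern finds nothing because no single token contains three digits, a dash and four digits, while the second pattern matches the five-digit token on its own.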
you can do a lot of cool things when you're going to use this in python if you're using just the standard off the shelf components so the entity ruler the matcher you're going to be using this when you want to match regex to a single token so think about this if you're looking for a word that starts off with a capital d and you want to just grab all words that start with a capital d that would be an example of when you would want to use it in a standard off-the-shelf component but that's not all you can do in spacy you can use regex to actually capture multi-word tokens so capture things like mr deeds so any instance of mr period space name a sequence of proper nouns but in order to do that you have to actually understand how to add in a custom component for it and we're going to be seeing that in just a second as we move on to chapter 9 which is advanced regex if you're not familiar with regex at all take a few minutes read chapter eight i encourage you to do so because i go over in detail and i talk about how to actually engage in regex in python and its strengths and weaknesses what i want you to really focus on though and take away from all this is how to do some really complex multi-word token matching with regex remember you're going to want to use regular expressions when the pattern matching that you want to do is independent of the lemma the pos or any of the linguistic features that spacy is going to use if you're working with linguistic features you have to use the spacy pattern matching things like the morph the orth the lemma things like that but if your sequence of strings is not dependent on that so you're looking for any instance of in this case we're going to talk about in just a second a case where paul is followed by a capitalized letter and then a word break then you're going to want to use regular expressions because in this case this is independent of any linguistic features and regular
expressions allow you to write much more robust patterns much more quickly if you know how to use them well and they allow you to do much more quick robust things within a custom component and that's going to be where we move to now now that we know a little bit about regex and how it can be implemented in python and also in spacy let's go ahead and try and see how we can get regex to actually find multi-word tokens for us within spacy using everything in the spacy framework so the first thing i'm going to do to kind of demonstrate all this is i'm going to import regex this comes standard with python and you can import it as re just that way import re and that's going to import the re module i'm going to work from the textbook and work with this sample text so this is paul newman was an american actor but paul hollywood is a british tv host the name paul is quite common so it's going to be the text that we work with throughout this entire chapter now a regex pattern that i could write to capture all instances of things like paul newman and paul hollywood which is what my goal is could look something like this i could say r and make a raw string here and say paul and then i'm going to grab everything that starts with a capital letter and then i grab everything until a word break and that's going to be a pattern that i can use in regex what this formula means is find any instance of paul followed by in this case a capital letter until the actual word break so grab the first name paul and then what we can make a presumption is going to be that individual's last name in the text a simple example but one that will demonstrate our kind of purpose right now so how we can do this is we can create an object called matches and use re dot finditer we can pass in the pattern and we can pass in the text so what this is going to do is it's going to use regex to try to find this pattern within this text and then what we can do is we can iterate over
those matches. So for match in matches, we can print off the match, and we get something that looks like this. What we're looking at here is what we would call a regex match object. It's got a couple of different components: a span, which tells us the start character and the end character, and a match, which is the actual text itself. So the match here is Paul Newman and the match here is Paul Hollywood. We've been able to extract the two entities in the text that begin with Paul and have a proper last name structured with a capital letter, and we grabbed everything up until the word break. That's great, and that's what you need to know going forward, because what we're going to do now is implement this in a custom spaCy pipe. But first let's go through and write the code so that we can easily create the pipe afterwards. We need to import spacy, and we also need to say from spacy.tokens import Span. We're going to import a couple of other things as we move forward, because we're going to make a couple of mistakes intentionally, and I'm going to show you how to address these common mistakes that surface when you try to do something like this. Once we've imported those two things, we can start writing out our code. We're going to stick with the exact same text and the exact same pattern that we've got stored in memory up above. What we need to do now is create a blank spaCy pipeline that we can put all this information into. For right now we're just going to go through and look at these individual entities. So again we create the doc object, which is going to be equal to nlp(text). This is not going to be necessary for right now, but I'm establishing a consistent
workflow for us and you're going to see how we kind of take all this and implement it inside of a pipeline so we're going to say original ins is equal to list doc dot ends now in this scenario there's not going to be any entities because we don't have an nar or an entity ruler in our blank spacey pipeline what we're going to do next is we're going to create something called an nwt and and that's going to stand for multi-word token entity you can name this whatever you like this is just what i kind of stick to and then we're going to do and this is straight from the spacey documentation we're going to say from match and re.find it the same thing that we saw above pattern doc.text so what this is going to do is it's going to take that doc object look at it as raw text because remember the doc object is a container that doesn't actually have raw text in it until you actually call the dot text attribute and then our goal is for each of these things we're going to look and call in this span so we're going to say is start and the end is equal to match dot span so what we're doing here is we're going in and grabbing the span attribute and we're grabbing these two components the start and the end but we have a problem these are character spans remember the doc object works on a token level so we've got to kind of figure out a way to reverse engineer this almost to actually get this into a spacey form fortunately the doc object also has an attribute called character span so what we can do is we can say the span is equal to doc dot char span start and end so what this is going to do is it's going to print off essentially for us let's go ahead and do that it would print off for us where we were to actually have an entity here it would print off for us as we can see paul newman and paul hollywood so what we need to do now is we need to get this span into our entities so what we can do is instead of printing things off we can say if span is not none because in some instance 
instances this will be the case, you say mwt_ents.append, and you append a tuple: span.start, span.end, span.text. So this is the start, the end, and the text itself. Once we've done that, we've managed to get our multi-word tokens into a list that looks like this: start, end, Paul Newman, Paul Hollywood. Notice that span.start is no longer aligned with a character offset; it's aligned with a token index. What we've done is take the character span and find where it starts and ends within the token sequence. So we have 0 and 2 for Paul Newman: it starts at index zero and goes up until the second index, so it grabs token zero and token one. We've done the same thing with Paul Hollywood. Now that we've got that data, we can start to inject these entities into our original entities. So let's do that right now. We say for ent in mwt_ents, and then start, end, name is equal to ent, because this corresponds to the tuple: the start, the end, and the entity text. Now we say per_ent, which is going to be the individual entity, and we create a Span object in spaCy. It looks like this, with a capital S; remember, we imported it right up here. This is where we work with the Span class, and it creates for us a Span object that we can safely inject into the doc.ents list. So we say Span(doc, start, end, label=...), and the label is the one we want to assign, which is PERSON in this case because these are all people. Now we can inject this into the original entities: original_ents.append, and we append the per_ent, which is going to be
this Span object. Finally, we say doc.ents is equal to original_ents, like what we saw just a few moments ago. Let's go ahead and print it off: we've got our entities right there. Were we to do this up here, when we first create the doc object, you'd see nothing, an empty list. But now we've been able to inject these into the doc.ents attribute, and we can say for ent in doc.ents, just like everything else, ent.text and ent.label_. Because we converted each match into a Span, we were able to inject it into the entity attribute of the doc object natively, so that spaCy can understand it. So what can we do with this? Well, one of the things we can do is use the knowledge we just acquired about custom components and build a custom component around all of this. How might we do that? Let's try it out. The first thing we need to do is import our Language class. Remember from a few moments ago: whenever you need to work with a custom component, you need to say from spacy.language import Language, with a capital L. What we're going to do now is take the code we just wrote and convert it into an actual custom pipe that can fit inside of our pipeline, as our own custom entity ruler, if you will. So we use the @Language.component decorator and call it paul_ner, something not too clever but very descriptive. This takes that single doc object, because remember, this pipe needs to receive the doc object and do stuff to it. We can take all the code that we just wrote, from here down, and paste it into our function, and what we have now is the ability to implement this as a custom pipe. We don't need this part, because we don't want to print things off, but here we're going to return the
doc object. What we have now is a custom kind of entity ruler that uses regex across multiple tokens. If you want to use regex in spaCy across multiple tokens, as of spaCy 3.1 this is the only way to implement it. Now we can take this pipe and add it to a blank custom model. Let's make a new nlp, call it nlp2, equal to spacy.blank, and we create a blank English model. Then nlp2.add_pipe, and we add in paul_ner, and we see that we've created that successfully, so we have one pipe sitting in the pipeline. We should probably add in our pattern here as well, just for good practice, because it should be stored somewhere adjacent. I sometimes like to keep it up here when I'm doing this, but you can also keep it inside of the function itself. Let's save that and rerun this. Cool. Now we can say doc2 is equal to nlp2 over that exact same text, and we print off doc2.ents. We've now managed to implement that as a custom spaCy pipe. But we've got one big problem. Let's say, hypothetically, we wanted this pipeline to sit on top of an existing spaCy model, and we want to keep Paul Hollywood as a person, but we also want to find other cinema-style entities. So we're going to make a new component down here that just looks for any instance of Hollywood, and we're going to give that the label CINEMA. I want to demonstrate this because it's going to show you something that you are going to encounter when you try
to implement this in the real world, and I'm going to show you how to address the problem that you're going to encounter. So if we had a component that looked like this, it's now going to look for just instances of Hollywood. Let's call this cinema_ner, and change this here as well. Now we can load that up into memory, so we've got this new component called cinema_ner. Just like before, we create nlp3, and this time it's going to be spacy.load en_core_web_sm, which loads up the spaCy small model. Then nlp3.add_pipe, and it's going to be, what did I call this again, the cinema_ner. If we were to add that and create a new object called doc3, equal to nlp3(text), we get this error. This is a common error, and if you google it you'll eventually find the right answer; I'm just going to give it to you right now. What this is telling you is that there are spans that overlap, and that doesn't work. One of the spans, for CINEMA, is Hollywood, but the small model is also extracting Paul Hollywood as a longer entity that contains the same token. So what's happened here is we're trying to assign two entity spans to the same token, and that doesn't work in spaCy; it'll break. So what can you do? A common method of solving this issue is to work with filter_spans from spacy.util. Let's do this right now: from spacy.util import filter_spans. What filter_spans lets you do is filter out overlapping spans from the ones being identified. At this stage, before you get to the doc.ents assignment, you say filtered is equal to filter_spans(original_ents). What does this do? It goes through and looks at all of the different start and end positions from all of your entities, and if there is ever an instance where there is
an overlap of tokens, so 8 to 10 and 9 to 10, primacy and priority are given to the longer span. So we can set this now to filtered, and it helps if you spell it correctly, filtered, there we go. We set doc.ents to filtered instead of the original entities. Save that, add this again, and do doc3, and we say for ent in doc3.ents, print ent.text and ent.label_. If we've done this correctly, we're not going to see the CINEMA label come out at all, because Paul Hollywood is a longer span than just Hollywood. What we've done is tell spaCy to give primacy to the longer spans and assign that label. By filtering the spans, you can prevent that error from ever surfacing, and this is a very common thing that you're going to have to implement. Sometimes regex really is the easiest way to inject and do pattern matching in the entities. Okay, so here's the scenario that we have before us. In order to make this live coding and applied spaCy a little more interesting, imagine we have a client, and the client is a stockbroker, somebody who's interested in investing. What they want to be able to do is look at news articles, like those coming out of Reuters, and find the articles that are the most relevant to what they need to search for and read for the day. So they want to find the ones that deal with their personal stocks, their holdings, or maybe the specific index that they're interested in. What this client wants is a way to use spaCy to automatically find all companies referenced within a text, all stocks referenced within a text, all indexes referenced within the text, and maybe even some stock exchanges as well. Now, in the textbook, if you go to this chapter, which is number 10, you're going to find all the solutions laid out for you. What I'm going to do throughout the next 30 or
40 minutes is walk through how I might solve this problem, at least on the surface. This is going to be a rudimentary solution that demonstrates the power of spaCy and how you can apply it in a very short period of time to do some pretty custom tasks, such as financial analysis. With the structured data that you've extracted, you can then do any number of things. What we're going to start off with, though, is importing spacy and importing pandas as pd. If you're not familiar with pandas, I've got a whole tutorial series on that on my channel, Python Tutorials for Digital Humanities. Even though it has digital humanities in the title, it's for everyone, so go through and check that out if you're not familiar with pandas. You're not really going to need it for this video; you just need to understand that I'm using pandas to grab the data I need from a couple of CSV files, or comma-separated value files, that I have. The first thing we need to do is create what's known as a pandas DataFrame. This is going to be equal to pd.read_csv, and I have these files stored in the data subfolder in the repo. You have free access to these. They're little tiny data sets that I cultivated pretty quickly; they're not perfect, but they're good enough for our purposes. We use the separator keyword argument to say separate everything out by tab, because these are actually TSV files, tab-separated value files. And we have something that looks like this. What this stocks.tsv file contains is all the symbols, company names, industries, and market caps for, I think it's around 5,700 different stocks, 5,879. What we're going to use this for is as a way to start working all these different symbols and company names into an EntityRuler. We want to use these symbols as a way to grab stocks that might be referenced, and you can already probably start to see a problem with this
capital A here; we're going to get to that in a little bit. And we want to grab all the company names, so we can create two different entity types from this data set: STOCK and COMPANY. Let's go through and make these into lists so they're a little more manageable. We need to create a list of symbols, and that's going to be equal to df.Symbol.tolist(). This is a great way to do it in pandas, so you can easily convert all these different columns into lists that you can work with in Python. Then companies is going to be equal to df.CompanyName, I believe the name was, .tolist(). Just to demonstrate how this works, let's print off symbols, up to 10. You can see we've managed to take these columns and build them into simple Python lists. So what can we do with that? Well, one of the things we can do is use that information to start cultivating an EntityRuler. But remember, we want more than just one or two kinds of entity in the ruler. We don't just want stocks and companies; we also want things like indexes. We're going to get to that in just a second, though. For right now, let's try to work these two things into an EntityRuler. How might we go about doing that? As you might expect, we're going to create a fairly simple one. We say nlp is equal to spacy.blank. We don't need a lot of fancy features here; we're just going to have a blank model that hosts a single EntityRuler. ruler is going to be equal to nlp.add_pipe, and this is going to be entity_ruler. Now we need to come up with a way to go through all of these different symbols and add them in. So we say for symbol in symbols, patterns.append, and we make an empty list of patterns up here,
and what we're going to append is that dictionary that you met when we talked about the EntityRuler, I believe it was chapter 5. It has two things: a label, which is going to be STOCK in this case, and a pattern, which is going to be the symbol itself. That lets us go through and easily create and add these patterns. We can do the same thing for company. Remember, it's never a good idea to copy and paste in your code; I am simply doing it for demonstration purposes right now. This is not polished code by any stretch of the imagination. So we do the same thing here: loop over the different companies and add each company in. What this is doing is creating a large list of patterns that the EntityRuler will use as we create a doc object over that sample Reuters text I showed you a second ago, which we should probably go ahead and pull up right now. I'm going to copy and paste it straight from the textbook. Let's execute that cell and add in this text. It's a little lengthy, but it'll be all right. Now we create a doc object over all of that, and our goal here is to be able to say for ent in doc.ents and have extracted all of these different entities. So we print off ent.text and ent.label_, and let's see if we succeeded. And we have to add our patterns to our EntityRuler. Remember, we can do this by saying ruler.add_patterns(patterns). There we go, that's what this error actually means. Now when we do it, we see that we've been able to extract Apple as a company, Apple as a company, Nasdaq; everything's looking pretty good. But I notice really quickly that I wasn't able to extract Apple as a stock, and I've also got another problem.
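The EntityRuler steps described up to this point can be sketched roughly as follows. This is a minimal sketch, not the course's exact code: the `symbols` and `companies` lists are small hypothetical stand-ins for the columns of stocks.tsv, and the sample sentence is invented for illustration rather than taken from the Reuters text.

```python
import spacy

# Hypothetical stand-ins for the symbol and company-name columns of stocks.tsv.
symbols = ["AAPL", "KR", "NIO"]
companies = ["Apple", "Kroger", "Nio"]

# A blank English pipeline that hosts a single EntityRuler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One pattern dictionary per symbol and per company name.
patterns = []
for symbol in symbols:
    patterns.append({"label": "STOCK", "pattern": symbol})
for company in companies:
    patterns.append({"label": "COMPANY", "pattern": company})

# The ruler only matches patterns that have been added to it.
ruler.add_patterns(patterns)

# Invented sample sentence (an assumption, not the Reuters text).
doc = nlp("Apple shares rose while Kroger fell, and traders watched AAPL closely.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

String patterns like these match the exact token text, which is why a symbol written with a suffix attached to it is not picked up by this first version.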
I've extracted the lowercase "two" as well, as a company. Why is that? Well, it turns out that in our data set we've got "two", T-W-O, as a company name. That's almost always going to be a false positive, and we know that that kind of thing might be better off worked into a machine learning model. For right now, though, we're going to work under the presumption that any time we encounter this obscure company, lowercase "two", it's going to be a false positive. I also have another problem. I know for a fact that Apple the stock is referenced within this text; let's see, it's right here, and notice that it didn't find it. To make this a little easier to see, let's display what we're looking at with displacy.render, which we met a little bit ago in this video. In order to import this, if you remember, we say from spacy import displacy, and that's going to let us display our entities. Let's put this on a different cell, though, just so we don't have to execute that every time. We say displacy.render, and we render the doc object with a style equal to ent. We can see our text now popping out with our entities labeled, and you can see pretty quickly where we've made some mistakes and where we need to incorporate some things into our EntityRuler. For example, if I'm scrolling through, this is gray, a little ugly; we can change the colors, but that's beyond the scope of this video. Let's keep on going down. We notice that we have Apple dot O, and yet this has been missed by our EntityRuler. Why has this been missed? Well, spaCy's tokenizer is seeing this as a single token: Apple, dot, the capital letter O. Why is it written that way? I didn't know about this, but apparently it has to do with the way in which stock indices, I think it's on the Nasdaq,
structure things. So what can we do? We've got a couple of different options here. I know that these suffixes go through all the different letters from A to Z, so we can either work with the string library, or we can bring in a quick list that I've already written out of all the letters of the alphabet and iterate through those with our ruler up here. Let's add these letters right there. Whenever a stock pops out as the symbol plus a period followed by a letter, we want that to be flagged as a stock as well. So we add another pattern right here, and this is now going to be the symbol plus a period plus a letter. We use an f-string, a formatted string, for any occurrence of l, and we set up a loop that says for l in letters, do this. That lets us look for any instance where there is a symbol followed by a period followed by one of these capital letters that I just copied and pasted in. If we execute that cell, scroll down, and do the exact same thing we did a second ago and display this, we're now finding these highlighted as STOCK. So we're successfully getting these stocks and extracting them. Our client wants to extract a few other things, though. They don't want to extract just companies and just stocks; they also want stock exchanges and indexes. But we have one other problem. Let me get rid of the display mode and switch back to just our list of entities, because it's a little easier to read for this example. We see a couple of other stocks popping out: we now know that Kroger stock is here, and the Nio dot N stock is in this text as well. Now we're starting to see a greater degree of
specificity. For right now, I'm going to treat "two" as a stop; a technical term would be something like a stop word, something that I don't want included in the model. So I'm going to make a list of stops, and we're just going to include "two" in it. Then we say for company in companies, do all this if company not in stops. What this means is that our pipeline, besides having all of these different rules, has another rule that checks whether the company name is a stop, and if it is, we just skip over it and ignore it. If we run it, we notice that we've now successfully eliminated what we would presume to be a consistent false positive, something that's going to come up again and again. Great, so we've been able to get this to where it works pretty well. What I also want to work into this model, if you remember, are things like indexes. Fortunately, I've also provided a list of different indexes, everything like the Dow Jones; there are about 13 or 14 of them. Let's import those up above, right here in this cell, so it goes in sequential order and follows the textbook better too. It's a new DataFrame object, df2, equal to pd.read_csv. We read in the data file that I've given us, indexes.tsv, with a separator equal to a tab. Let's see what that looks like. So these are all the different indices. Now, I know I'm going to have a problem right out of the gate, and that's that sometimes you're going to see things referenced as just S&P 500. I don't know a lot about finance, but I know that you don't always see it written as S&P 500 Index. I do think these index symbols are also going to be useful, so like I did before, I'm going to
convert these things into lists so they're a little easier to work with in a for loop. I say indexes is equal to df2.IndexName, grabbing that column, .tolist(), and index_symbols is equal to df2.IndexSymbol.tolist(). These are two different lists, but they're both going to have the same exact entity label, which is INDEX. Let's iterate over these and add them in as well. For index in indexes, we want the label to be INDEX and the pattern to be index. That lets us go through and grab all of those, and we do the same thing with index_symbols, keeping the two a little separated here. Without making any other adjustments, let's see how this does with the new patterns we've added. Because we've already got this text loaded into memory, I'm going to put this right here: doc is equal to nlp(text), for ent in doc.ents, print off ent.text and ent.label_. We can go through, and we're now able to extract some indexes. When I was looking at this text, though, I noticed that there was at least one instance where the index was referenced by a name like S&P 500, right here. Notice that it isn't found, because it doesn't have the word Index after it. Notice also that none of our symbols are being found, because they all seem to be preceded by a dot, so in this case dot DJI. That's something else I'd have to work into this model, and it's not there in the data set I provided, so I'd need to collect a list of these different names and work those into the EntityRuler as well. For right now let's ignore that and focus on including this S&P 500. So how can I get the S&P 500 in there from the list I already gave it? Well,
what I can do is say, okay, under these indexes, I don't only want to add that specific pattern. Let's break these names up into different words. So I have words equal to index.split(), and then I make the presumption that the first two words, so S&P 500, S&P 400, are sometimes going to be referenced by themselves. I want to work that into the model as well, so we say patterns.append, copying this, with something like a space joined over words up until the second index. Let's work that into our patterns and our pipeline and run our nlp again, and you'll find that we're now able to capture things like S&P 500 that aren't followed by the word Index. We see that S&P 500 is now popping out time and again. That's fantastic; I'm pretty happy with that. Now we're getting a deeper sense of what this text is about without having to read it. We know that it's going to deal heavily with Apple, and that it's also going to deal tangentially with some of these other things. But I also want this pipeline's EntityRuler to be able to find different stock exchanges. I've got a list I cultivated of different stock exchanges, things like NYSE. So I say df3 is equal to pd.read_csv on the stock exchanges TSV file, and the separator is again a tab. Let's take a look at what this looks like. There we are. It's a pretty large TSV file that's got a bunch of different rows. There are a couple of columns I'm most interested in, specifically the Google prefix and the description. The description has the actual name, and the prefix has this really
nice abbreviation that I've seen pop out a few different times, such as NASDAQ here. If we keep going down we'd see other ones as well, like NYSE. These are different stock exchanges. So let's pop back down here and convert those things into lists as well. We say exchanges is equal to the df3 ISO column, .tolist(), and then I also grab the Google prefix column from df3. I have to access this one like a dictionary, with square brackets, because the way the data set's cultivated, the column name has a space in the middle. This is a common problem that you run into. On top of that, I also want to grab all of these exchange names, so I add df3.Description.tolist(). So I'm making one large list, exchanges. And I get this error because it says Google prefix isn't an actual thing, and in fact it's prefix with an i. Now we're able to get all these things extracted. What I want to do now is work all these different symbols and descriptions into the pipeline as well. So I say for e in exchanges, patterns.append, with a label that's going to be, let's do STOCK_EXCHANGE, and then a pattern, which is going to be e in this case. As we're going to see, this is not adequate; we need to do a few different things to really work this out, but it's good enough to at least get started. It's going to take just a second, and the main thing taking time right now is these different for loops. If we keep going down, we now see that we were able to extract the NYSE stock exchange. So in very short order, maybe 20 or 30 minutes, we've been able to work all of these different things into a pipeline. We do, however, see a couple of problems, and this is where I'm going to leave it, because you've
got the basic mechanics down. Now it's time for you, as a domain expert, to come up with rules to solve some of these problems. Nasdaq is not a company, so there's a problem with the data set: Nasdaq is listed as a company name in one of the data sets, and we need to work that out so that Nasdaq is never labeled as a company. The S&P is now coming out correctly as S&P 500. There might be instances where just S&P is referenced, which I think in that context would probably be the S&P 500, but nevertheless we've been able to extract these things. Sometimes the Dow Jones Industrial Average might just be referred to as Dow Jones, so this index might be just these first two words; I know that's a common occurrence. We've also seen that we weren't able to extract some of those things that were a period followed by a symbol referencing the index itself. Nevertheless, this is a really good starting point, and you can see how in just a few minutes you're able to generate something that can extract information from unstructured text. At the end of the day, like I said in the introduction to this entire video, that's one of the essential tasks of NLP. Designing this and implementing it is pretty quick and easy; perfecting it is where the time really goes. To get this financial analysis EntityRuler working really well, where it has almost no false positives and almost never misses a true positive, would take maybe a few more hours of work, and eventually there are certain things you might find would work better in a machine learning model. Nevertheless, you can see the degree to which rules-based approaches in spaCy can accomplish some pretty robust tasks with a minimal amount of code, so long as you have access to, or have already cultivated, the data sets required. Thank you so much for watching this video series on spaCy, an introduction to basic concepts of natural language
processing, linguistic annotations in spaCy, vectors, pipelines, and rules-based spaCy. If you've enjoyed this video, please like and subscribe down below, and if you've also found this video useful, consider joining me on my channel, Python Tutorials for Digital Humanities. If you have liked this and found this video useful, I'm envisioning a second part to this video where I go over the machine learning aspects of spaCy. If you're interested in that, let me know in the comments down below and I'll make a second video that corresponds to this one. Thank you for watching, and have a great day.
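
The exchange step described in the transcript (merge several columns into one list, then append one pattern per entry) can be sketched roughly as follows. This is a minimal illustration, not the course's actual notebook: the label string STOCK_EXCHANGE, the helper names build_patterns and tag_exchanges, and the sample values are all assumptions, and the spaCy calls assume spaCy v3. Skipping non-string values and duplicates is an addition here, since cells read with pandas can come back as NaN and the merged columns may repeat names.

```python
def build_patterns(values, label="STOCK_EXCHANGE"):
    """Turn raw column values into EntityRuler phrase patterns.

    `label` is an assumed name; the transcript only says "stock exchange".
    """
    seen = set()
    patterns = []
    for value in values:
        # Missing cells from pandas arrive as float NaN, so skip anything
        # that is not text.
        if not isinstance(value, str):
            continue
        name = value.strip()
        # Skip empty strings and names we have already added.
        if not name or name in seen:
            continue
        seen.add(name)
        patterns.append({"label": label, "pattern": name})
    return patterns


def tag_exchanges(text, patterns):
    """Run a blank English pipeline with an EntityRuler over `text`."""
    # Imported lazily so build_patterns can be used without spaCy installed.
    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Hypothetical stand-in for the three columns merged in the transcript.
raw_exchanges = ["NYSE", "NASDAQ", float("nan"), "NYSE", "New York Stock Exchange"]
patterns = build_patterns(raw_exchanges)
```

With spaCy installed, calling tag_exchanges("Shares rose on the NYSE today.", patterns) would run the ruler over a sentence. As the transcript notes, plain string patterns like these are only a starting point: short forms such as "Dow Jones" for the full index name need their own patterns.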

Original Description

In this spaCy tutorial, you will learn all about natural language processing and how to apply it to real-world problems using the Python spaCy library.

💻 Course website with code: http://spacy.pythonhumanities.com/
✏️ Course developed by Dr. William Mattingly. Check out his channel: https://www.youtube.com/pythontutorialsfordigitalhumanities
❤️ Try interactive Python courses we love, right in your browser: https://scrimba.com/freeCodeCamp-Python (Made possible by a grant from our friends at Scrimba)

⭐️ Course Contents ⭐️
⌨️ (0:00:00) Course Introduction
⌨️ (0:03:56) Intro to NLP
⌨️ (0:11:53) How to Install spaCy
⌨️ (0:17:33) SpaCy Containers
⌨️ (0:21:36) Linguistic Annotations
⌨️ (0:45:03) Named Entity Recognition
⌨️ (0:50:08) Word Vectors
⌨️ (1:05:22) Pipelines
⌨️ (1:16:44) EntityRuler
⌨️ (1:35:44) Matcher
⌨️ (2:09:38) Custom Components
⌨️ (2:16:46) RegEx (Basics)
⌨️ (2:19:59) RegEx (Multi-Word Tokens)
⌨️ (2:38:23) Applied SpaCy Financial NER

🎉 Thanks to our Champion and Sponsor supporters:
👾 Wong Voon jinq
👾 hexploitation
👾 Katia Moran
👾 BlckPhantom
👾 Nick Raker
👾 Otis Morgan
👾 DeezMaster
👾 AppWrite

--

Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://freecodecamp.org/news

This course teaches beginners how to use spaCy and Python for Natural Language Processing, covering topics such as linguistic annotations, named entity recognition, and pipelines. By the end of the course, students will be able to apply NLP to real-world problems and use spaCy for various NLP tasks.

Key Takeaways

Install spaCy
Understand linguistic annotations
Implement named entity recognition
Use word vectors
Create pipelines
Use EntityRuler and Matcher
Create custom components
Apply RegEx to match multi-word tokens

💡 spaCy is a powerful library for NLP tasks, and understanding its components and how to use them is crucial for effective NLP applications.
