Chapter 2 Live Session with Sylvain

HuggingFace · Intermediate · AI News & Updates · 5y ago

Key Takeaways

The video covers the use of the transformers library for sentiment analysis, sequence classification, and tokenization, with a focus on the pipeline object, AutoModel and AutoTokenizer classes, and model configuration. It also discusses the importance of post-processing steps, such as converting logits to probabilities, and the use of special tokens and attention masks in tokenization.

Full Transcript

Welcome to the live session where we'll go over chapter 2 of the Hugging Face course. I'm joined by Lewis, who is on the chat and is going to answer all your questions quicker than I will. Don't hesitate to ask all your questions, because I'm going to read them aloud and answer them on the live stream; that's the main advantage of following this live stream instead of just watching the course by yourself.

In this chapter 2 we'll look at the pipeline object that we used at length during chapter 1 on all NLP tasks, and we'll see exactly how it works: how it loads the model, how it preprocesses the inputs with a tokenizer, and then how it post-processes the outputs to get the predictions and probabilities that we got during chapter 1. We'll watch a few videos, I'll answer all the questions that you have, and we'll do more live coding than in chapter 1, because chapter 1 was just a general introduction and there is a lot more code in chapter 2.

As an introduction: as you may know, Hugging Face is mainly known for its Transformers library, which is a library containing a lot of Transformer models. It provides an easy API to download pretrained models and to use the different architectures. I think there are more than 60 architectures now available in the library, and they all expose a unified API that provides you with either a torch module or a TensorFlow Keras model that you can use by yourself or that you can train with the API that the library also provides. The goal is ease of use, flexibility, and simplicity. The library doesn't contain any abstraction at all; it's not a library composed of building blocks. Every model, every one of the 60 architectures I was talking about, is completely defined in its own modeling file. So we can have a quick look, for instance, at the modeling_bert file, which contains all the code of the BERT model inside the library. If you look at the
imports, you see there are just torch imports and then some internal classes that this file uses, which are shared between all models, mainly the output types that we use. We'll see exactly what those outputs are a little bit later when we code. But there are no other imports: there is no attention block that is reused across models. The attention block for the BERT model is defined inside this modeling file. So you have BertEmbeddings, you have BertSelfAttention, etc. The idea is that if you want to play around with the model and change a line of code inside it, you won't have to learn 50 different files, and it's not something subclassing something that subclasses something else. (Oh, I'm just realizing I'm hiding a bit of the screen. It's okay, I guess; I'm just going to scroll faster.) You have everything in that modeling_bert file, and you can play around and modify everything you want if you want to experiment with it. That's really one of the strengths of the Transformers library, one of the features that our users have said they like a lot, which is why I wanted to show it to you briefly.

So we'll see how the pipeline API loads that BERT model (it was actually a DistilBERT model that we used in chapter 1) and the corresponding tokenizer, and how to do by hand everything that this pipeline function was doing, so that you can tweak any of the steps if you need to on your own tasks. Let's begin with the first section and the first video. We're going to watch this introductory video that presents what's happening behind the pipeline. I'm not going to stream the videos from YouTube because that makes the live stream lag a lot; I have all the videos locally, so if you give me just one minute, I'm going to find it and play it from my computer.

What happens inside the pipeline function? In this video, we'll look at what actually happens when we use the pipeline function of
the Transformers library. More specifically, we'll look at the sentiment analysis pipeline and how it went from the two following sentences to the positive and negative labels with their respective scores. As we've seen in the pipeline presentation, there are three stages in the pipeline. First, we convert the raw text to numbers the model can make sense of, using a tokenizer. Then those numbers go through the model, which outputs logits. Finally, the post-processing step transforms those logits into labels and scores. Let's look in detail at those three steps and how to replicate them using the Transformers library, beginning with the first stage, tokenization.

The tokenization process has several steps. First, the text is split into small chunks called tokens. They can be words, parts of words, or punctuation symbols. Then the tokenizer will add some special tokens if the model expects them. Here the model used expects a [CLS] token at the beginning and a [SEP] token at the end of the sentence to classify. Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pretrained model. To load such a tokenizer, the Transformers library provides the AutoTokenizer API. The most important method of this class is from_pretrained, which will download and cache the configuration and the vocabulary associated with a given checkpoint. Here the checkpoint used by default for the sentiment analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, which is a bit of a mouthful. We instantiate a tokenizer associated with that checkpoint, then feed it the two sentences. Since the two sentences are not of the same size, we'll need to pad the shortest one to be able to build an array. This is done by the tokenizer with the option padding=True. With truncation=True, we ensure that any sentence longer than the maximum the model can handle is truncated. Lastly, the return_tensors option tells the tokenizer to return PyTorch tensors. Looking at the result, we see we have a
dictionary with two keys. input_ids contains the IDs of both sentences, with zeros where the padding is applied. The second key, attention_mask, indicates where padding has been applied, so the model does not pay attention to it. This is all that is inside the tokenization step.

Now let's have a look at the second step, the model. As for the tokenizer, there is an AutoModel API with a from_pretrained method. It will download and cache the configuration of the model as well as the pretrained weights. However, the AutoModel API will only instantiate the body of the model, that is, the part of the model that is left once the pretraining head is removed. It will output a high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for a classification problem. Here the tensor has two sentences, each of 16 tokens, and the last dimension is the hidden size of our model, 768. To get an output linked to our classification problem, we need to use the AutoModelForSequenceClassification class. It works exactly like the AutoModel class, except that it will build a model with a classification head. There is one auto class for each common NLP task in the Transformers library. Here, after giving our model the two sentences, we get a tensor of size 2 by 2: one result for each sentence and for each possible label. Those outputs are not probabilities yet; we can see they don't sum to 1. This is because each model of the Transformers library returns logits.

To make sense of those logits, we need to dig into the third and last step of the pipeline, post-processing. To convert logits into probabilities, we need to apply a softmax layer to them. As we can see, this transforms them into positive numbers that sum up to 1. The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The first probabilities (index 0) correspond to the negative label, and the second (index 1) correspond to the
positive label. This is how our classifier built with the pipeline function picked those labels and computed those scores. Now that you know how each step works, you can easily tweak them to your needs.

And I am back, and I think there's something wrong with the webcam. Let me just double check. Okay, so somehow my head disappeared. I don't know why exactly. If you can see it, feel free to say it in the chat.

In the meantime, there is a question: what does the triple dot signify in "from ...activations import ACT2FN" in the .py files shown? Very good question. This is standard Python. If you are building a package in Python and you're trying to import things, you have several levels, so it's because of the structure of the Transformers repo. Let me try to pull it back up in my VS Code. Here we are: you have the transformers folder, and then inside the transformers folder you have a models subfolder, and then a bert subfolder, and then the modeling file is here. We've organized the code that way to avoid having all the files directly in the transformers folder, because, as I said, we have like 60 different architectures, so that would be a lot of files. When you're trying to import and you say "from ...", it's to go back up inside the structure. So "from . import something" would be in the bert folder, the second dot is the models folder, and then the third dot gets back to the transformers folder. It's just a way to come back to the root of the directory.

I didn't see any other questions, but don't hesitate to ask your questions at any time in the chat and I'll answer as best as I can. And as I expected, my webcam is not showing anymore, and I have no idea why, because I just showed the video. I know it's not working; let me just try something. Here we are. Sorry about that, I had to just shut it down and restart it. So we're good to continue our exploration behind
the pipeline function, and so we'll just take a little bit of the code that we just saw in the video. I'm not going to show it from the section side, but remember that in all the sections that have code, you have an "Open in Colab" button at the top, like this, which I opened a little bit earlier just to execute the first cell, which installs everything and can take a bit of time, and then the second cell, which downloads the model. I executed it already so that we have the result instantaneously here and we are ready to look at the rest of the notebook. Let me move myself to the other side of the screen, because the code is mostly on the left.

So like in the video, we'll look exactly at the code that is executed when we use the sentiment analysis pipeline on two sentences like that, and see how we get to those results and labels. As we've seen in the video, the first step, the preprocessing, is done by a tokenizer. We'll look into tokenizers in detail a little bit further ahead, but for now you just need to know that the tokenizer takes the input text, so those two sentences, "I've been waiting for a HuggingFace course my whole life" and "I hate this so much", and converts them into numbers, because the model doesn't understand text, it understands numbers. Basically, behind the scenes, it's going to split that text into small chunks that we call tokens (hence the name tokenizer) and then associate each of those tokens with a unique ID, which are the numbers that we're going to see.

To load the tokenizer, we need to know the identifier of the tokenizer. Here this is the identifier of the model that is used by the sentiment analysis pipeline by default, and then we just call AutoTokenizer.from_pretrained. Here the files are already downloaded because I executed this earlier, but if it's the first time you're executing this, it's going to download
the files of the tokenizer, in particular the vocabulary, which contains the mapping from token to unique ID, and instantiate it. Once you have that object available, you can feed it your inputs directly like this, so raw inputs inside the tokenizer. We'll explain exactly what those padding and truncation arguments mean a little bit further ahead. We also tell it to return tensors, and since we're using PyTorch here, we tell it to return PyTorch tensors with this "pt". You can also say "tf" for TensorFlow tensors, "np" for NumPy arrays, or "jax" for Flax (I guess for Flax they are also NumPy-style arrays). If we execute it, we can see that we get an output which is a dictionary with input_ids and attention_mask. We'll explain what the attention mask means a little bit further in this live session. The input IDs are the unique numbers I was talking about: the tokenizer converted that text into small chunks that are called tokens, and each of those tokens has been associated with a unique number. Once we have that, we can use the model on this input.

(Why are you annoying me?) If we want to have a look back at how those IDs correspond to the text that we had at the beginning, we can use the decode method of our tokenizer. So I type tokenizer.decode, and then I take my inputs and grab the key input_ids, like this. This is a tensor, so I'm going to convert it to a list, and this is a list of lists because I have the two sentences in my tensor, so I'm just taking the first one, for instance. If I execute that, I can see my original text, which has been a bit preprocessed: there is no capital anymore for the "I", for instance. We can also see that the tokenizer added something at the beginning and something at the end. This is perfectly logical: the tokenizer is adding those tokens because the model expects
them. So I'm just going to pause here for questions before I go into the model part of the code.

What does "Auto" in AutoTokenizer mean, and what type of tokenizer is used? The "Auto" in the AutoTokenizer class means that you can load any tokenizer corresponding to any architecture using that API. For instance, here our model is a DistilBERT model, so the tokenizer is going to be a DistilBERT tokenizer. You can double check that by just adding a cell: if I type tokenizer and print the output, it tells me it's a PreTrainedTokenizerFast. The representation is not super useful, but the type should be DistilBertTokenizerFast. If I had used a BERT checkpoint, I would have a BertTokenizerFast; if I had used, I don't know, a BART checkpoint, I would have a BartTokenizerFast, etc. So the "Auto" in AutoTokenizer means that the class is going to automatically pick the right subclass of tokenizer, the one corresponding to the model used by your checkpoint. Your code here, tokenizer = AutoTokenizer.from_pretrained(checkpoint), is going to work for any checkpoint on the Model Hub, whatever the class of your model. As long as it's a class of model that's been implemented in Transformers, it's going to work. Whereas if you were using DistilBertTokenizer here, you would have to change the class used if you change the type of checkpoint. For instance, if you use a BERT model, you would have to change it to BertTokenizer; if you were using a GPT-2 checkpoint, you would need to change it to GPT2Tokenizer, etc.

The second question was: will "I'm" be broken into "I" and "am"? The answer to that is yes. I'm just going to postpone that and answer more completely a little bit later, when we see exactly the different types of tokenizers.

And what is the difference between a fast tokenizer and a standard tokenizer? Very good question. We usually have two tokenizers for each model: one that is
called slow or standard, and the other that is called fast. The fast tokenizer is backed by the Hugging Face Tokenizers library, which is not written in Python but in Rust. You may know that Python is a slow language, so if you do the whole tokenization in pure Python, it can be a bit slow when you have lots and lots of text, whereas the fast tokenizer backed by Rust is going to be extremely fast. That's the main difference between the two of them. If you are just processing one text, you won't see any difference, but if you're processing tens of thousands of texts at the same time, the fast tokenizer is going to be much faster than the Python tokenizer.

That's all the questions we have for the moment, so I'm going to continue and look at the model. The same way we have an AutoTokenizer API, we have an AutoModel API, and again the "Auto" in the AutoModel API means that this class is going to pick the right subclass (BertModel, GPT2Model, DistilBertModel, etc.) depending on the checkpoint that it receives. Here, since it's a DistilBERT checkpoint, it's going to output a DistilBertModel. I can just show you here by typing model, and it should have a nice repr: we can see here DistilBertModel. I'm going to remove that cell because it's a bit annoying.

Again, we didn't download any file here, because when we executed the pipeline instruction at the very beginning, we downloaded everything we needed, so the model is already cached. That's why we don't see any download here. We do get a warning, which I'm going to explain just after this cell, which is because this AutoModel class is going to give us the base pretrained model, and this base pretrained model doesn't output a classification of our sentence between positive and negative. It outputs the hidden states of the pretrained model, which are of dimension 768.
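The split between a model body and a classification head, and the softmax step from the video, can be mimicked in plain Python. This is a minimal sketch that only reproduces the shapes quoted in the session (2 sentences, 16 tokens, hidden size 768, 2 labels); it does not use the Transformers library. The names toy_body and toy_head, the random numbers, and the choice of feeding the first token's vector to the head are all assumptions made for illustration.

```python
import math
import random

# Shapes quoted in the session: 2 sentences, 16 tokens each, hidden size 768, 2 labels.
BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 2, 16, 768, 2
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # index 0 = negative, index 1 = positive

random.seed(0)  # only so that repeated runs print the same toy numbers


def toy_body(batch, seq_len, hidden):
    """Hypothetical stand-in for the base model: one hidden vector per token."""
    return [
        [[random.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(seq_len)]
        for _ in range(batch)
    ]


def toy_head(hidden_states, weight, bias):
    """Hypothetical stand-in for a classification head: a linear layer applied
    to the first token's vector (an assumption made for this sketch)."""
    logits = []
    for sentence in hidden_states:
        first_token = sentence[0]
        logits.append(
            [sum(w * x for w, x in zip(row, first_token)) + b
             for row, b in zip(weight, bias)]
        )
    return logits


def softmax(row):
    """Turn one row of logits into positive numbers that sum to 1."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


hidden_states = toy_body(BATCH, SEQ_LEN, HIDDEN)
weight = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

logits = toy_head(hidden_states, weight, bias)
probs = [softmax(row) for row in logits]

body_shape = (len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))
head_shape = (len(logits), len(logits[0]))
print("body output shape:", body_shape)  # (2, 16, 768)
print("head output shape:", head_shape)  # (2, 2)
for row in probs:
    best = max(range(NUM_LABELS), key=lambda i: row[i])
    print(ID2LABEL[best], round(row[best], 4))
```

The body alone yields one 768-dimensional vector per token, which is why it is not directly usable for classification; only after the head does each sentence get one number per label, and only after the softmax do those numbers behave like probabilities.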
So that's why we have a warning here: that base model is missing a classification head, and as the warning was saying, some weights of the checkpoint were not used when initializing the model, specifically classifier.bias, classifier.weight, etc., which are all the weights of the classifier head. So AutoModel is something that can be useful if you just want the tensor of hidden features output by your pretrained model, but here we want to classify our sentences between positive and negative, so we need a model with a classification head. That's given to us by the AutoModelForSequenceClassification class. This one, when I execute the cell, is not going to output any warning, because all the weights are going to be used and there is not going to be anything problematic when you load the checkpoint. If we give our inputs to that model and look at the shape, we can see that it's going to be a tensor of size 2 by 2.

One little comment about the outputs of Transformer models. Those were the things that were imported at the beginning of our modeling file; if you remember, we had a lot of imports from the ...modeling_outputs module. Those outputs are a bit of a hybrid between a namedtuple and a dictionary. You can access everything either by using dots like this, so outputs.logits, or by asking with a key, so outputs and then we ask for the "logits" key, like we would with a dictionary, which works as well. If we just ask for the bare representation, we can see that it's a SequenceClassifierOutput which contains logits and this tensor. Here it contains only one thing, but most of the Transformer models can return lots of things as outputs. For instance, if I added labels here, it would return a loss. We could also ask the model to return all the hidden states or all the attention results, in which case our output becomes a bit crowded, which is why it's organized as
this special class that behaves like a dict and like a namedtuple. So those logits (because that's the name of what we get) are numbers which appear a bit random; they don't really look like probabilities. We need to do one last step of post-processing, as we saw in the video, to convert them into probabilities, that is, apply the softmax. If we just import the softmax function from torch and apply it, we can see that now we get the exact same scores that we had at the beginning. For instance, for the second sentence we have 0.99 something, so 99.95 percent, and if we look back at the result of the pipeline, we can see that we had 99.945 here, so the exact same scores. To know which one was the negative and which one the positive label, the pipeline is using a field from the model configuration. For each model, the configuration associated with it is accessible via the config attribute, and the id2label field contains the correspondence between integers and labels.

Let's see if we have any questions. First question: is there any reason to use a standard Python tokenizer? I work at Hugging Face, so I'm a little bit biased, and I'm going to say no. Using the fast tokenizer is always going to be better. Even if you don't have many texts, it's going to be at the same speed at the very minimum, or maybe faster than the standard Python tokenizer. But it also has many more features. We'll look at them mainly in the second part of the course, but it has features that have been designed specifically for tasks like token classification or question answering, which allow you to know, for instance, which word a token comes from, or exactly which span of text those tokens represent in the original text. Those are features that are way harder to get with the slow tokenizer.

Another question is: are there any similar tutorials or
resources for sentiment analysis for multi-label data, or regression tasks for sequences? That's a very good question. Not right now, and we should definitely work on that. The main thing you would have to change is the post-processing at the end. Instead of applying softmax, for regression you wouldn't apply anything, I guess, and for multi-label, so multiple possible labels for each of your sentences, you would probably apply a sigmoid to your result.

And then the last question is: the first two input IDs of the two lines are the same, but the words are not; why is that? We have to look back at the decoding, because the first two tokens are indeed the same. Each sentence here begins with 101, which is the ID of that [CLS] token, and then the 1045 corresponds to the "I", because the two sentences begin with "I". That's why the first two input IDs are the same. Then it starts being different because the two sentences continue with different words.

Another question: would it be possible to do a video explaining the code structure of the library and the idea behind it, to maybe make it easier to contribute? Very good question, and it's actually scheduled for the last part of the course. In part 3 of the course, we have a chapter that's going to be dedicated to how to contribute to the Hugging Face libraries, in particular the Transformers library, and there we'll have videos explaining the code structure of all the libraries of the Hugging Face ecosystem.

Again, don't hesitate to ask any questions; I'm going to pause regularly to answer them. So that's pretty much everything that was behind the pipeline, and we've seen it in detail in the code. Now let's have a look at the main object inside the pipeline, which is the model. Again we have a short video, which I'm going to show from my computer, and then we'll look at the code in detail, and if you have questions I can answer them
and live code with you. Let me just grab the video, which of course I can't find easily, otherwise that would be too easy. Why did it disappear? Almost there, I promise; I just moved it recently.

How to instantiate a Transformers model. In this video, we'll look at how we can create and use a model from the Transformers library. As we've seen before, the AutoModel class allows you to instantiate a pretrained model from any checkpoint on the Hugging Face Hub. It will pick the right model class from the library to instantiate the proper architecture and load the weights of the pretrained model inside. As we can see, when given a BERT checkpoint, we end up with a BertModel, and similarly for GPT-2 or BART. Behind the scenes, this API can take the name of a checkpoint on the Hub, in which case it will download and cache the configuration file as well as the model weights file. You can also specify the path to a local folder that contains a valid configuration file and a model weights file. To instantiate the pretrained model, the AutoModel API will first open the configuration file to look at the configuration class that should be used. The configuration class depends on the type of the model: BERT, GPT-2, or BART, for instance. Once it has the proper configuration class, it can instantiate that configuration, which is a blueprint to know how to create the model. It also uses this configuration class to find the proper model class, which is then combined with the loaded configuration to load the model. This model is not yet a pretrained model, as it has just been initialized with random weights. The last step is to load the weights from the model file inside this model.

To easily load the configuration of a model from any checkpoint or folder containing the configuration file, we can use the AutoConfig class. Like the AutoModel class, it will pick the right configuration class from the library. We can also use the specific class corresponding to a checkpoint, but we'll need to change the code each
time we want to try a different model architecture. As we said before, the configuration of a model is a blueprint that contains all the information necessary to create the model architecture. For instance, the BERT model associated with the bert-base-cased checkpoint has 12 layers, a hidden size of 768, and a vocabulary size of 28,996. Once we have the configuration, we can create a model that has the same architecture as our checkpoint but is randomly initialized. We can then train it from scratch like any PyTorch model. We can also change any part of the configuration by using keyword arguments. The second snippet of code instantiates a randomly initialized BERT model with 10 layers instead of 12. Saving a model once it's trained or fine-tuned is very easy: we just have to use the save_pretrained method. Here the model will be saved in a folder named my-bert-model inside the current working directory. Such a model can then be reloaded using the from_pretrained method. To learn how to easily upload this model to the Hub, check out the push to Hub video.

Okay, so let's see if we have any questions. Not just yet. Don't hesitate to ask any question in the chat and I'll answer them regularly. For this, let's open the Colab and look a little bit at the code behind the AutoModel API. In particular, I told you a little bit earlier that our model could return more than just logits, and that it could return, for instance, all the hidden states or things like that, and we'll see how to do that here.

So if you had any questions, now would be an ideal time, because I didn't execute this notebook in advance, so we need to wait for it to install everything. Okay, that didn't take so long. To create a random model that looks exactly like the BERT model, we can just instantiate the default configuration and use that configuration inside the model, like we saw in the video. The config contains lots of fields that are related to what's happening inside the model.
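The loading sequence described in the video (read the configuration, pick the configuration class, find the model class, create a randomly initialized model, then load the weights) can be illustrated with a toy registry. This is a sketch of the dispatch idea only, not the library's implementation: every class and function name below (ToyBertConfig, ToyGPT2Config, ToyBertModel, toy_auto_from_pretrained, and so on) is hypothetical, the "weights" are just a marker string, and the GPT-2 values are placeholders. The BERT defaults reuse the bert-base-cased figures quoted above.

```python
import json
from dataclasses import dataclass


@dataclass
class ToyBertConfig:
    # Defaults taken from the bert-base-cased figures quoted in the video.
    model_type: str = "bert"
    num_hidden_layers: int = 12
    hidden_size: int = 768
    vocab_size: int = 28996
    output_hidden_states: bool = False


@dataclass
class ToyGPT2Config:
    # Illustrative placeholder values, not taken from any real checkpoint.
    model_type: str = "gpt2"
    num_hidden_layers: int = 4
    hidden_size: int = 64
    vocab_size: int = 1000
    output_hidden_states: bool = False


class ToyModel:
    """Holds a config and a marker saying whether weights are random or loaded."""

    def __init__(self, config):
        self.config = config
        self.weights = "random"  # a freshly built model is randomly initialized

    def load_weights(self, weights):
        self.weights = weights


class ToyBertModel(ToyModel):
    pass


class ToyGPT2Model(ToyModel):
    pass


# Hypothetical registries: model type -> config class -> model class.
CONFIG_CLASSES = {"bert": ToyBertConfig, "gpt2": ToyGPT2Config}
MODEL_CLASSES = {ToyBertConfig: ToyBertModel, ToyGPT2Config: ToyGPT2Model}


def toy_auto_from_pretrained(config_json, weights, **overrides):
    raw = json.loads(config_json)                   # 1. open the configuration file
    config_cls = CONFIG_CLASSES[raw["model_type"]]  # 2. pick the configuration class
    config = config_cls(**{**raw, **overrides})     # 3. build the blueprint
    model = MODEL_CLASSES[config_cls](config)       # 4. randomly initialized model
    model.load_weights(weights)                     # 5. load the checkpoint weights
    return model


checkpoint_config = json.dumps({"model_type": "bert", "num_hidden_layers": 12})
model = toy_auto_from_pretrained(checkpoint_config, weights="loaded",
                                 output_hidden_states=True)
print(type(model).__name__, model.config.num_hidden_layers, model.weights)

# Building from a config alone gives a randomly initialized model, here with 10 layers.
scratch = ToyBertModel(ToyBertConfig(num_hidden_layers=10))
print(scratch.config.num_hidden_layers, scratch.weights)
```

The keyword override in the first call plays the role of a harmless option such as returning all hidden states, while the second call mirrors the video's example of a randomly initialized 10-layer model created from a configuration.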
So for instance, we have the hidden size, we have the number of words that our model can take in, we have the vocabulary size, the model type, which is bert, the activation it uses, which is GELU, etc. That model, created using just the config, is going to be randomly initialized, and there is nothing to download there. If we want to use a pretrained model, we have to use the from_pretrained method, which is going to download the exact config and then the model weights. As we saw in the video, it's going to use the config to first instantiate a randomly initialized model and then load the weights from that checkpoint inside the model.

If we want to change anything in a model, more specifically in its configuration, we can say it in several places. For instance, we can start with a config that is exactly like BERT's, so config = AutoConfig.from_pretrained("bert-base-cased"), which is already downloaded from here. Oh, what do you mean AutoConfig is not defined? Oh yeah, I've only imported BertConfig, so let's continue with that. So BertConfig.from_pretrained, which is going to reuse the config that was downloaded here. This is the configuration of the BERT model, and if we want to change anything inside it (we've seen in the video, for instance, how to change the number of hidden layers), let's say that I want my model to return all the hidden states, which I would say with output_hidden_states=True. I can do this in the config and then instantiate my model with model = BertModel(config). I can also directly change this when I do BertModel.from_pretrained here.

If I were to change the number of hidden layers here, it wouldn't work anymore. The command would fail, because I would then try to load a checkpoint that has been defined with 12 layers inside a model with 10 layers, so PyTorch would complain. I mean, it would probably work, but I
would have a warning about the weights not being used, and the model would probably not give super useful results. But for something like output_hidden_states, which doesn't really change the way the model was pretrained, this is going to work super nicely.

Now let's try it on some inputs. Let's define some random inputs and then pass them to a tokenizer, so I'll have to use the BertTokenizer and instantiate it with the from_pretrained method. (Oh yes, it's defined here; I should have executed the cell already.) So I'm creating a tokenizer like this and then applying it to my inputs, returning tensors. I don't need to put the padding and truncation that we saw before, because there is only one sentence; we'll see why a little bit later. Once I have done that, I can look at my outputs, and they should now have two keys. Well, a little bit more, because the BERT model has a pooler output on top of the logits. Oh, and it's not logits anymore, sorry, it's last_hidden_state, because this is not the classification model, it's a base model. I used BertModel, which is the same as using AutoModel, not a BERT model for sequence classification, so I get last_hidden_state instead of logits. The pooler output is specific to BERT, so it always has that. And then I can see I have a last key, hidden_states, with a list of all the tensors which correspond to all the hidden states of my model.

So this is how you change the configuration of your model on the fly: either when you create the config, if you are trying to create a randomly initialized model, or in the from_pretrained method, if you are trying to use a pretrained model. In particular, if you are using a classification model, for instance a sequence classification model, the most important argument you can specify is going to be num_labels, because when you add your classification head, you want to control how many outputs that
classification head has, so you would do that with the num_labels argument. And then once you finish training or fine-tuning your model, you can use save_pretrained to save it on your hard drive, and you can use push_to_hub, which we just released today actually, on your model to directly upload it to the Hugging Face Hub so that anyone in the world can use it.

I don't see any questions. Again, don't hesitate to ask any questions, I'm going to answer them regularly. So this is all there was in the section on models, and now let's look at the tokenizer, which is responsible for preprocessing the inputs. I'm going to move myself... oh, not the screen, come back. I'm going to move myself back to the left, and we'll look at this section here and at the code inside the Colab.

So, tokenizers. Oh yeah, let's look at the video with the tokenizers overview first, and then I'll comment on everything that's happening in this section. So, the tokenizer introduction video.

In the next few videos, we'll take a look at the tokenizers. In natural language processing, most of the data that we handle consists of raw text. However, machine learning models cannot read or understand text in its raw form; they can only work with numbers. So the tokenizer's objective will be to translate the text into numbers. There are several possible approaches to this conversion, and the objective is to find the most meaningful representation. We'll take a look at three distinct tokenization algorithms. We compare them one to one, so we recommend you take a look at the videos in the following order: first word-based, followed by character-based, and finally subword-based.

So we won't look at the videos, actually. We're going to look directly at the text inside the course, and I'll comment, because we won't have time to watch all those videos in the slot we have. So, word-based tokenizers. You can look at the video in your
free time, but we're going to explain it in a little more depth. The word-based tokenizer is just going to split your sentence by word. The easiest way to do that is to take all the spaces and split your text on those spaces. More advanced would be to include some rules to split on punctuation, so for instance the exclamation mark is separated from tokenization, or here let's is split between let and apostrophe s. We can see this on this example with Jim Henson was a puppeteer, which is separated into five words here.

Word tokenizers were used a lot before Transformers, mostly. The advantage is that you split your text naturally on the spaces and punctuation. The disadvantage is that you end up with pretty large vocabularies, because there are lots of different words in English, and every time someone makes a typo in some word, you end up with a brand new word in your vocabulary.

Each word gets assigned an ID, starting from zero and going up to the size of the vocabulary. And since we can't guarantee that the user is never going to make a typo or anything, there's a special rule: if we encounter a token that doesn't exist in the vocabulary, it's usually replaced by something called the unknown token, which usually looks like that, UNK between brackets.

So this is one of the other drawbacks of word-based tokenizers. The first one is that they have very large vocabularies. The second one is that they need to learn that the word dog and the word dogs are very similar. They won't know that from scratch, because when the model is initialized randomly, it's going to have one meaning for the word dog and another one for the word dogs, and it's going to need to learn by seeing lots and lots of data that those two words are a bit alike. And the last disadvantage is that unknown token: every word that's a typo is going to end
up like this, and the model can't really learn any representation of that. It's as if you had just deleted the word in the sentence.

So another way is to just split your text on all characters, which is what character-based tokenizers do. In this case, your vocabulary is not going to be very large, because there are 256 ASCII characters for instance, a little bit more if you take the whole Unicode thing, but you're not going to end up with models that have a vocabulary size of, I don't know, 300,000 or something like that. So this is better for the vocabulary size. You probably won't get an unknown token, because you'll see all the different characters possible.

But the drawback is that now the representation is based on characters. The model has to learn that, for instance, the letter e does not mean the same thing when it's between an l and a t compared to the letter e here, between the k and the n inside the word tokenization. So the representation of each letter is less meaningful, that's what I'm trying to say, compared to what we had with words.

The other drawback is that we end up with very long sentences. For instance, for Let's do tokenization!, if we look back at our word-based tokenization, it was split into five words. With the character-based tokenization, it's split into many more, I can't count them here, but it's between 15 and 20, let's say. So we end up with longer sentences, and our Transformer models are usually constrained by a maximum length. For instance, the BERT model can only treat 512 tokens at a time, so using a character-based tokenization algorithm would make the maximum sentence you can feed the model pretty short.

So that's why Transformer models usually use a compromise between word-based and character-based tokenization, which is subword tokenization. Subword tokenization, as the name indicates, is going to split your text into subwords, so it's still split
between words, but some words are cut in two. For instance, here you've got let's, do, and then token and ization are separated. Notice that you get this small... I never know how to say that in English, but that special thing between a less-than and a greater-than sign with slash w, </w>, which means that it's the end of a word. Token doesn't have it, because we want the model to be able to differentiate token as a single word and token followed by something else, like in tokenization. So ization has the suffix that says here it's the end of the word, but token does not. That depends on the convention used by the tokenizer: some tokenizers have a thing at the beginning of words, here I'm talking about the thing at the end of words.

So this approach allows you to have a vocabulary that's not going to be too huge, and the tokens still have some semantic meaning, more meaningful than just characters. And the last thing is that for word-based tokenizers, dog and dogs were two separate words. Here, dogs would probably be split into dog and s, the same way tokenization is split between token and ization, so the model can learn that they have the same prefix. And then the suffix ization is going to be used in other words like modernization, and the model can then make sense of the suffixes and learn that they are always kind of the same.

In the next part of the course, we'll look in detail at the different algorithms, because there are three different subword tokenization algorithms, which are byte-level BPE, WordPiece and SentencePiece. We'll explain exactly the difference between them in the next part of the course.

So let's see if we have any questions before we look at how the tokenizers work in practice. Yes, I'm just going to put myself here properly.

Tokenizer breaks a sentence into tokens, but no lemmatization or stemming is performed
before learning? Let me double check. I would say no, but you should ask the question on the forum, where people that are more competent than me can answer you, because I'm not completely sure.

Could you provide intuition into WordPiece, used in BERT, and SentencePiece-type tokenizers? I could, but it's going to take a bit of time, so again I'm going to redirect you to the forum, where I can take the time to properly answer you. There is also, maybe Lewis can share it here, a tokenizer summary in the Transformers documentation that explains the difference between WordPiece and SentencePiece, which is using Unigram behind the scenes, and the key differences between the two.

Does the </w> tag add any information in the subword tokenization? So yes, as I said, it's what allows the model to know the difference between token used as a single word and token inside a word like tokenization or tokenizer.

And then let's see how the tokenizers work in practice. We have seen how to load the tokenizer using the from_pretrained method and what it returns, and we'll now quickly look at the video on the tokenization pipeline, which is going to explain what happens when we feed the tokenizer a sequence like that and how it returns those numbers. Let me just grab it from my computer, and then I'll continue answering questions.

The tokenizer pipeline. In this video, we'll look at how a tokenizer converts raw text to numbers that a Transformer model can make sense of, like when we execute this code. Here is a quick overview of what happens inside the tokenizer object. First, the text is split into tokens, which are words, parts of words, or punctuation symbols. Then the tokenizer adds potential special tokens and converts each token to its unique respective ID, as defined by the tokenizer's vocabulary. As we'll see, it doesn't quite happen in this order, but doing it like this is better for understanding. The first step is to split our input text into
tokens. We use the tokenize method for this. To do that, the tokenizer may first perform some operations, like lowercasing all words, then follow a set of rules to split the result into small chunks of text. Most Transformer models use a subword tokenization algorithm, which means that one given word can be split into several tokens, like tokenize here. Look at the tokenization algorithms video linked below for more information. The ## prefix we see in front of ize is a convention used by BERT to indicate this token is not the beginning of a word. Other tokenizers may use different conventions, however. For instance, ALBERT tokenizers will add a long underscore in front of all the tokens that had a space before them, which is a convention shared by all SentencePiece tokenizers.

The second step of the tokenization pipeline is to map those tokens to their respective IDs, as defined by the vocabulary of the tokenizer. This is why we need to download a file when we instantiate a tokenizer with the from_pretrained method: we have to make sure we use the same mapping as when the model was pretrained. To do this, we use the convert_tokens_to_ids method.

You may have noticed that we don't have the exact same results as in our first slide. Or not, as this looks like a list of random numbers anyway, in which case allow me to refresh your memory: there was a number at the beginning and another at the end that are missing. Those are the special tokens. The special tokens are added by the prepare_for_model method, which knows the indices of those tokens in the vocabulary and just adds the proper numbers to the input IDs list. You can look at the special tokens and, more generally, at how the tokenizer has changed your text by using the decode method on the output of the tokenizer object.

As for the prefix for beginning of words or parts of words, those special tokens vary depending on which tokenizer you are using. The BERT tokenizer uses [CLS] and [SEP], but the RoBERTa tokenizer uses HTML-like anchors, <s> and </s>. Now
that you know how the tokenizer works, you can forget all those intermediate methods and only remember that you just have to call it on your input texts. The output of the tokenizer doesn't just contain the input IDs, however. To learn what the attention mask is, check out the batch inputs together video. To learn about token type IDs, look at the process sentence pairs video.

So we have one question that's linked to what we were seeing just before the video. Are token</w> and token going to have separate representation IDs? Yes, they are going to have separate representation IDs because they are not the same token, which is the whole meaning of that </w> suffix.

So that's the tokenization pipeline. I'm not going to live code those intermediate methods, because you shouldn't really learn them; we're just showing them to show you the steps inside the pipeline. The main thing to remember is that you just have to call your tokenizer on your inputs like this, because this is the main API, the one that's the most useful. And now we'll look at what the attention mask is and what padding and truncation mean, the arguments that we had at the very beginning, so that we fully explain what's happening inside the tokenizer.

Oh, another question. Is there a reason you would save a pretrained tokenizer other than to just have a local copy? Very good question. There is no real reason to save your pretrained tokenizer if you didn't make any change inside it, and you would always have a local copy because AutoTokenizer.from_pretrained is going to cache the files to avoid you downloading them again. So there is no reason to save it. The one exception is when you are creating a folder that you want to push to the Model Hub, in which case you should save your tokenizer inside that folder, so that when you push to the Hub, you push your model, the configuration and the tokenizer that's used with it, and with all those three
things, the Hugging Face website is going to be able to apply the Inference API to your model, and you will be able to play with the widget online. Other than that, you won't really need to use the save_pretrained method on the tokenizer; it's mostly for the model that it's going to be super useful. Or, and we will see how to do that in part two of the course, if you're training a tokenizer from scratch because you're pretraining a model, for instance in a new language, then you'll need to use the save_pretrained method to save the results of your tokenizer.

So we're going to watch the last video for today's live session, about batching inputs together, and then we'll look more closely at the code together. Let me just launch the Colab first so that we don't have to wait after the video, and then we'll watch the video together. Come on. Yes, I want to run it. And let me grab the video, batching inputs together.

How to batch inputs together. In this video, we'll see how to batch input sequences together. In general, the sentences we want to pass through our model won't all have the same lengths. Here we are using the model we saw in the sentiment analysis pipeline and want to classify two sentences. When tokenizing them and mapping each token to its corresponding input ID, we get two lists of different lengths. Trying to create a tensor or NumPy array from those two lists will result in an error, because all arrays and tensors should be rectangular.

One way to overcome this limit is to make the second sentence the same length as the first by adding a special token as many times as necessary. Another way would be to truncate the first sequence to the length of the second, but we would then lose a lot of information that might be necessary to properly classify the sentence. In general, we only truncate sentences when they are longer than the maximum length the model can handle. The value used to pad the second sentence should not be picked randomly: the model has been pretrained with a
certain padding ID, which you can find in tokenizer.pad_token_id. Now that we have padded our sentences, we can make a batch with them. If we pass the two sentences to the model separately and batched together, however, we notice that we don't get the same results for the sentence that is padded, here the second one. Is that a bug in the Transformers library? No. If you remember that Transformer models make heavy use of attention layers, this should not come as a total surprise. When computing the contextual representation of each token, the attention layers look at all the other words in the sentence. If we have just the sentence, or the sentence with several padding tokens added, it's logical we don't get the same values.

To get the same results with or without padding, we need to indicate to the attention layers that they should ignore those padding tokens. This is done by creating an attention mask, a tensor with the same shape as the input IDs, with zeros and ones. Ones indicate the tokens the attention layers should consider in the context, and zeros the tokens they should ignore. Now passing this attention mask along with the input IDs will give us the same results as when we sent the two sentences individually to the model. This is all done behind the scenes by the tokenizer when you apply it to several sentences with the flag padding=True. It will apply the padding with the proper value to the smaller sentences and create the appropriate attention mask.

So let's look at the same thing in Colab, unless there are any questions. No, I don't see any new questions. Don't hesitate to ask your questions in the chat, again. And let's look at the same code that we had, to look again at what the padding and attention mask are exactly. So as we saw in the video, if we try to apply our model directly... oh no, it's not the exact same thing as in the video. If we try to apply our model directly on just one sentence that we tokenize and convert to IDs like that, using the same code as in the
previous video, it's going to fail because the model wants batches of inputs. So it wants... it's actually the tokenizer even
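
The padding and attention-mask step described in the batching video can be sketched in plain Python. This is only an illustration under stated assumptions: `pad_batch` is a hypothetical helper, the pad ID of 0 is assumed for the example (a real tokenizer exposes its own value), and the input IDs are made up. It is not how the library implements padding internally.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0  # assumed pad ID for this sketch; a real tokenizer has its own


def pad_batch(sequences: List[List[int]],
              pad_id: int = PAD_TOKEN_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the attention layers should look at, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up ID lists of different lengths (values are illustrative only).
batch = pad_batch([[101, 2023, 2003, 2307, 102], [101, 2919, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Both lists are now rectangular, so they could be turned into a tensor, and the mask tells the attention layers which positions are padding.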

Original Description

This is a recording of the Twitch session on June 24th, 2021. Chapter 2 of the course: https://huggingface.co/course/chapter2 Have a question? Check out the forums: https://discuss.huggingface.co/c/course/20 Subscribe to our newsletter: https://huggingface.curated.co/

This video teaches how to use the transformers library for sentiment analysis and sequence classification, with a focus on tokenization, model configuration, and post-processing steps. It covers the use of the pipeline object, AutoModel and AutoTokenizer classes, and the importance of special tokens and attention masks in tokenization.
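
To make the three tokenization granularities from the session concrete, here is a small plain-Python sketch. It is an illustration only: `toy_subword` is a hypothetical greedy splitter over a made-up two-entry vocabulary, not a real BPE, WordPiece, or Unigram implementation, and the plain space split is simpler than the punctuation rules mentioned in the session.

```python
from typing import List

text = "Let's do tokenization!"

# Word-based: split on spaces only (the simplest rule mentioned in the session).
word_tokens = text.split()

# Character-based: one token per character, spaces dropped for this sketch.
char_tokens = [ch for ch in text if ch != " "]


def toy_subword(word: str, vocab: List[str]) -> List[str]:
    """Greedy longest-match split of one word into pieces from a tiny vocab.

    The last piece gets a "</w>" end-of-word marker, mirroring the convention
    discussed in the session. Unknown stretches fall back to single characters.
    """
    if not word:
        return []
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; a single character always matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    pieces[-1] += "</w>"
    return pieces


vocab = ["token", "ization"]  # hypothetical vocabulary
print(word_tokens)
print(len(char_tokens))
print(toy_subword("tokenization", vocab))
print(toy_subword("token", vocab))
```

The last two calls show why the end-of-word marker matters: token as a whole word and token as the start of tokenization come out as different tokens.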

Key Takeaways

Load a pre-trained model using the AutoModel class
Use the pipeline object to load the model and preprocess inputs with a tokenizer
Convert logits to probabilities using post-processing steps
Add special tokens at the beginning and end of the input sequence
Use attention masks to ignore padding tokens in attention layers
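
The post-processing takeaway above, converting logits to probabilities, is usually done with a softmax. The sketch below assumes softmax is the function in question and uses made-up logits and a hypothetical label order; it only illustrates the arithmetic.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a two-class sentiment model.
logits = [-1.5, 1.6]
labels = ["NEGATIVE", "POSITIVE"]  # hypothetical label order
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 4))
```

The logits themselves are not probabilities: they can be negative and do not sum to 1, which is why this step is needed before reading off a prediction.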

💡 The transformers library provides a unified API for 60+ architectures, including BERT, and the pipeline object can be used to load the model and preprocess inputs with a tokenizer.
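
As a rough picture of what the tokenizer's preprocessing involves, the sketch below walks through mapping tokens to IDs, adding special tokens, and decoding. Everything here is assumed for illustration: the vocabulary is made up, the sentence is already split by hand, and the functions are toy stand-ins that only borrow the names of the steps mentioned in the session.

```python
from typing import Dict, List

# Hypothetical vocabulary; a real tokenizer downloads its own mapping.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "let": 4, "'": 5, "s": 6, "try": 7, "to": 8,
    "token": 9, "##ize": 10, "!": 11,
}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def convert_tokens_to_ids(tokens: List[str]) -> List[int]:
    """Map each token to its ID, falling back to the unknown token."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def add_special_tokens(ids: List[int]) -> List[int]:
    """Wrap the IDs with BERT-style [CLS] ... [SEP] markers."""
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def decode(ids: List[int]) -> str:
    """Turn IDs back into text, gluing '##' pieces onto the previous token."""
    out: List[str] = []
    for i in ids:
        tok = ID_TO_TOKEN[i]
        if tok.startswith("##") and out:
            out[-1] += tok[2:]
        else:
            out.append(tok)
    return " ".join(out)


# Tokens are already split here to keep the sketch short.
tokens = ["let", "'", "s", "try", "to", "token", "##ize", "!"]
input_ids = add_special_tokens(convert_tokens_to_ids(tokens))
print(input_ids)
print(decode(input_ids))
```

Decoding the IDs makes the added special tokens visible, which is the same trick the session suggests for inspecting what a tokenizer did to your text.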
