Chapter 2 Live Session with Lewis
Key Takeaways
This video covers the Transformers library, tokenization, and fine-tuning of pre-trained models for text classification tasks, with a focus on BERT and DistilBERT models.
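To make the three pipeline stages discussed in the session concrete (tokenizer, model, post-processing), here is a minimal pure-Python sketch. It is not the Transformers implementation: the vocabulary, the whitespace tokenizer, and the logits are all made up for illustration, and only the overall flow (special tokens, padding with an attention mask, softmax, id2label lookup) mirrors what the transcript describes.

```python
import math

# Hypothetical toy vocabulary; a real checkpoint ships its own vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "love": 5, "this": 6, "hate": 7, "so": 8, "much": 9}


def encode_batch(sentences):
    """Stage 1: split into tokens, add special tokens, map to ids, pad."""
    encoded = []
    for sentence in sentences:
        tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        encoded.append([VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens])
    max_len = max(len(ids) for ids in encoded)
    # Pad the shorter sequences so the batch is rectangular.
    input_ids = [ids + [VOCAB["[PAD]"]] * (max_len - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def softmax(logits):
    """Stage 3: exponentiate, then normalise so the scores sum to one."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


batch = encode_batch(["I love this so much", "I hate this"])

# Stage 2 stand-in: made-up logits, one row per sentence, one column per label.
logits = [[-1.5, 1.6], [4.2, -3.3]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    predictions.append((id2label[best], probs[best]))
```

In the real library these stages correspond to the tokenizer, a model with a sequence classification head, and a softmax applied to the returned logits.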
Full Transcript
Basically, the goal of this session is to go through chapter two together. In this chapter we're going to dive into the internals of the Transformers library. In particular, we're going to look at the models, so there's a set of model APIs that we're going to look at, and also the tokenizers, which we rely on heavily to convert text into a format that the models can process. You've seen in the first lessons that we have a pipeline API, and this pipeline API basically wraps all of the complexity of pre-processing and post-processing text, and also feeding it to the model, so that you just have to give it a sentence and then you can classify, for example, the sentiment. Today we want to unpack what's happening inside this function, and also to understand some of the different approaches you can take for tokenizing your text, and how to save and load the models and tokenizers. We'll finish by looking at what you have to do when you're dealing with sentences or texts that have different lengths, because it turns out that in PyTorch and TensorFlow and most deep learning frameworks we need a standardized, rectangular input for our models.

The way we're going to do this is I'm going to go through the sections and then pause for questions. In the meantime, if you have some very urgent thing that you want to ask, Omar will be here helping us with guidance.

Just to give you a taste of what we're going to be covering, but at a higher level: every single model in the Transformers library has a corresponding modeling file. For example, what I'm looking at here is the modeling file for BERT, and this file has all of the source code for all of the different tasks that you can use BERT for. For example, if I look for BertModel, this is the base class that we're going to be looking at today
which is responsible for creating contextualized embeddings of the inputs: how do we create numerical representations of our text that have some sense of meaning? This class is relatively simple. It just has the embedding layer, which you saw in the first chapter, so the thing that we pass through before we hit the transformer stack, and then we have an encoder, and this encoder is essentially responsible for converting these token embeddings into contextualized representations. I recommend, as your homework, having a look through some of this code. For any class that you see us using today, for example BertModel, have a look at the source code; this really helps you understand how transformers work. At least for me personally, it was only by going through this step by step and understanding how all the inputs go through the forward pass that I was really able to understand all the inner workings of a transformer. So that's just a little side note.

To get started, let's have a look at what really happens behind the pipeline. Let's kick-start with this video.

What happens inside the pipeline function? In this video we'll look at what actually happens when we use the pipeline function of the Transformers library. More specifically, we'll look at the sentiment analysis pipeline and how it went from the two following sentences to the positive and negative labels with their respective scores. As we've seen in the pipeline presentation, there are three stages in the pipeline. First, we convert the raw texts to numbers the model can make sense of, using a tokenizer. Then those numbers go through the model, which outputs logits. Finally, the post-processing step transforms those logits into labels and scores. Let's look in detail at those three steps and how to replicate them using the Transformers library, beginning with the first stage, tokenization. The
tokenization process has several steps. First, the text is split into small chunks called tokens. They can be words, parts of words, or punctuation symbols. Then the tokenizer will add some special tokens if the model expects them. Here, the model used expects to see a [CLS] token at the beginning and a [SEP] token at the end of the sentence to classify. Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pre-trained model. To load such a tokenizer, the Transformers library provides the AutoTokenizer API. The most important method of this class is from_pretrained, which will download and cache the configuration and the vocabulary associated with a given checkpoint. Here, the checkpoint used by default for the sentiment analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, which is a bit of a mouthful. We instantiate a tokenizer associated with that checkpoint, then feed it the two sentences. Since the two sentences are not of the same size, we'll need to pad the shortest one to be able to build an array. This is done by the tokenizer with the option padding=True. With truncation=True, we ensure that any sentence longer than the maximum the model can handle is truncated. Lastly, the return_tensors option tells the tokenizer to return a PyTorch tensor. Looking at the result, we see we have a dictionary with two keys. input_ids contains the IDs of both sentences, with zeros where the padding is applied. The second key, attention_mask, indicates where padding has been applied, so the model does not pay attention to it. This is all that is inside the tokenization step.

Now let's have a look at the second step, the model. As for the tokenizer, there is an AutoModel API with a from_pretrained method. It will download and cache the configuration of the model as well as the pre-trained weights. However, the AutoModel API will only instantiate the body of the model, that is, the part of the model that is left once the pre-training head is removed. It will output a
high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for our classification problem. Here, the tensor has two sentences, each of 16 tokens, and the last dimension is the hidden size of our model, 768. To get an output linked to our classification problem, we need to use the AutoModelForSequenceClassification class. It works exactly like the AutoModel class, except that it will build a model with a classification head. There is one auto class for each common NLP task in the Transformers library. Here, after giving our model the two sentences, we get a tensor of size 2x2: one result for each sentence and for each possible label. Those outputs are not probabilities yet; we can see they don't sum to one. This is because each model of the Transformers library returns logits. To make sense of those logits, we need to dig into the third and last step of the pipeline, post-processing.

To convert logits into probabilities, we need to apply a softmax layer to them. As we can see, this transforms them into positive numbers that sum up to one. The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The first probabilities, index 0, correspond to the negative label, and the second, index 1, correspond to the positive label. This is how our classifier built with the pipeline function picked those labels and computed those scores. Now that you know how each step works, you can easily tweak them to your needs.

So let's see. All right, do we have any questions at this stage about the pipeline? One of the things that we saw is that there are these three components; there's a pre-processing stage... okay, great. So one of the first questions we have is: could you please explain the intuition behind the BERT SST-2 English checkpoint, what are the different flavors of checkpoints to be used, and how did we choose SST-2? Okay, great. So basically,
each of the transformer models has a pre-trained base, or pre-trained backbone, which I think you saw in the first chapter. What we typically do with models like BERT and GPT is fine-tune them on a downstream task. The idea is that you take, for example, BERT, which was pre-trained on Wikipedia and the BookCorpus, and then you say: okay, I want to do classification now, so I'm going to take these weights that I had in my original model and add a classification head, which is just going to be a linear layer that allows us to do the classification task, and then we do the fine-tuning step on a particular task. If you want to understand a little bit about how the models work, or the description of the models, let's look at BERT... what was this guy called? This is BERT uncased fine-tuned... ah, DistilBERT. Okay, DistilBERT uncased fine-tuned... what am I missing here? Ah, distilbert-base-uncased-finetuned. Okay. If we look at this, what we can see is that this is a checkpoint that was fine-tuned on a particular task. This task here is called the Treebank task, the binary classification benchmark (SST-2), and I think, from memory, this is just a sentiment analysis task where the data is given in terms of just two labels, positive or negative. So the basic answer to why we chose this is that we were just trying to demonstrate how the pipeline works for sentiment analysis, and this is one model which is well suited for that task. I hope that answers your question, dk crazydiv.

The other question we have is: in this case we assume that there are only two classes for classification; how do we specify a multi-class problem, and what checkpoint would you use? Okay, great, that's a very good question. Maybe what we can do is have a look at the Colab for this
chapter. Here we've got a sentiment analysis pipeline, and of course it's just going to predict two classes. Now I'm just going to instantiate the tokenizer, and here's the model. If we look at a model, every single model has a config, and this config tells you things like the number of classes, so you can see we've got two classes here. What you can do when you instantiate a model for text classification is define the number of classes you would like. Just to give you an example, let's suppose that I take a checkpoint for multi-class. I'm going to do two things here: I'm going to show you first how we instantiate a model that we would then fine-tune ourselves, and then I'll show you the simpler case where we have an existing pre-trained model.

Imagine I just have my own dataset and there's no model on the Hub that is suitable for what I want to do. What I might do is say: okay, I'm going to take distilbert-base-uncased. This is just the pre-trained model; there's nothing special about it, and I still have to do some work. Then, from transformers, I'm going to import an auto model, but now I'm going to do it for sequence classification. Any time you're dealing with text classification, multi-class, multi-label, these things, this is a sequence classification task. Then I'm going to take my model for sequence classification and do my usual from_pretrained, take my checkpoint, this new one, and then I can pass keyword arguments that specify how many labels I'm dealing with. Imagine that my dataset has six classes. What I can do is say the number of labels is six, and what will happen now is it will download the base model, or the pre-trained model, for
DistilBERT, and it will then add a classification head on top of this model, and it will configure it with the right number of classes so that we can do fine-tuning appropriately. Now if we look at our config, you can see that it has already initialized the model with six different classes. We don't know the labels yet, because we haven't provided our own dataset and our own labeling convention, but we could do that, and then from here we could just fine-tune and train the model exactly as we will do in the next chapter. So that's one way of doing it.

The other part of the question is how do I take a pre-trained or fine-tuned model from the Hub, and it's a little trickier to figure out which model is suitable for your task. The way I usually do it is I look, for example, at text classification, so I do a filter here on text classification, and then I ask myself: okay, maybe I'm dealing with... let's see. Now, this isn't so easy, to find a multi-class example. So actually finding the multi-class model that is suitable for your task takes a bit of work. Maybe Omar already knows a fast way to get this, but generally speaking, all of the models that we have here are in some sense fine-tuned on a task. For example, this German sentiment BERT presumably has two classes, and one way you could quickly check that is by looking at the files and versions and seeing in the configuration how many labels you have. In this case there are three labels. But actually searching for this effectively on the Hub, I'm not sure; maybe there's a way of doing this, or maybe this is a good feature we should add to the Hub. So, i am holmes, I hope that partially answers your question, but if not, feel free to write in the chat. Yeah, exactly, we should add a feature for this, it's great. I think what we would like is a filter where we could filter between binary
classification, multi-class and multi-label, and then that would allow us to refine things. But good questions, awesome.

Okay, so are there any more questions about the pipelines before we look a bit more at the code? Okay, in that case, let's walk through this Colab with the pipeline to get a deeper understanding of what's going on. We've got this example here where we're downloading the sentiment analysis pipeline, and we've now got the classifier, which we can feed these two texts that you saw in one of the earlier chapters. But now what we want to do is understand what is really happening under the hood. Remember that the first thing we need to do is pre-process these raw texts, because neural networks can't do operations on raw text. Imagine you want to do matrix multiplication: how do you do that on a string? So what we do instead is use a tokenizer.

One of the key things you should remember is that if you're doing any fine-tuning, or any inference or predictions, it's really important that the checkpoint you use is the same for the tokenizer and the model. That's because when these transformers are pre-trained on a large corpus, there's a corresponding tokenizer that was also trained, in some sense, to learn the vocabulary of that corpus. So if you mix and match a checkpoint for the tokenizer and a different checkpoint for the model, you'll get a mismatch in the vocabulary, and then you'll get garbage in your outputs. That's one thing to watch out for.

Okay, so we've got a tokenizer, and now we've got these same raw inputs. If we feed these two sentences into the tokenizer, there are generally two things that you need to remember. You're going to get something called
input IDs, and these input IDs are basically a mapping of every single token in our sequence to a unique number, or a unique integer to be precise, and this is a mapping in the vocabulary. Imagine that I was thinking about the whole English language, where I'm just dealing with words. Then I'm going to have probably several hundred thousand words, or tokens, in my vocabulary, and if I get the word "whole", I would like to be able to match that to a number that corresponds to this mapping in the vocabulary. But as we saw, I think, in the first chapter, or in fact we might see it today as well, this kind of tokenization in terms of words is not very efficient, so what we usually do is something a bit cleverer. But the basic idea is that every single token in this input is going to be mapped to a number, and those numbers allow us to distinguish between different tokens in the sequence. So that's what input IDs are.

The other thing that you're going to see today in more detail is something called an attention mask. I'll explain a bit more later on what this is really doing, but you can already see that it's putting a bunch of ones at some part of the sequence and a bunch of zeros towards the end of the sequence, and this will become clearer later on.

Okay, so we've got the tokenizer, and we've now converted our raw text into these IDs, these numbers we can operate on. Let me just make sure I load the correct checkpoint here. Now we're going to load the model; this is the thing that will process these inputs. Let me just delete this. Okay. Then the question is how do you feed your inputs to your model. The simplest way is to take this dictionary that we have here, which has two keys, input_ids and attention_mask, and then use the standard Python unpacking operator to feed all of the keys and values to the model. When we do this, this will basically feed
the inputs to the forward pass of the model to generate the outputs. One way we could look at that, I think we can probably do this: if we look at the forward, you can see here in the Colab it's showing us what arguments this forward pass can accept. It tells us we can accept input IDs, we can have an attention mask, and then there are some more sophisticated or advanced things we could also provide, which we don't need for today, but just so you know, there are other things you can do. So you can see that we need to provide at least these input IDs and the attention mask, and when we do the unpacking like here, this will run through the forward pass and produce some outputs.

As we saw in the video, these outputs are called hidden states, and these hidden states are just a, let's say, compressed representation of the text. We're taking this raw text, converting it first into numbers, and then converting those integers into dense vectors. So every token is now associated with a vector, and in this case we've got 16 vectors per sentence and each vector has 768 dimensions, and that's just because of the way BERT, or DistilBERT as well, was pre-trained.

Let's have a look at one of these vectors. We've got outputs, so I'm going to take the first sentence, so that's the first index, and I'm going to look at the first token of this sentence. And if you look at this... ah, "must be slices or integers". Ah, okay, because I need to do last hidden state. Okay, good. Actually, let's take one step back. If we just look at the raw outputs, you can see that in Transformers all the outputs from the models are usually wrapped in an object, which is something we can then index by attribute name. Here we've got something called the base model output, and this has, in this case, just a single
attribute called the last hidden state, and that's the tensor. So if I then access this last hidden state, now I've got a tensor which has the thing I wanted. I'm going to get the first sentence, and I'm going to get the first vector, or sorry, the vector corresponding to the first token, and this is now this huge thing of numbers from negative to positive, and this should have a size of 768. Where are we? Yeah. So this is the numerical representation of the first token in the first sequence, or the first sentence, we passed. Okay, let's just check, are there any questions? Okay, cool, let's carry on.

So these are the numerical representations produced by the model, and as we saw in the video, these numerical representations by themselves don't let us do things like text classification. They just say the numerical representation of this token is blah. If we want to do classification, we need to take these feature vectors and combine them with a classification head. The whole Transformers library is built around this idea of taking a model for task X, where task X can be things like sequence classification, question answering, summarization, translation, and so on. In this case, when we instantiate a model with sequence classification, as we saw before, this is now going to create a model which has a number of labels. You can see here we've now got a model with two labels, because that's what this pre-trained checkpoint has. Then when we look at the outputs, instead of having just these last hidden states, we've got logits, and these logits are what happens when you feed these feature vectors through this linear layer. This will compress these 768-dimensional vectors into just two numbers, or project them into two numbers, and these are the things that we can then use to
derive probabilities and figure out, for example, which class is the most likely. You can see here that this one is more likely than this one, and vice versa, because I think the second example is a negative sentiment. So that's more or less how we think about the outputs from a model versus a model with a classification head. And here what we can see is that if we want to convert our logits into probabilities, we can just take a softmax over them. You may remember that a softmax takes all of the inputs, exponentiates them, and then normalizes each exponential by the sum of all the exponentials, so you end up with something that ranges from zero to one, which makes it a good candidate for a probability. If we do that, we get probabilities for each of the two sentiments, and we can also see this is the way we can map from the label ID, which tells you what zero means in terms of something a bit more meaningful.

Okay, let's have a look. Great, so we have a question from srm sumiya, which says the classification model should take the output from the DistilBERT model. That's exactly right. In fact, let's have a look at this. I'm doing this for BERT, but it's the same for DistilBERT. If we take BERT model for... where is it... for sequence classification. If you look at what this model actually has, it has the BertModel that we saw, or the DistilBERT model we saw in our example, and then it just applies dropout and a linear layer. The linear layer has an input dimension of the hidden size, so the 768, and then it's going to compress that into just these two numbers defined by the number of labels. If we look down at what happens inside the forward pass, the first thing we do is get the outputs from the BertModel, so these are just the feature vectors, these 768-dimensional vectors. You can skip most of this stuff; the main point is that
here we... well, don't worry about the pooled output. The main thing is that we feed these outputs into the classification head to produce the logits. So that's a great question.

Okay, we've got a question from platon shiva: how can we see what the token representation means in the text? Cool. Maybe just to show you something; maybe we get ahead of ourselves, but that's okay. We've got these raw inputs, which are given by these strings, and then we get these input IDs like this, right? One thing we could do if you want to go backwards, and we're going to see this later, is I can say tokenizer and I'm going to decode, so I'm going to do the opposite of what I did before, and now I'm going to take my input IDs and, fingers crossed... I need to do input_ids. And now you can see that by using this decode method we're able to reverse the process and get back the raw text. But what it does is it also introduces some special tokens. One is called the [CLS] token, which just tells you this is the start of the sentence, and then we have a [SEP] token, which is used to distinguish between pairs of sentences. So this is one way you can go back to where you started. And if you have more questions, we can tackle them as we go ahead.

Okay, cool. So that's the first look at how the pipeline works under the hood. Now let's have a look at the models in more detail. I'm going to start by watching this video, and then we'll pause for questions and then again look at some code.

How to instantiate a Transformers model. In this video we'll look at how we can create and use a model from the Transformers library. As we've seen before, the AutoModel class allows you to instantiate a pre-trained model from any checkpoint on the Hugging Face Hub. It will pick the right model class from the library to instantiate the proper architecture and load the weights
of the pre-trained model inside. As we can see, when given a BERT checkpoint we end up with a BertModel, and similarly for GPT-2 or BART. Behind the scenes, this API can take the name of a checkpoint on the Hub, in which case it will download and cache the configuration file as well as the model weights file. You can also specify the path to a local folder that contains a valid configuration file and a model weights file. To instantiate the pre-trained model, the AutoModel API will first open the configuration file to look at the configuration class that should be used. The configuration class depends on the type of the model: BERT, GPT-2 or BART, for instance. Once it has the proper configuration class, it can instantiate that configuration, which is a blueprint to know how to create the model. It also uses this configuration class to find the proper model class, which is then combined with the loaded configuration to load the model. This model is not yet a pre-trained model, as it has just been initialized with random weights. The last step is to load the weights from the model file inside this model.

To easily load the configuration of a model from any checkpoint, or a folder containing the configuration file, we can use the AutoConfig class. Like the AutoModel class, it will pick the right configuration class from the library. We can also use a specific class corresponding to a checkpoint, but we'll need to change the code each time we want to try a different model architecture. As we said before, the configuration of a model is a blueprint that contains all the information necessary to create the model architecture. For instance, the BERT model associated with the bert-base-cased checkpoint has 12 layers, a hidden size of 768 and a vocabulary size of 28,996. Once we have the configuration, we can create a model that has the same architecture as our checkpoint but is randomly initialized. We can then train it from scratch like any PyTorch model. We can also change any part of the configuration by using
keyword arguments. The second snippet of code instantiates a randomly initialized BERT model with 10 layers instead of 12. Saving a model once it's trained or fine-tuned is very easy: we just have to use the save_pretrained method. Here the model will be saved in a folder named my-bert-model inside the current working directory. Such a model can then be reloaded using the from_pretrained method. To learn how to easily upload this model to the Hub, check out the push to Hub video.

So, any questions so far about loading and saving models before we dive into some code? Just to summarize what we saw in the video: whenever we do this from_pretrained method with the model, we first need to get a config, and we saw that config just a couple of minutes ago. It defines things like the mapping of the labels to the IDs, how many labels the model has, how many layers, all those things. That config is then used to load the weights of the model, so that it makes sure everything is configured in the right way. Once we have this model, we can save it and use it for other things. If there are no urgent questions right now, I'll have a look at the models code. Just as a mention, you can watch these videos in your own time and work through this text, but I think it might be more useful if we just have a look at the code. So let's just check I can run Transformers. Okay.

One thing maybe to mention is a really common situation that you'll find yourself in: you've trained a model and now you want to share it in some way. The sharing, at least when I was working in my previous company, was much more about deploying this model so that you could serve it, or produce predictions that other services could consume. So once you've saved your model, the question is: okay, what the hell do I do with this thing? As we can see here,
this save method will save two objects: it will save a config.json file, and it will also save a pytorch_model.bin file. The latter is something in PyTorch called a state dictionary, which provides all the information for the layers and the weights. If we want to use this to produce predictions, the first thing we need to do is what we've always been doing: we take some input text, convert it into input IDs, and then convert those input IDs into tensors which we can feed to the model. Previously we were doing this using the tokenizer, and that's exactly what you would also do in practice, but in this example we're just showing the outputs of the tokenizer. Let's have a look at what that looks like in code. Let's check if there are any questions. Okay.

Maybe just to quickly summarize: you can load your configurations in two different ways. You can load your model directly from one of the default configs in the library, and this will provide you with a summary of the hidden size and so on. But if you do this, the model is completely randomly initialized, which means all the weights are just random and this model is going to be garbage; it's not going to help you make any good predictions. This is what you actually do when you want to pre-train a model, or train a model from scratch. In practice, most of the time what you're really doing is using from_pretrained, and this will initialize the model with the pre-trained weights and the correct head if we need it.

If we wanted to do, say, predictions, let me just instantiate this. Let's suppose that I've got my model and I'm happy with it, and I want to save it so I can deploy it somewhere. Let's just wait for this model to download. Okay, good. Then what I could do is save my model, and this is just some path on
your machine so if we now look inside the file system we can see that we've got a directory called directory_on_my_computer so now if i have a look at what's inside that directory i've got these two files i've got this config json and i've got this binary file called pytorch model and so what we can do now is we can take that folder and we can you know wrap it up zip it up put it on a machine and then if we want to get new predictions then what we do is we take our tokenized inputs we then convert them into a tensor because all the pytorch models expect torch tensors and so if we look at this model inputs it's just going to be a tensor and then we feed these inputs to the model and then this is now what would constitute a prediction and then you can you know do whatever you want with that prediction maybe use it to make some sort of decisions or maybe use it to feed a dashboard basically the sky's the limit and that's more or less how you generate predictions it's pretty straightforward so let's have a look we've got a question here um out of interest how long would it take to train bert from scratch and can you do it on colab okay so i think it really depends on the size of the corpus that you want to use so for example bert was trained if i'm not mistaken on all of english wikipedia and a corpus called the books corpus which is sort of scanned library books and i think let me think so you know let's do something like this why don't we find the answer um on the fly because i don't remember off the top of my head how long it took them to do it and there's nothing better than live reading papers okay so here's the bert paper um and let's have a look at i'm guessing they use tpus okay so they say here that they trained uh bert base on four cloud tpus so this is 16 tpu chips and each pre-training took four days to complete so i think um from memory uh the cloud tpus you get on colab are
just one tpu chip so sort of roughly speaking it would take you maybe 16 times four so 64 days to train on colab you know with the same corpus but i don't think so yeah i'm not sure if there's a quick bert training however i will show you something um there's a blog post by hugging face um let's see on training a model on esperanto so i'll chuck this in the chat so can i do that okay so uh this blog post it uses a slightly older api but the basic idea is to show you that you actually can train a bert model in a colab as long as your corpus isn't too big so this is esperanto which is a special language um that you know has much less text than english but i think from memory uh this was trained in just an hour and a half maybe a few hours so let's see okay maybe we don't see it here we just have to look at the colab um let's see so the training of this model okay so yeah this training took almost three hours um so it really kind of depends on the size of your corpus so in principle you can but if you want to do something that's like say as powerful as bert then you're going to need some more serious hardware okay um so there's another question by i am homes i understand that transfer learning or using a pre-trained model is the way to go instead yes that's exactly right so um the sort of real power of transformers and nlp nowadays in general is that we don't really want to do pre-training ourselves because again it's expensive and takes a long time so i would almost always use a pre-trained model if i can um the only time you might really be stuck is if you're dealing with like a domain that's very different from any pre-trained model that exists so for example suppose i was trying to train a model on like source code so uh you know in the early days of transformers there weren't any pre-trained models on source code like you know trying to for example understand python the language and so
then you know using bert base on english and then trying to transfer to source code might be a bit tricky it might not give you very good results and so if you you know trained on a source code corpus that would give you better results and the other example where you generally need um to find an alternative is if you're dealing with a language that is not one of the sort of commonly supported ones so um my understanding is that there's like many languages for example in africa which um aren't really represented highly in wikipedia and so then this is hard for people to train transformers on and then you typically need to do some sort of tricks to like take something that is multilingual like a multilingual version of bert and try to somehow adapt it to your language but these are generally um you know more advanced things that um we can talk about later okay so um let's see so where were we we um have looked at how we can ah another question can we change the config parameters of a pre-trained model and use it yes but with some caveats so for example um let's think about what can we change and what can't we change so i want to make sure i don't say something silly so let's have a look at the model config we have here so this is the config associated with bert base and here you can see that there's a bunch of hyperparameters that were associated with the pre-training of this model so for example let's see so i have a suspicion that if we change many of these things we're going to break the model in a non-trivial way however let me think what happens if we change the number of hidden layers so you know what let's try the usual way of doing things in deep learning is just to try so um i'm going to try to change so bert has a number of attention heads so i'm going to see what happens if i reduce the number of attention heads from 12 to 6 let's see if this works so let's have a look at the config to make sure
that worked so now we've got attention heads six now what happens if we try to feed some inputs to this model okay okay so interesting okay so it seems that we can change the config and things work in the sense that we don't get errors but i have a suspicion that hacking into a pre-trained model like this would affect the performance in some non-trivial way because if we think about um what happens when we do something like text classification we're taking the whole base model of bert and then we're just stacking on top of this um the classification head and if i start kind of like you know doing an operation like dissecting bert into pieces or something you know reducing the attention heads or changing the number of transformer layers so bert has 12 encoder layers i have a suspicion that it would probably have some sort of non-trivial or negative impact on the downstream task like classification that i want to fine-tune on but um maybe omar has a different insight here okay so that's a good question i've actually never hacked into a pre-trained model this way um you know maybe you could try and do some experiments like what happens if i completely change the number of layers or the number of attention heads in bert and try to do classification like sentiment analysis do i get better or worse performance i have a feeling it'll be worse but it'd be a cool thing to check and if you do check please share it on the forums okay so that was the um look at sort of how we generate predictions let's now have a look at uh the tokenizers in more detail so let's cross our fingers that the internet still works okay in the next few videos we'll take a look at the tokenizers in natural language processing most of the data that we handle consists of raw text however machine learning models cannot read or understand text in its raw form they can only work with numbers so the tokenizer's objective will be to translate the text into numbers
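To make "translate the text into numbers" concrete, here is a minimal sketch in plain Python. It is a toy illustration only: the helper names `build_vocab` and `encode` are made up for this example, and the vocabulary is built from the input sentence itself, whereas a real tokenizer ships with a fixed vocabulary learned on a large corpus.

```python
# Toy illustration only: real tokenizers use a fixed, pre-built vocabulary.
def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated token."""
    vocab = {}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab)  # next unused id
    return vocab


def encode(text, vocab):
    """Replace each whitespace-separated token by its integer id."""
    return [vocab[token] for token in text.split()]


sentence = "the text is translated into numbers by the tokenizer"
vocab = build_vocab([sentence])
ids = encode(sentence, vocab)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 0, 7]
```

Note that the repeated word "the" maps to the same id both times, which is the property a model relies on: the same token always gets the same number.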
there are several possible approaches to this conversion and the objective is to find the most meaningful representation we'll take a look at three distinct tokenization algorithms we compare them one to one so we recommend you take a look at the videos in the following order first word-based followed by character-based and finally subword-based okay so let's get out of this okay so that was like a high level overview of what we're talking about that there's this general process we have to go through of converting text into numbers um there's a bunch of videos in this section that you can look at which show the different ways you can tokenize text can you guys see me or not okay good great all good yeah the joys of home office okay so um what i was saying is there are different approaches or strategies you can take for tokenizing text and the advantages and disadvantages of them just depend on the application you're interested in so i'm not going to go through the videos you can watch these yourselves um but let's just have a quick look at the sort of three most popular approaches so the first thing i might imagine is if i've got a text like jim henson was a puppeteer then what i might do is say okay i just want to split this text into words and in english a simple trick to do that is just to split on white space so most of the time in english if there's a white space that's the boundary between words and then this would convert for example jim henson was a puppeteer into these five tokens so in this case a word is a token but there are several languages where this is like a terrible idea so for example if you have ever learned japanese you have characters called kanji and these kanji don't have any spaces between them it's just a sequence of kanji and in general they're actually not even written from left to right they're written from top to bottom so doing this kind of splitting or
tokenization in terms of white space just wouldn't work and so an alternative approach is to try something called character based so this would be like imagine you just split every letter in an english sequence into its own token and this would actually be then quite good for japanese because every character is a kanji character which then you know we could represent with a token and so the kind of thing you can see here is that it really the sort of tokenization strategy seems to really depend on the language that we're studying and so the thing that like a lot of research has gone into is trying to find something that gives you like a good trade-off between these two kind of extremes of word tokenization and character tokenization and maybe i should also mention a couple of drawbacks before we go into that so one of the drawbacks with word tokenization is that this will create a vocabulary which is the size of the number of words in our language so basically if we have imagine we just tokenize english then we will need a token for every single word in the english language and this um this is generally huge it's going to be several hundred thousand tokens which makes it very like computationally expensive um but the other thing that's kind of not great about this is that it doesn't make any sort of distinction between like um like i don't know dog and dogs um which are kind of like you know similar words and we're kind of representing them now with two independent tokens so that's the drawback with the word ones and the character based ones have the drawback that um the the model has to basically learn what a word actually means because the only thing it gets now are characters or gets character tokens and then it has to figure out over over training that okay if i put together these characters in this order this seems to represent like a more um abstract object like a word and so this at least for english would be not a great strategy so most tokenizers they use 
something called subword tokenization and the basic idea is that instead of just splitting on word boundaries or on characters you basically decompose a word into subwords and an example here is like let's take the word annoyingly so annoyingly can be represented as maybe two subwords annoying and ly and then what we can do is we can just kind of collect the frequencies of these subwords and then use this to figure out basically what are the most frequent subwords in the language and then we can use those subwords to build back the full word itself so if you know that you've got annoying and ly you can then reconstruct annoyingly from these two components and so like i guess there's an example here you can sort of split let's do tokenization into these subwords so you can see this is kind of a mix of a word token with a subword ization and we've also got the exclamation mark being treated um as its own separate token and the sort of most common tokenizers that you would see oh there's a good question i'll get to that are things called wordpiece which is the one that bert used or sentencepiece which is the one that gpt and the gpt models typically use so there's a really good question how do you design the subword boundaries is it manual um so this is um more or less determined by the algorithm that you choose to use and um i think in general it's a mix of manual rules and also a form of learning from the corpus so let's have a quick look at um let's see i think it's the sentencepiece paper so i'm going to put this in the chat okay so this is one of the most famous um papers on tokenization and let's have a quick look at so how are these boundaries okay um yeah that's right so that's what i remember from this paper so they say that historically um most tokenization algorithms they used manual rules and the problem with this of course is that for every language you need
your own set of rules and it's a real pain to sort of maintain and extend and so if i'm not mistaken um sentencepiece is kind of like a learned tokenizer so you actually have a sort of optimization objective and then you train this like you train a model and so by training this on your corpus you actually learn the word boundaries um but i haven't read this for a few years and i might be forgetting something but uh yeah that's a good question and i think maybe something that we can um add in a future version of the course okay so where were we so we were looking at these different tokenization strategies so let's maybe uh look at the colab um so one of the things i often like to do is to sort of capture the outputs of my pip installs on colab so i don't have this humongous mess of installation okay so what you can see here is what we were talking about before this is if you just do wordpiece tokenization oh sorry uh splitting into words and now we can have a look at what the bert tokenizer does and there are two ways you can do this in transformers you can specify the specific class that you want to use for the tokenizer and this is if you you know happen to be maybe doing something very specific and you just really want to make sure you get the bert tokenizer but the thing that i personally use all the time is just the auto tokenizer because this will automatically convert the tokenizer into this class anyway so if i provide a checkpoint and it can identify that it will then automatically load it this way okay so if we take a tokenizer um it converts the text into these input ids um but now let's have a look at something here so why are we doing this twice okay okay good so what we're doing here is we're just taking a sequence of text and then we're extracting the tokens as a list and so you can see here that um in the case of bert which uses this wordpiece tokenization algorithm the way it figures it
distinguishes words from subwords is using this uh double hash symbol so you can see here that in the vocabulary of the tokenizer um it has learned that it's good to split words between trans and everything else and if we wanted to reconstruct um these two words we just need to know that this double hash means that this former belongs to trans to build transformer and so one way you can reconstruct the sentence is you can take your tokens and you can convert them into input ids like this so this will create these ids and then you can decode these input ids to build back the original string okay um another way you could do this is let's have a look where we have our inputs where are we okay so another way you could do this is if i take my tokenizer and i just tokenize my sequence then this produces what we saw before um and then what i could do is i could go tokenizer dot decode and pass in the input ids from my inputs and this should return what we saw before and now you can see the difference between this approach and the one here is we don't have these uh special tokens so if you don't want these to be present i think we can do skip_special_tokens=True and then this will give us back the original sequence cool so that's um more or less a sort of deep dive into the tokenizers um maybe one thing to mention uh let's have a look at a different tokenizer so you get an idea of what you might also see so let's find a gpt model that is not going to blow up the colab so gpt2 let's do maybe this one so i'm going to just take a tiny gpt you can also copy the name of the checkpoint which is quite handy so we are here so what i'm going to do is i just want to show you the difference between the gpt model and the way it tokenizes so hopefully this works yeah so gpt has a kind of very quirky um tokenizer where um it uses this weird symbol it's like a g with a little dot on top of it and this is what it uses to indicate um that there's a white space
between this token and this one so you can see that it's saying okay using and the
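The double hash convention described for the bert tokenizer can be sketched without downloading any model. The following is a toy example in plain Python, not the library's implementation: the vocabulary is hypothetical and hand-written to reproduce the trans / ##former split seen in the session, and the function names `wordpiece_tokenize` and `detokenize` are made up for illustration. It splits each word greedily, longest match first, which is the matching strategy wordpiece uses at tokenization time once a vocabulary has already been learned.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def detokenize(tokens):
    """Glue '##' pieces back onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


# Hypothetical vocabulary, chosen only to mirror the split from the session.
vocab = {"using", "a", "trans", "##former", "network", "is", "simple"}
text = "using a transformer network is simple"
tokens = [p for w in text.split() for p in wordpiece_tokenize(w, vocab)]
print(tokens)
print(detokenize(tokens))
```

With this vocabulary the word transformer comes out as the two pieces trans and ##former, and `detokenize` rebuilds the original sentence by attaching every `##` piece to the token before it. How the vocabulary itself is learned from a corpus is not shown here.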
Original Description
This is a recording of the twitch session on June 23rd 2021.
Chapter 2 of the course: https://huggingface.co/course/chapter2
Have a question? Checkout the forums: https://discuss.huggingface.co/c/course/20
Subscribe to our newsletter: https://huggingface.curated.co/