Talks # 3: Lorenzo Ampil - Introduction to T5 for Sentiment Span Extraction
Key Takeaways
The video discusses the use of T5 for sentiment span extraction, a text-to-text approach to NLP problems, and demonstrates how to fine-tune the model for this task using PyTorch Lightning and the transformers library. The speaker also covers the benefits of using T5, including its ability to perform multiple NLP tasks with a single model and its state-of-the-art results in various NLP tasks.
Full Transcript
okay so hello everyone and welcome yet another time to the third episode and again a very special episode so this time we are going to learn about t5 from lorenzo and he is a machine learning product manager and data scientist at thinking machines which is an ai consulting firm with operations in singapore and manila and i'm very thankful to him because it's too late for him right now and he would like to go to sleep but instead he's doing this for us and i'm very thankful for that so he has a lot of interest in machine learning and in nlp and he's working in various different industries and prior to this he set up his own consulting practice where he provided end-to-end data science solutions for finance and tech companies in southeast asia and australia and he has also worked at uber as an analyst where he has handled a lot of projects related to nlp analytics and automation for the asia-pacific region and in this talk he's going to talk about t5 and i'm pretty excited i saw some of his solutions and not the solutions but like some example output and i was super happy to look at them and i was thinking of working on t5 a couple of weeks ago but then i saw a tweet from lorenzo and i thought okay i would rather learn from the expert than try to be an expert right so he had already worked a lot on t5 so uh one more thing is you can also come and give a talk in talks you just have to fill out the form and you can find it in the description box and uh now i'm going to um now over to you lorenzo i think it's all yours now okay sounds good thank you so much for that um introduction abhishek i'm just making sure you guys can hear me uh okay hi guys i'm enzo i think we can get started let me just share my screen yeah so hi guys so let's get started with um this talk on t5 for sentiment span extraction um so before we get into the code i actually prepared a little um introduction for you guys so i actually have three goals for
this talk um it's not just to show you guys a solution that you could use for kaggle um so i have three goals so first is to just give you guys a high level introduction for t5 and how it works you know some of the plumbing behind it some of the concepts around it what makes it novel what makes it interesting and from there i was hoping to share some thoughts that i have around why i think it's significant for the future of nlp and why i think um you know ai in general for you know natural language processing it'll really be impacted um in practice because of this new technology uh and then lastly um as you see in the title i'm going to be illustrating to you guys how to use t5 for sentiment span extraction so okay let's get started so for a quick overview for t5 so basically t5 is a recently released model that reaches state-of-the-art results by solving nlp problems with a text-to-text approach so when we say a text-to-text approach what we really mean is that um every nlp problem it tries to solve it using text as input and text as output and this was introduced in a recent paper um published and it's entitled exploring the limits of transfer learning with a unified text-to-text transformer so why am i talking about t5 with you guys right now so the reason why i've been working on t5 recently and really been interested in it is because after i did the first read of the research paper just a few weeks back actually okay sorry yeah so the reason why um i've been really interested in t5 is because of two things really one is the text-to-text um framework that it uses i really see it as a universal interface for nlp tasks um and it's this paired with the increased potential for multi-task learning um i think it really makes multi-task learning a lot easier and a lot more accessible and because of that i think it's going to be used more in practice because of this new model so let me just elaborate a bit on that so just for some key points
from the t5 paper so we have this illustration that actually came from the paper itself um and the first key point i want to discuss is that you know it treats each nlp problem as a text-to-text problem as i already said um and the illustration actually um makes this a bit more clear so for example so you'll see multiple boxes connected to t5 the model and each color is a different task so for example the first one translate english to german that is good the output is das ist gut which is the actual translation next we have a cola sentence so and then the sentence is the course is jumping well for those that don't know a cola sentence is basically this task where the objective is to um identify if the sentence is grammatically correct so in this case it's not acceptable because courses don't jump um another task over here as an example is a sts sentence so the idea is given two sentences identify a given number that indicates how similar the meaning of the two sentences are in this case it's 3.8 over five so you know and lastly we have a summarization a classic summarization case so you'll notice that for all of the inputs the inputs are text and for all the outputs it's text even when the output is a category and even when the output is a float a number in itself so that's what's really interesting it turns every problem into a generation problem from an nlp standpoint and so this brings us to the next aspect which is it's a unified approach to nlp deep learning so when i say unified approach what i mean is that for different tasks you can technically use the same architecture for t5 um in most models when you have five tasks you typically have to train five different models or um in recent times what you would do is you would use the same pre-trained model let's say bert but you still have to create a head architecture that's specific to each of the five tasks um and so you would have to train five separate models and you would have to create an
architecture head for each of them and that takes some time but then with t5 you can actually use the same architecture across all those five tasks um and that means that you don't need to be changing the architecture as much anymore because the structuring of the task um can be reflected solely in how you structure the text for example in this case you specify translate english to german so that's really cool it really lowers the cost of training the new tasks um and lastly i put over here um just under this section is multiple nlp tasks can live in the same model so just to clarify this is not the first um model that has been taught multiple tasks but because of this text-to-text framework it is now a lot easier to teach one model to do many things um because of the text-to-text approach so that's really interesting so if before you needed five models now you can have one and it's performing all of those tasks whether it's translation summarization or you name it semantic similarity all those kinds of things um see i just wanted to show here so this is a chart showing no just a table showing um how t5 actually reaches state-of-the-art results although uh more for its largest version so you'll see in the leftmost column these are the different versions of t5 depending on the number of parameters and therefore the size of the model so you have the small base large uh 3 billion 11 billion and it's the 11 billion one in bold that actually reaches state-of-the-art uh what we're going to be showing today is uh more of the usage of t5 base since this is a more practical model to be using with a standard computer with you know not the most state-of-the-art gpus but still you know an acceptable gpu so next just some details some other details around the model um the data set that was used is called colossal clean crawled corpus also known as c4 so it basically contains 750 gig of clean english text scraped from the web um and there's actually this corpus called common
crawl and it's basically one month of data from this and it's really a lot of data um and even if you compare it to previous models like bert and xlnet it's really a lot um yeah so it's really a model that benefited from a lot of data and we're seeing that pattern more and more now lastly just a bit about its pre-training objective which is you know a standard thing with transfer learning so it uses a simple denoising training objective um and this can be illustrated with the picture below so if you have um a sentence that looks originally like the blue text so you'll see it says thank you for inviting me to your party last week so how we turn this into a pre-training task is we convert it into an input where we mask some of its tokens um and some of the tokens can also be beside each other in this case we mask for inviting and last and we um we specify an id for each of those masked tokens we call these sentinel tokens by the way so it turns into thank you x me to your party y week so you see we've masked those words you can't see it in the input so the idea is that given this input we are now trying to predict the output which is each sentinel token followed by the masked tokens so in this case x followed by for inviting and y followed by last so in this case it's actually a pre-training objective that is also generative in nature so given this input output this text that's what we're doing and you know imagine applying this to 750 gig of data from the internet that's how pre-training works here and finally just last key point is um the architecture is actually really interesting um and also not something super new so the transformer architecture has been a really important uh piece of work for the development of natural language processing in the past few years with architectures like bert and gpt-2 and what's interesting about this architecture is that it's using the whole transformer architecture so what we mean by that
is it includes both the encoder and the decoder so this overall um diagram i'm showing here is the overall transformer on the left in red is the transformer encoder which is um what is being used by bert actually um in its pre-training and with the overall architecture it uses layers of transformer encoders on the right you see the transformer decoder which is actually used by um the gpt-2 architecture so you'll see that you know typically models use one of these but then with t5 it's found that it has great results using both of them so that's also something that's quite interesting so with all of these key points um the key insight for me really and i'm just repeating this from earlier but this is what i find amazing about this model is this idea that you can have multiple nlp tasks learned by a single model and in a very low cost way because of the text-to-text format associated with it so that's a key insight for me and because of that um the expected impact i have for the future of nlp because of you know this model and future models that have the same pattern is that multi-task models like t5 um these kinds of models will cause lower training times like lower time spent on experimentation training models it will also mean lesser compute and lesser storage costs so lesser storage costs why because if you can do more tasks with less models then you need to store less models and less compute because you need to train less architectures as well so i think yeah that's something i find really interesting about t5 so from here we're done with introduction in terms of the high level overview of t5 i hope you guys learned quite a bit from that to give you some context now we can go straight to how do we use t5 um for actual sentiment span extraction so just some overview notes on that so as most of you probably know by now this data set comes from kaggle um it's a competition called tweet sentiment extraction
it's linked over here um so what's interesting is that most of the existing model implementations you'll see in kaggle kernels use some sort of token classification task so this is where um what we're doing with the model so let's say it's a bert model and we use a token classification head on the bert model what we're typically doing with the tweet is we assign a score to each of the tokens and the scores will eventually represent the probability that that token is the beginning of the span that contains the sentiment or the end of the span that contains the sentiment so that's typically what you're doing you're predicting essentially the index the beginning and end index for token classification for this extraction kind of task but for t5 the approach is purely generative kind of like a classic language modeling task um also similar to you know summarization translation so just for that i find it's actually quite interesting so the span is not extracted but actually generated given the input so yeah um so for me that's what makes it interesting so we can get started with the code um and just to give you guys a heads up um there's gonna be a lot of code i'm gonna be showing i'm not going to be able to explain everything um but in the cases that there are functions that i feel are relevant i will give some idea of what they're doing sometimes i'll go in a bit more detail but sometimes i will just explain it at a high level and then move forward it's okay you guys can ask me in the q a later um if you want more context or um i'm gonna be sharing this notebook as well after this talk so you guys can go ahead and read this and you know dig deeper into the code that we've written so first we start off with installing the necessary packages so google colab so you guys have probably seen by now that we're using google colab and they have a lot of deep learning related packages already installed by default so the only additional ones we need to
install are the transformers library and pytorch lightning so install both of those after we install um we just get the data from kaggle um won't go into detail here um this should be quite familiar to you guys um yeah so it's just authentication don't worry none of my creds are here make sure this is clean but yeah so we just get the data from kaggle and from here we can start setting up data right so we start by importing our standard numpy and pandas um and our classic train test split from sklearn love using that and we read the train and test set so right after reading the train and test set i just quickly apply the train test split um and you'll see that the proportion i apply to the train set and remember we do this to get the validation set so that we can have some idea of how our model is doing even without submission so if you look at the test size you'll see 0.13 and the reason why we chose 0.13 is that it gives us um a number of examples for the validation set that's quite similar to the test set so yeah so that gives me some confidence that okay whatever the score is it will be somewhat representative for the test set um given that it's a random sample so there you'll see we have 24k for the train set 3.5k for the test and then the validation as well just making sure there's no overlaps just doing some we had a technical issue but we are back now so thanks to everyone who is still here and i hope it won't happen in future okay um all good thanks for getting it back so quickly abhishek okay guys so back to the train validation set so you'll see that um we have our train test and validation set now um and so for the following cells i'm just making sure you know some sense checks making sure that there are no overlaps with the train test validation um you know just printing to make sure that things make sense um and now at this step um i've decided to just check out how the data looks so just to give you guys an idea of how
the data looks especially for the people who are not necessarily familiar with the kaggle competition so you'll see here we're printing the first 10 examples um and i'm printing two things so the first thing i'm printing is the sentiment of the tweet and the second is the tweet itself so actually what we're doing here is given the sentiment of the tweet so in this case negative it's basically the task is for us to find the phrase within the tweet that is indicative of the sentiment that was specified so in this case it's negative and the tweet for example here is how did we just get paid and still be broke as hell no shopping spree for me today so that is an example of a tweet and so if you go look below in this next section um i actually printed out the first 10 um target answers as well so in this case the prediction is broke as hell which makes sense right so how did we just get paid and still be broke as hell no shopping spree for me today and the negative phrase is broke as hell in this case um it also makes sense there's an exclamation point there's a question mark that's usually indicative of some anger or some frustration from someone maybe we can look at one more example so if we look at the third one it's now a positive example so in this case the sentiment is positive and the tweet is i love when my ipod shuffles so all the good songs are all together so in this case um the answer is actually love which is quite intuitive right so yeah so that's what it's doing for each of these um so i hope that task is clear so this is how the data set looks so this is the input again this is the input the tweet and the sentiment corresponding to that tweet and the output the thing we're trying to predict is what's the phrase within the tweet that corresponds to the sentiment that's specified for the next few cells um these are just a bunch of sense checks around the notebook do we have a gpu yes um we have this fancy p100 thanks
to google colab thanks so much yeah so in the next section you'll see that the title is format data to input text q a format so um initially that might be a bit confusing because remember our task is to do sentiment span extraction and all of a sudden we're formatting the input text in a q a format um so let me just give you a bit more context on that so here in this notebook i call it t5 exploration so what i'm really doing is i'm using t5 for some sample tasks that it has already been pre-trained on so if you remember from the introduction you would have seen that squad was one of the data sets so that's actually the stanford question answering data set um and what it does is actually it has this format so look at the q a input so the first section is really um basically it says question colon what does increased oxygen concentrations in the patient's lungs displace and the next part after the question is what you call the context colon and this is really the article and the assumption is that this article contains um the phrase or the word that corresponds to the answer to the question that was previously mentioned so in this case we ran the forward pass here and the answer was carbon monoxide um so actually the interesting thing about this is that q and a in this context is technically span extraction because the answer to the question is inside the context what i was thinking when i chose q a was oh since the q and a task within t5 that t5 was already pre-trained with since it already knows how to sorry it's a bit hard to see okay here so the thing about q a is that since it already knows how to extract spans why don't we just specify our inputs in the same way that a q a task is specified so that's exactly what we do here but instead of asking a question why don't we use the sentiment itself as um the question itself and then just to indicate what is being asked so in this case so for example question um this is going to
be let's say positive so we're implicitly asking oh like what is the section that's positive given like we're asking for something that's positive and given that context which is the tweet itself um hopefully we can fine tune the model to be able to find the span within the context that has um the phrase that corresponds to the input sentiment so we apply this to our train set our test set our validation set um and here we convert it into a new line separated string because we're going to save these as txt files in just a bit um so this is an example of how the input would look so for example here question neutral right so assuming that uh we use neutral um as the question and the context is i'd have responded if i were going um in this case um the answer is the whole span which is actually typically the case when um the sentiment is neutral so just go down and so we do the same things going down yeah so these are more examples so again so this is how we specify it so question negative context how did we just get paid and still be broke as hell so it's the same example but then we formatted it in the same way that we would format a q a task with t5 based on how it was pre-trained so just keep scrolling down so let me just zoom out a bit just keep scrolling down we basically do the same thing for the train set the validation set and the test set um and here we're just doing the same thing for the target so the target is what we're trying to predict remember so remember the answer is broke as hell the only additional processing we did was we added an end of sequence token um and you don't really have to worry about that it's more of a formatting thing it allows us to indicate to the model that um this is the end of the text that we're trying to generate so yes there we have the data then okay so there so now that we have the txt files that have our input data we can finally get started with some of
the aspects of the code that should be familiar to pytorch people so here we're going to be preparing the t5 dataset um and the idea here is that remember we have the txt files which are the data files that we have so the idea is we want to read those files and read them into a t5 data set so the first main function um that i defined here is called encode file so what it's really doing is that given a tokenizer um and given the path to the data and given you know some parameters like max length and pad to max length what we're doing is we're reading those txt files and converting the examples into a tokenized format and so after you input the file it reads the file and it returns it as a list of tokenized examples so that's what it's returning over here and then from here we can define the t5 data set so how the t5 dataset works is it's quite similar to your you know your regular pytorch dataset um the key thing to remember is that we use the encode file function specifically so our arguments for the constructor is just you know the tokenizer the data directory the type of the path whether it's a train file it's the validation file or if it's a test file and some parameters related to the source and the target like for max length um for the source and the target um so for the constructor you see here we're reading in the tokenizer the type path we're using the encode file to read in the files as tokenized examples and we have some special cases in the case that it's a test set but that's not so important right now so yeah so from here if we look at the get item method this just returns the data into a format that's actually familiar to us so it returns it as a dictionary where um there are three things inside so you have the source ids so these are the tokenized inputs um in id format we have the source mask um so if you're familiar with transformers so we typically have masks so that the transformer
knows to ignore the padding tokens and finally we have the target ids so this is what we're trying to predict and so with these inputs we should be able to calculate the loss and you know train the model there are a bunch of other functions in the dataset class as well um just we'll go through them quickly so we have a trim seq2seq batch so that just makes sure that whatever um batch that we return um we take out any columns that are just pure padding um and also we have a collate function over here um interestingly the collate function in pytorch doesn't really have to live in the data set class but in this case we just did it to make it convenient to call later um so what the collate function does is it's basically additional processing to be applied to batches as they're yielded from the data loader later so if you remember for pytorch users um the end um data set well the end format of your data is in a data loader format and at that state um everything that's yielded per iteration it's returned in batches so this function just processes those batches one more time before they're yielded so in this case it's the same thing as earlier we have your source ids the source mask and target ids but now in batch format now we can go straight to the model this is where it's going to get a bit longer in terms of code but just bear with me so we start off with importing our standard packages we also have a set seed function to make sure that whatever we do is replicatable we also have an extra jaccard function so for those familiar with the kaggle competition this is the evaluation metric for the competition so i just defined this function so that at the validation step later we can also calculate uh the jaccard score so then here we have the familiar um t5 module um but instead of inheriting from the typical nn.module we're actually using something called pytorch lightning so it's essentially the same um pytorch lightning is
really just a style guide for pytorch um and the only difference is that most of the functions related to training validation and testing um they all live in the module class now so you don't have to write anything outside of the well mostly you have to write a lot less things um after the t5 module class so let me walk you guys through it so at the constructor stage um it's just pretty much similar to um your typical nn.module for pytorch so you have your hyperparameters as an input um you have your config files so for those not familiar so for hugging face transformers so each model has its own unique configuration um we also have our tokenizer so we'll set this as t5 later and we're also reading in the model so we use an auto model with lm head so this is just a class um that allows you to read any model type with a language modeling head um in this case we're going to be reading t5 later as well so we've read the model and finally we also have arguments related to the data set so max source length and max target length so from here you know so the initial things you need for any module we pretty much have it um the next standard thing we need in a module is the forward method so this um is basically defining how we want to be doing the forward pass um in this case since we already read the t5 model straight from hugging face transformers um the forward pass is as simple as a self.model um and the inputs are the inputs that we've seen earlier the input ids attention mask now we have decoder input ids and lm labels so lm labels these are our targets so this is how we do the forward pass so now that we know how to do a forward pass um the next step is data preparation so as i said earlier um in pytorch the end state of your data really is the data loader so what the get data loader function does is given any um t5 data set um so for the type of um data you have so the type path so in this case so let's say it's a
train uh or whether it's the test or the validation you read that in as a t5 data set this was defined earlier and after that you convert it into a data loader which as i've said earlier the only difference is now we have a specified batch size we have the collate function which does the processing for each batch and we can shuffle it if we want to and we return that data loader and it's as simple as that um just us generating the data loaders um for each of the data inputs that we have so whether it's our training set our validation set and our test set so this is for our training data loader so we're outputting the data loader it's just a bit longer for this one um simply because there are additional things that we have to specify um for training so in this case we have the scheduler as well so for our validation data set um it's the same thing we get a data loader for the validation set and then for the test data loader we get a data loader for the test set and there so that should pretty much allow us to handle our data the next step once you have the data and after you know how to do a forward pass is you'll want to configure your optimizer so the optimizer is really just um what dictates um how the weights are adjusted after you've calculated the gradients so in this case it's actually quite simple um we just read the model we specify which parameters to apply weight decay to um i won't talk about weight decay in detail but basically it's a form of regularization and we want to apply it for some of the parameters um and finally we instantiate adamw as our optimizer using the parameters using the learning rate and adam epsilon so from here we have the optimizer we save it into the object and we return it so from here we should already know how to update our weights given the calculated gradients so okay great so at this point um we're close to the end of the module but really the next step is really how we perform the forward passes for each of the
training steps um so there's a generic step method here so what it's basically doing is i mean i put in the docstring it just runs a forward pass and calculates the loss per batch um so we apply this for training step and validation step so as you can see um so we do some processing here we read the data from the batch um and then we run the forward pass at this line we get the outputs and from the outputs we get the loss right so essentially what's happening here is we run a forward pass and we calculate the loss and this step method we apply it for the training step so the training step is essentially like each iteration of the training loop so yeah so for each iteration of the training loop we use a step function so we calculate the loss and then we return the loss into this dictionary so um for pytorch lightning you don't really have to worry about what happens after this dictionary um as long as it's formatted in this way it gets handled like automatically later um so there so now that we know how to do the training step um we also have the optimizer step this is um literally the step where for each um batch after we run the forward pass we do an optimizer.step um which is where you know we adjust the weights based on the calculated gradients um we also do a dot zero grad just to refresh the gradients and there we also do a step we update the learning rate scheduler so that's what happens at each optimizer step so there so now we already know how to do a forward pass for each batch we know how to calculate the loss and based on the loss we've calculated the gradients and then finally we've also adjusted our weights um based on the calculated gradients based on our specified optimizer um so from here like it gets pretty similar so we apply the same step kind of approach to the validation um so really what's happening for the validation step is we're calculating the forward pass and um so we're running the forward
pass and calculating the loss it's quite similar to earlier except now we're also calculating the jaccard um and to do that we have to do a prediction that generates actual text um but yeah so that's the only difference with the validation step we're calculating an actual jaccard score um and finally in the validation end so after all the batches are finished what we're basically doing is we're getting all of the losses for each of the batches so we're getting a validation loss we're getting the jaccard score for each of the validation batches and we're averaging those out and so that for this epoch or at the end of training this is what we consider as um you know the validation jaccard score and the validation loss so there so um i won't go into detail with test it's still quite similar except for the test case um we're literally just running predictions we're not calculating loss because um as most of you would know in kaggle we uh we don't have the targets um for the test set so there so yes we do a forward pass on the test set and we um save the predictions into a file after so that's pretty much it for so after this you should probably notice that most of the steps in the module should pretty much be dealt with already so the remaining functions in the t5 module um they're really just extra methods um some additional configuration things so for here tqdm dict it's just telling us oh what are the things that we want to be printing um and a bunch of other stuff so here we have another method called add model specific arguments but here it's really just this is specifying what are the arguments we're going to be using as input later so we have the model name so we're going to set this to t5 later the config we're going to set t5 tokenizer as well um you know you see these learning rate decay weight decay so a bunch of these hyperparameters that we could be setting later before we start training um and lastly 
we also have arguments related more to you know the generic aspects of training so you know whether it's the directory of your output you know the number of gpus you use etc so yeah those are just some specifics there so from here um it becomes really simple um so in pytorch lightning there's something called a trainer so the idea of a trainer is given the module that we've um specified above that we've defined it basically glues all of those steps together so that it trains automatically so in the generic train function over here it just takes the t5 module as an input and it sets the parameters and then it instantiates the trainer using the training parameters that are used as an input right and from the instantiated trainer you can now fit the model just with a simple trainer.fit(model) and that should start running the model um so there so now that we've done that we know how to train a model we've specified the module in a comprehensive way we can now fine-tune the model um so for fine-tuning i just have this main function where everything is um abstracted at a really high level so what's really happening is given the arguments that you use as an input you instantiate the t5 module so now we have the model um and then we train that model by using it as an argument into the generic train uh which is the function we defined above um and then after we train it we save the pre-trained model and we also save the tokenizer um and then if you want to predict we can also predict on the test set so that's pretty much it so here i'm just running the main code and so this is how the output would look um this is really how it would look for well one of the cool things about pytorch lightning is you can actually have a sneak peek of like how the architecture looks um for t5 just by scrolling down it's going to be kind of long but yeah these are all of the layers it's really a lot of the layers in your t5 yeah so you see you 
know for each epoch there's a loading screen and you know we see the losses for each step um but then why look at logs when you can use charts right so um okay maybe just zoom in a bit so something you can easily do now is you can use a tool like tensorboard so you can use it to start observing you know how um the scores progress over time so you see here the jaccard score going up um you also see the training loss go down although like sort of sporadically um and you'll see how the validation loss goes down for each of the epochs so yeah so given that um so that's it pretty much in terms of training um so now that we've trained the model uh i've actually um saved it in a gcs bucket um for safekeeping in case we want to do a forward pass later and so here i'm just going to show you guys like how inference would look for the t5 model that we fine-tuned on sentiment span extraction so the cool thing is that if we want to do a forward pass you only have to um technically import two things from the transformers package it's just T5Tokenizer and T5ForConditionalGeneration um and so here we're just reading the tokenizer so we have the T5Tokenizer over here and we also um since we have the pre-trained t5 model we can just do a T5ForConditionalGeneration dot from_pretrained in the current directory and now we have the t5 model as a T5ForConditionalGeneration object so here i just define a simple function called getspan and so the idea of this function is that you input text in the format that we specified above in the q a format and what this does is it just spits out the predicted span so literally what we're trying to do with the kaggle task so okay so i have a few examples here with the sample answers already so we can go through a few of them so for example here so you'll see um so we run the getspan function and the input is question negative so the sentiment of the tweet is negative and the context is i'm in va for the weekend my 
youngest son turns two tomorrow it makes me kind of sad he is getting so big check out my twipics so yeah so and then the output is it makes me kind of sad right so it actually makes sense so if the sentiment is negative it makes me kind of sad is actually it's a pretty reasonable um answer like i would have probably guessed that as well and so just going through a few of them again so this one recession hit she has to quit her company such a shame and the answer is so the prediction is such a shame um maybe you can look at the last two so in this case the question is i'm given that it's positive so what's the span within on the monday so i won't be able to be with you i love you so it actually outputs i love you over here so it's the same thing for the last span so i liked it did you record it yourself so you have a very soothing voice um the predicted span is i like it so yeah so here you can see it's working so um that's really exciting when i saw it working um and so just sharing some of the key results with you guys you guys are probably wondering oh how is it doing in terms of the leaderboard so actually currently if you um use this approach and you um put it in the public leaderboard in kaggle it's giving you a 0.665 jaccard score um which is still kind of it's not bad um but it's still kind of far from the top 10 as of when i was writing this which is 0.714 um but i think it's still a reasonable score the fact that we're getting it kind of close and remember it's a generation problem and um also remembering that we haven't done any post training optimization here we haven't been doing any ensembling any stacking any of that it's just a basic like if you do 5 epochs and get the last model this is the accuracy you're going to get so really the amazing thing for me i wrote this here is the confirmation that a generative model like t5 can perform extractive tasks with an accuracy comparable to a token classification 
version of bert so which is most of the solutions in the kaggle kernels now so yeah so um but then with that said i'm still confident that t5 can reach leaderboard level results uh with more experiments i think there's a lot to be done you know you could remember this is with t5 base you can always use the bigger version if you have a gpu that can fit the results um you know there's a lot of post-processing stuff you can do um you can also experiment with different kinds of specifications what if you don't do q a what if you do something else so yeah um so that's pretty much my presentation i hope you guys learned a bit um i really enjoyed creating this and yeah best of luck with you guys if you're trying to use t5 for your own use case thanks so much sorry i have a check okay thank you very much lorenzo it was very informative and uh i'm sorry about uh the technical issues um yeah it happens from time to time it's all good yeah it happens but i don't want to take a lot of time from you but there are a lot of questions so one of the questions is how much time or memory does one save for fine tuning um how much time and memory does one save from fine tuning so um so i think there are two dimensions to that so if you're talking about time in terms of fine tuning um so what's the situation that we're comparing this to so let's say in one situation we use t5 and train five tasks on it and you know so that takes as long as you know we do some experimentation and that takes as long um as it has to take for those tasks but then if you have to train a separate model for um you know you have to specify you need to code up a separate architecture for each of those five tasks and then you have to train them um yeah it could be significantly longer because imagine you don't just train something once right you apply experimentation you do some sort of hyperparameter search so i mean theoretically it's like really rough 
estimate but um you could say that if you're training on five tasks like theoretically doing a multi-task approach using a model like t5 you could do it in a fifth of the time that's probably a naive guess maybe it's more like half but yeah i mean i expect that you know it is like significantly uh lower in terms of space it's the same argument because if you have five tasks on one t5 model um you know and assuming that you have five models and you're doing five separate tasks and just assuming that you know the size of the model in your machine is the same then you've technically cut the amount of storage that you're using um you're dividing it by five because now t5 can do five tasks so i think that was my point earlier about you know saving up in memory and training yeah that's very good to know and um there's more questions um so every time we use t5 we have to format it in a question and answering format right no not everything i mean um i used it for this specific case because t5 was already pre-trained on a question answering data set and it was specified in that format so um if you're going to use it for uh a span extraction task the q and a um formatting makes sense um but that doesn't mean that you have to do it you can experiment with other approaches as well um also remember that if you read the t5 paper um specifically i actually have the t5 paper now but if you guys want to look at the t5 paper i think it's appendix d in appendix d it shows i think all of the tasks that um the supervised tasks that t5 was already trained on so you can use them um out of the box so one of them is q a you can also do cola you know if the grammar makes sense he goes boost summarization yeah okay uh and uh another question let me uh you're a bit quiet um a bit more okay that much better okay so i don't think the audience hears anything at all give me a second 
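As a rough illustration of the question-and-answer formatting discussed in this answer, the sketch below packs a sentiment label and a tweet into a single input string. The function name build_qa_input and the exact "question: ... context: ..." prefixes are assumptions modeled on the SQuAD-style format; the notebook shown in the talk may word them differently.

```python
def build_qa_input(sentiment: str, tweet: str) -> str:
    """Pack a sentiment label and a tweet into one SQuAD-style prompt.

    Assumption: the "question:" / "context:" prefixes mirror the format
    T5 saw for SQuAD; the talk's own code may use different wording.
    """
    # Collapse runs of whitespace so the prompt is a single clean line.
    clean_tweet = " ".join(tweet.split())
    # Normalize the label so "Negative " and "negative" give the same prompt.
    return f"question: {sentiment.strip().lower()} context: {clean_tweet}"


if __name__ == "__main__":
    print(build_qa_input("negative", "recession hit she has to quit her company such a shame"))
```

Keeping the formatting in one small function makes it easy to try other prompt wordings later, which is the kind of experiment suggested in the answer above.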
uh were they hearing me um no yeah they can i thought i'd have to repeat those answers hello yeah i should be audible now yeah okay uh the next question uh i have had so many technical problems today i don't know so what kind of sentencepiece model what kind of tokenizer does it use oh that's a good question um actually i'll have to double check on that but i think from what i remember yeah i'll have to double check like the specific tokenizer um but i think you're probably using yeah yeah i think so probably it's sentencepiece because i'm also not sure about it i also need to take a look yeah one person has asked that at the beginning you mentioned that it's a text to text model but we still have to encode and decode the text using a tokenizer so what do you mean by that phrase um so what we mean is that what we're predicting is always um text it always gets decoded into text so for example when you have a language model i mean sure the output is going to be the ids it's going to be the tokenized form of the string but you always can convert it back into a text format um so in contrast so let's say when you have a classification task with bert so let's say you're doing a sentiment classification task with bert so in a regular model the output would typically have some sort of fully connected layer and the output is a binary output so well it's a score right in between zero and 1. 
so the output is not text the output is a number in this case it's a float or an integer at most but for t5 if you were to train t5 for a sentiment analysis task it would generate literally the word positive and the word negative so in that sense it's text to text because the input would be so for example a tweet and the output would be the word positive or the word negative i hope that makes sense uh yeah it does i had a question like how did you get i mean i had a similar question so how did you get the idea of using question answering for span extraction so how did you get oh um i got the idea because question answering in the way that the stanford question answering data set defines it is extractive in nature so you have a context paragraph and um you're trying to get the subspan within the context paragraph that has the answer to the question so my thought was since t5 is already pre-trained on the squad data set if you use the same format as the squad data set you can probably activate some of the neurons that already know okay i have to extract the span so that was my idea that by using the same format as squad you would be reusing something that it's already learned which is i know how to extract something from the text input if that makes sense like in this model uh is it generating new words which it has not seen so like when we are talking about the test data so do we have new words being generated because it's not just span selection right it's just generating a lot of words correct how do you take care of that oh that's a good point um so personally in the cases that i've seen i haven't seen um glaring issues about it um like when i was doing the decoding the issues i found weren't really with weird tokens coming out or like with extra words it was more of with the tokenization because sometimes when you encode and then decode sometimes 
like a period a space period gets misplaced but then interestingly that's what i find super interesting it's just generating text but it seems to really understand that it has to expand uh it has to extract the span and that's what i found really profound because i think that's how humans think like when someone asks you a question you generate text in the form of thought and you translate it with words but and like i thought that was analogous to how t5 was working so yeah so i mean and if you read the paper they had a line about this like uh they had something and i'm paraphrasing here they had a part where they were saying that you would naturally expect that there's a chance that it will generate words that you wouldn't expect so for example if it's a sentiment analysis it should be positive or negative but what if it generates dog like out of nowhere and according to them they haven't seen a case where such a weird occurrence would happen so i thought that was really interesting um yeah yeah okay that sounds good i think uh we have we have too many technical problems today i'll have to take a last last try okay um probably you can repeat the question after me so so the question is what kind of uh approach would you suggest if one would like to apply model explainability or interpretability to a big model like this one can you repeat that is an interest what do i so the question is um what would be a recommended approach to um doing model interpretability or model explainability with a model like t5 so that's actually a good question um i wish i had like i mean there is a standard answer to this if for everyone who's worked with transformer models um and there are more i think there are more cutting edge answers i'll give you the standard one um which is you could look at the attention um attention matrices and see like which words are are being looked at more um honestly like i mean it is the simple answer it's a standard answer but you know it actually has 
worked for me i think there are tools that make it easier to visualize like which words are being paid attention to more um but yeah i'm pretty sure there are more approaches now i i think like transformer um explainability is like a big a big thing now so but yeah that's what i will do okay so um um can you still hear me fine okay so i think the audio is fine back up i i don't know what happened but anyways uh so i mean there have been a lot of questions and questions are still coming so probably when you have time you can go to the youtube chat and take a look and try to answer them in comme
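For reference, the Jaccard score that the talk computes during validation can be sketched in a few lines of plain Python. This is a minimal word-level version in the style of the Kaggle competition metric, not the speaker's actual code. The function names, the handling of two empty spans, and the per-example (rather than per-batch) averaging are assumptions made for illustration.

```python
from statistics import mean
from typing import List


def jaccard(predicted: str, target: str) -> float:
    """Word-level Jaccard similarity between a predicted and a target span."""
    a = set(predicted.lower().split())
    b = set(target.lower().split())
    if not a and not b:
        # Assumption: two empty spans are treated as a perfect match.
        return 1.0
    overlap = a & b
    # |A ∩ B| / |A ∪ B|, with the union written as |A| + |B| - |A ∩ B|.
    return len(overlap) / (len(a) + len(b) - len(overlap))


def average_jaccard(predictions: List[str], targets: List[str]) -> float:
    """Mean Jaccard over paired predictions and targets (per example)."""
    return mean(jaccard(p, t) for p, t in zip(predictions, targets))


if __name__ == "__main__":
    # Hypothetical predictions and targets, for illustration only.
    print(average_jaccard(["such a shame", "i love you"], ["such a shame", "love you"]))
```

Because the score works on sets of words, small tokenization slips such as a misplaced space before a period can lower it even when the predicted span is otherwise right.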
Original Description
This is Episode 3 of Talks Series and please note that it is one hour before the normal time for Talks :)
Title: Introduction to T5 for Sentiment Span Extraction
Abstract: T5 is a recently released encoder-decoder model that reaches SOTA results by solving NLP problems with a text-to-text approach. This is where text is used as both an input and an output for solving all types of tasks. I believe that the combination of text-to-text as a universal format for NLP tasks paired with multi-task learning (single model learning multiple tasks) will have a huge impact on how NLP deep learning is applied in practice. In this presentation I aim to give a brief overview of #T5, explain some of its implications for NLP in industry, and demonstrate how it can be used for sentiment span extraction on tweets.
Speaker Bio:
Lorenzo Ampil is a Machine Learning Product Manager and Data Scientist at Thinking Machines, a global AI consulting firm w/ operations in Singapore and Manila. He specializes in developing products that utilize deep learning and machine learning on NLP for various industries. Prior to this, he set up his own consulting practice where he provided end-to-end data science solutions for finance and tech companies in Southeast Asia and Australia. He also previously worked at Uber as an analyst, where he handled projects related to NLP, analytics, and automation for the APAC region’s community operations.
------
If you want to be a speaker and talk about your #MachineLearning #DeepLearning Projects, then please fill out this form: https://bit.ly/AbhishekTalks
Follow me on:
Twitter: https://twitter.com/abhi1thakur
LinkedIn: https://www.linkedin.com/in/abhi1thakur/
Kaggle: https://kaggle.com/abhishek