The Future of Natural Language Processing

HuggingFace · Advanced · Research Papers Explained · 6y ago

Key Takeaways

The video discusses the future of Natural Language Processing (NLP), focusing on transfer learning, model size, and computational efficiency, as well as current trends, limits, and future directions in NLP research.

Full Transcript

Hi everyone, I'm Thomas Wolf from Hugging Face, and today we're going to talk about an exciting topic: the future of NLP. Well, more precisely, the future of transfer learning in NLP. To be honest, this talk is a personal walk through some of my favorite papers and research directions of the last few months, so I really hope you enjoy it as much as I do.

We're going to talk about a lot of things. We'll start with model size and data requirements, then we'll talk about in-domain versus out-of-domain generalization. We'll move on to fine-tuning and model evaluation, what the problems are and what the limits of these things are, and then we'll end up discussing common sense and inductive biases.

Okay, so let's start with the elephant in the room. You've probably noticed that the models are getting bigger and bigger. There is a nice graph that Victor Sanh made last year which shows how these models are just getting crazy bigger; it's increasing exponentially. The state-of-the-art models are now over 1 billion parameters, and we're actually far beyond that, because several models are now around 10 billion parameters, like T5 and the Turing model from Microsoft. You have a huge problem with these models because they don't even fit on one GPU, not even on two GPUs: you need something like four to eight GPUs just to load the model and run it with a batch size of one.

Now why is that a problem? Well, it's a huge problem because if you check the current leaderboards, for instance the GLUE leaderboard that you can see here, you can see that the competition is narrowing. It's all the same teams now: Google, Microsoft, Amazon, Baidu, Facebook, and that's pretty much it. You see there are no academics there, because these models are just too big and the computational requirements are too big. So there's a huge problem of diversity, and also of where academia fits in current NLP research.

Another problem that you've probably seen is the environmental cost of training these models.
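As a rough back-of-the-envelope illustration of the earlier point about models not fitting on one GPU, the sketch below estimates the memory needed just to store the weights. The 4 bytes per parameter (float32) and the helper name `weight_memory_gb` are assumptions made for illustration; the estimate ignores activations, gradients and optimizer state, which all add to the real footprint.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumes every parameter is stored with the same precision
    (4 bytes for float32, 2 bytes for float16).
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Illustrative parameter counts matching the orders of magnitude in the talk.
    for label, n in [("1 billion parameters", 1e9), ("10 billion parameters", 10e9)]:
        print(f"{label}: {weight_memory_gb(n):.0f} GB of float32 weights")
```

Under these assumptions, a 10-billion-parameter model needs about 40 GB for its float32 weights alone, which is more than a single accelerator with a few tens of gigabytes of memory can hold.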
These models require a lot of energy, and consuming energy generates carbon emissions, so we know that training these models isn't good for the environment. What can we do? And the last problem, which has been discussed very well elsewhere, is this: if we just go bigger, what do we expect? Do we expect to see a phase transition at some point, or is it just like building bigger ladders to try to reach the moon?

Now there is another option, which is to go the other way. We've known since this very nice paper by Yann LeCun, I really like the title, "Optimal Brain Damage", in 1989, that neural nets are over-parameterized. They have too many weights, and we can just prune them. The most recent example is the lottery ticket hypothesis, which says that if you take a randomly initialized model, you can actually find a subset inside of this model which already has good performance on your test task. You don't even need to train your model: you can just take this big model and find a small sub-network inside of it that's already good for your task. And we see that with fine-tuned models as well: when you fine-tune a model, you can remove weights, you can select the weights, and you keep the performance. Here you see this example on NLP tasks, where you can actually remove 90 percent of the weights and keep the same performance.

So we want to push in this direction. Here is a small promotion: we're running a competition that started two days ago, which is about getting the most efficient models that you can. It's called SustaiNLP; it's a workshop that will be co-located with EMNLP at the end of the year. The goal is to get to the same performance as a current state-of-the-art model, like BERT-base for instance, and to try to be the most energy efficient that you can. The competition is only on inference for now, because inference is actually one of the biggest parts when you look at the
lifetime cost of a model, the lifetime computational cost. When these models are deployed in applications on thousands of servers, inference cost is actually the biggest part of their lifetime environmental cost. So if we can get better on inference, we're already a long way toward the goal of getting more efficient models.

Now how can you reduce the size of the models? Let's go a bit into that. There are mostly three techniques that you can use: the first one is called distillation, the second one is pruning, and then there is quantization.

So, distillation. Here is a good example: we made a model called DistilBERT at the end of last year, which gets about 95 percent of BERT's performance on GLUE and is 40% smaller. How do you do that? Well, you take BERT as a teacher model, which means you have a pretrained BERT, and then you train a student model, which is smaller, and you train the student to reproduce the generalization capabilities of the teacher.

Here is a good example. You see this sentence, "I think this is the beginning of a beautiful [MASK]", and the model is asked to complete it. BERT is trained like that: BERT is trained to predict masked tokens. Here you see the top predictions of BERT, and you see they all make sense: "day", "life", "future", "story". All these top predictions that BERT thinks are possible make sense. This is what we call generalization: the BERT model has learned to generalize. It went beyond just a single training example for this sentence. What we do is train the student model to generalize in the same way as the teacher model, so it will learn the inductive biases that the teacher model has learned. It's very easy: we just use a cross-entropy. This is called knowledge distillation, and we use a cross-entropy between the output of our student and the output of our teacher. When we train, you can use a temperature to emphasize the lower
probabilities; that's a very common trick in NLP.

A lot of people have been publishing on distillation since the end of last year, and you can see a couple of papers here. The state-of-the-art distillation models are now quite complex. TinyBERT would be an example, where the student model has a smaller hidden-state size, and they try to make the student model also mimic the hidden states of the teacher, so you have a projection between the teacher's and the student's hidden states. They also use a lot of data augmentation, so it's kind of tricky to know exactly what part of the good performance of these latest models comes from data augmentation and what comes from distillation. But people can definitely get very small models with good performance using a mix of distillation and data augmentation.

Now let's move on to the second technique you can use to reduce the size of the model, which is called pruning. In pruning you work directly on your model, and you actually remove weights from this model to make it smaller. There are various ways you can prune, and one simple way is to remove attention heads in your transformer. This was shown in two nice papers last year, one by Elena Voita from the University of Edinburgh and the other by Paul Michel at CMU. They show that you can actually remove a lot of the heads of a transformer model after it has been trained and keep very good performance. At the top you see the results on translation: you can remove 90 percent of your heads and keep a very good BLEU score. At the bottom you see the results on GLUE, which is the General Language Understanding benchmark, and you see pretty much the same behavior.

One way you can identify the heads you should remove is by using what Paul Michel and his co-authors call the head importance score, which is actually the gradient of the loss with respect
to the output of the attention layer. If you remove the least important heads first, you can keep this very good performance. What is interesting, as you see on the other graph here, is that if you remove heads that are less important for one task, here it's on MNLI, you can see that this is quite resilient to domain adaptation. In MNLI you have two parts of the dataset: what is called the matched set and the mismatched set. If you remove heads that are not useful for one, you can see that this is correlated with the other. Well, the graph is not exactly linear, but there is some correlation with the heads that are not important on the other domain. This means there are some heads that are actually not useful for anything, at least on MNLI. That's interesting, because it means this is quite resilient to domain adaptation.

Now you can also directly remove the weights. When you remove weights, it's more fine-grained, because each specific weight can be removed, but the problem is that you end up with very sparse matrices that are not so good for GPUs. We'll talk about that later. But you can also get very good performance. Here is a nice paper from the ASAPP team, with Jeremy Wohlwend and Ziheng Wang, which shows that this works with pretrained models as well, removing weights with a nice differentiable L0 pruning. They use a hard concrete distribution, which is basically the Gumbel-softmax trick.

The last part is layer pruning. On layer pruning there is a nice paper by Angela Fan from last year at Facebook, where they actually remove full layers of the transformer. This is really a lot: you remove a full layer. The way you can do that and have the model still behave quite well is by training the model to be resilient to it. So during pretraining you will
randomly remove weights, remove layers, sorry, like a dropout. It's a structured dropout applied to layers, and so the model learns to behave well without some of its layers. It works well because these transformer layers are a repetition of the same module, and you have this residual connection, this shortcut connection, which means that one layer and the next one are always connected through a shortcut as well. So when you remove a layer, it's actually less aggressive than in, say, fully connected models without shortcut connections. So layer pruning is very interesting as well, and you keep dense matrices, because you remove full blocks of weights.

So why am I talking about this problem of sparsity? Well, because we run all these models on GPUs or TPUs, and these chips are really optimized for dense matrix multiplication. They have trouble with sparsity, and when you run these sparse models on a GPU or TPU, they're usually way slower; it can be three to four times slower to run. So they're smaller indeed, but they're also a lot slower, and that's not efficient either, so you're losing what you were actually looking for, which was energy efficiency.

There are various ways you can try to circumvent this. One way is to use what OpenAI was promoting, which is block sparsity. Instead of removing single weights, you remove blocks of weights, and these blocks have a size that is adapted to your GPU or TPU kernels, which means that you keep dense matrix multiplication. Well, you remove blocks, but actually when you have strong sparsity it means you just keep blocks: your matrix is just a few blocks, as you can see here. So this helps. Another approach is to have full sparsity but with patterns that you control, so you can keep the advantages of optimized CUDA kernels. Now, the more you structure the sparsity, usually the less performance you
can get, because you're actually constraining the model. So with unstructured sparsity you can usually keep the best performance on all the metrics, but you lose the efficiency; and the more you structure the sparsity, the better your energy efficiency is, and usually the worse your performance is. Another alternative is to switch chips and try, for instance, the new IPU from Graphcore, which are chips that are specifically designed for sparse models. They are made of a lot of small modules that can process data independently, each with a small amount of RAM associated with it, and they can process sparse matrices very efficiently.

Now, the last technique I want to talk about for shrinking models is quantization. Quantization is also very interesting. We know that using float32, using full-precision float weights, is actually not optimal. We know that these neural networks also work well with half precision and even with quantized integers, so we can do that for our transformers as well. You convert the float32 full-precision weights into integers, so you really reduce the size of the model a lot. You can use dynamic quantization, for instance, where you have a scaling and zero-point conversion, and this works very well. There was nice work by Intel called Q8BERT, and it's really working well. You can try it: it's very easy in PyTorch as well, it's very easy to apply quantization. And a bit like with layer pruning, you can do quantization-aware training: you can tell your model that it's going to be quantized at the end, and you train it in a way that it gets used to being quantized, so you have better performance at the end.

Okay, we've talked a lot about these big models and how to reduce their size. Now there is another thing that has been increasing exponentially recently in NLP, which is the requirement for more data. People are using more data for training, and people are also using more
data for fine-tuning. So there's a problem, because when you compare two models that were pretrained on two datasets of very different sizes, it's really hard to tell whether one model is better because it was pretrained on more data, or whether it's better because of the novel architectural design that people usually introduce.

A good recent example from last year was XLNet, the transformer from Google that was the successor of Transformer-XL. XLNet used a smart autoregressive training, so you could actually do autoregressive training while having the possibility to attend to both contexts, the left and the right context. Usually when we do autoregressive training, it goes in only one direction, so the model is masked: the right context of each token is masked. But in XLNet they do autoregressive training with a random permutation, so the model learns to pay attention to both contexts. Now the problem is that XLNet was also trained on a lot more data than BERT. So when they compared to BERT, it was really hard to tell what the difference was: which improvement came from training on 20 times more data, and which improvement came from having this new autoregressive architecture.

So there was a huge debate, and it was kind of settled by RoBERTa, which is a very simple BERT architecture, exactly the same as BERT but basically just trained on more data. RoBERTa outperformed XLNet, which showed that this was basically the bitter lesson of NLP, and the bitter lesson of machine learning in general, as Rich Sutton talks about it: having more data usually outperforms having a smarter model.

And now there is this recent paper that we're going to talk a lot more about, which is called "Scaling Laws for Neural Language Models". This is a paper from OpenAI, and it's a really in-depth study of what happens when you increase the data size and when you increase the model
size, without changing the architecture. So it's a very good study.

Now this is for pretraining, but we see the same in fine-tuning: when people fine-tune, they do a lot of data augmentation. A good example is the Winograd Schema Challenge. The Winograd Schema Challenge has been a very interesting dataset for a long time. It's very simple. You see one example here: you have a sentence that says, for instance, "The trophy would not fit in the brown suitcase because it was too big", and the question is what was too big: was it the trophy or was it the suitcase? The model has to do a classification between these two. That's very interesting because you need some common sense: you need to know that a suitcase is usually bigger than a trophy.

For a long time it was a very hard challenge. It's a very small dataset, with only about 300 examples, and for a long time it was very hard to get good performance with deep learning models on it. The way it was solved was to generate artificial data-augmentation datasets with some heuristics: extracting from Wikipedia sentences where the same noun appears twice, like "trophy" twice, and replacing one of them by "it". With these heuristics you can build a huge dataset from any crawled text corpus, you can pretrain your model on that, then you do fine-tuning on the Winograd Schema Challenge after that, and you can solve the task. But you can see that we're not really happy about that, because scientifically we have not really learned anything about common sense by doing it. We've just learned that more data is better.

So let's talk a bit more about pretraining first. I talked about the scaling laws paper, so let's go into depth on it. This paper is about one single architecture: the transformer trained for autoregressive language modeling. So you only have the left context for each
token, and it's a transformer trying to predict the next token given the beginning of a sentence; this is like GPT-2. But they experiment with many model sizes and many dataset sizes, and they also did a nice scan of the architecture. There was always a question about transformers: what is the optimal ratio of the number of heads with respect to the model size, and what is the optimal ratio of the number of layers with respect to the dimension of the model? They show that all of this doesn't really matter as long as you're in this very flat sweet spot, which is pretty much the original "Attention Is All You Need" hyperparameters. As long as you're in this sweet spot, you're good. So these models are actually very robust to this simple hyperparameter exploration.

What they show is that just by scaling model size and scaling dataset size, you get a very clear power law, which means it's a straight line on a log-log plot. So every time you double your model size you get the same relative improvement in the loss, and every time you double your dataset size you also get this same kind of improvement. And these power laws hold over very wide ranges, over several orders of magnitude.

You can read this paper, it's very interesting. They show two interesting things for me. One, and there was actually a follow-up paper with Eric Wallace at UC Berkeley which shows this, is that it's actually better to have a model that is too big. It's better for your model to be bigger than we used to think for a given dataset. So if your model is slightly too big for your dataset, compared with how we used to size datasets and models before, you can actually get better results: your loss goes down faster. And there is another interesting thing in this paper, which is also something we saw a little bit
earlier with pruning, which is that in these transformer models the embeddings and the layers behave really differently. So when you fine-tune, when you prune, you should treat embeddings and layers differently. Here they show that the capacity of the model is really defined by the layers, and all the power laws they observe work well if you don't take the embeddings into account when you compute the size of the model.

Now there is a last very interesting thing, which is that they have two laws for the decreasing loss. One law relates the decrease of the loss to increasing the capacity of the model, and one relates it to increasing the dataset. Both can be related together in terms of computation: more data means more computation, and a bigger model also means more computation, so you can connect these two power laws. What you see on this graph is that you actually have two slopes, which means that at some point they cross, and you don't really know which law you should follow. You have one law which is defined by using the optimal dataset size and one law which is defined by using the optimal model capacity, and at some point their predictions don't fit together. This happens at a scale which is far above what we've been experimenting with right now; it's around the trillion
parameter regime, and here what they say is that maybe the architecture, the transformer architecture, is breaking down. That's what they suggest.

So all this exploration of more data and bigger models is actually related to one idea. The idea is that maybe there will be a qualitative jump in behavior if we get enough data: maybe just getting more data is enough to see a qualitative change, like a phase transition, in how the model behaves. There are some hints of this. It's quite an interesting idea, and I think it's very controversial, because, as I was saying, more data and bigger models: is this really a research program, right?

There's this nice paper from AI2, from Alon Talmor and people at AI2 Israel, and they show that just by comparing BERT and RoBERTa you can observe this kind of phase transition. Comparing BERT and RoBERTa is interesting because they have the same architecture; they're exactly the same model, except that BERT was trained on only 137 billion tokens or so and RoBERTa was trained on 2.2 trillion tokens. So RoBERTa was really trained on a lot more data.

What is very interesting here is that they do zero-shot evaluation: you just take the pretrained model, you don't fine-tune it, and you ask it questions that are a bit like the Winograd Schema Challenge questions. Here you ask it, "A 25-year-old person's age is [MASK] than a 30-year-old person's", and the model has to predict whether it's "younger" or "older". So it has to actually compare numbers together and use some common sense, if you want. You can see that BERT is pretty bad, BERT is the blue curve, and RoBERTa, which is the green curve, is actually super good at comparing these numbers in the range of ages for people. You can see the same on size comparison: if you ask RoBERTa to compare the size of, say, the sun to a table or to a house, RoBERTa is usually pretty good out of the box. So it has some form of what we would
call common sense, and this is out of the box, just from pretraining. It's even able to compare birth years: if you ask whether somebody born in this year or in that year is older, it's the reverse of age. The higher your birth year, the younger you are, and the model is able to do this swap. That's actually very surprising, I think.

Now there is this big question when you do fine-tuning. We've seen that for pretraining, more data is just better, and you may even see some phase transition. What about fine-tuning? Fine-tuning means you've taken these models pretrained on big datasets and now you want to adapt them to one task. This paper is a very important one from DeepMind, "Learning and Evaluating General Linguistic Intelligence". It's a paper that poses a lot of questions; it's an opinion paper, and you should definitely read it. I think it's one of the most important papers of last year; it's now about one year old.

What they say is that the recent datasets are actually too easy to solve with little generalization. Why? Because the training sets we have for the models are often quite big, like MNLI or SNLI or SQuAD. They're really big datasets to fine-tune on, and they give models that don't really have good sample efficiency. So let's focus on that: what does it mean? Let's say we have two models, Model A and Model B. Model A has 90 percent accuracy with just a hundred training examples, but then it doesn't get any better with more training examples; it plateaus at 90 percent. Model B takes one million examples to get to 90 percent accuracy, but then it can increase a little bit, and it ends up plateauing at 92 percent. So if we do what we usually do, comparing the models at the end of fine-tuning, we will say, "Oh, Model B
is better because it can reach 92 percent accuracy." Well, actually we should really want Model A, because Model A is able to reach a very good score with just 100 training examples. That's really great; that's what we wanted from transfer learning. One of the initial goals of transfer learning was to make these models work on very small datasets. This is called sample efficiency: it means how much better your model gets with one additional example.

There are a lot of other problems with these models, which are related. One problem is that when you fine-tune on these big datasets, you usually get models that work well exactly on the training and fine-tuning domain. So you have models that work well on SQuAD, for instance, which means they work well on Wikipedia question answering, in this very narrow field of question answering. But we don't really want SQuAD models; we would like to have question answering models that work on any question answering task. This is related to sample efficiency, because it means that if you just give a few general question answering examples, you would like your model to already work well on them. You would like the model not to need to fine-tune on all of Wikipedia just to answer questions on Wikipedia.

There is a related metric called online code length, which says how much better your model gets with each additional sample. It's an information-theoretic metric which is related to how much you can compress the data with your model. So it's a very important metric, and it's probably the way to go forward.

Here are just a few examples, and you can see the benefits of transfer learning first. At the bottom you see a BERT that's trained from scratch on question answering, so this BERT is not pretrained; it's randomly initialized. So it's pretty bad: at the end, you just
train on the full SQuAD dataset and you just don't get very high. Now you can see that if BERT is already pretrained with its usual pretraining, which is on the Toronto Book Corpus and Wikipedia, it goes up a lot faster. This is the benefit of transfer learning, and it reaches a nice accuracy. And now if you look at the last curve, which is a BERT that was pretrained on other question answering datasets, you can see that it already starts very high. It means that this model is very sample efficient, because it was already fine-tuned on another question answering dataset before.

When you look at these models with the online code length metric, you can see that they're actually very different. You have to understand that the BERT model was pretrained, and to fine-tune it we add a linear layer on top, and this linear layer is randomly initialized. So there is no shortcut here: you have to train this linear layer, and you cannot really bypass this. When you use this model with a task-specific layer added on top, you need to train this task-specific layer, which means that you can't really do zero-shot. These models always have to catch up somehow; they always need a few examples to be able to train this last layer, however small it is. This is what we call task-specific components, and that's a very strong limit on how sample efficient a model can be.

Now, this was just to show you that when you investigate sample efficiency, you can see that it's also a good way to tell whether the model is actually learning the task using the knowledge it had from before, or whether it's learning the task from scratch. Here you see a comparison between BERT and RoBERTa. As we saw in this blue and green diagram, RoBERTa has some kind of good common sense, better than BERT. And we can see that because when we fine-tune
RoBERTa, it's a lot more sample efficient: with just a few examples it's already catching up, it's already getting better metrics than BERT. By comparing the sample efficiency curves, which show the performance of your model as you use progressively more samples to fine-tune it, just by comparing the curves for BERT and RoBERTa, I think you can get a good idea of how much your pretraining was helping to get good performance on your target task. So this paper is also a very nice investigation of that, and it poses a nice question of how much data we should need.

This leads us to the next topic, which is in-domain versus out-of-domain. What we would like in general is out-of-domain generalization; what we usually have is in-domain generalization. What does that mean? Let's have a look. We've trained our model on question answering datasets, and now we are experimenting in real life, where question answering is different: the domain is different, the language people use is not Wikipedia language, and we see there is a strong performance drop because our model is not really capable of out-of-domain generalization.

Here is another nice example, in this paper by Tom McCoy. They show that if you train BERT and it has good performance on your fine-tuning dataset, like in GLUE, you can then test it on another dataset which is out of domain. The way they built the out-of-domain set here is with some heuristics, for instance lexical overlap heuristics. In MNLI you have two sentences and you have to say whether one entails the other or contradicts the other. There are very simple heuristics that work in this dataset: usually, if there is a "not", it means contradiction; if there is a lot of lexical overlap, it usually means entailment. So they built an adversarial dataset called HANS, which is in the transformers library, actually. You can use it;
we have an example for it. It's adversarial: in the HANS dataset, when there is a lot of lexical overlap, the correct label is not entailment. Here is an example. What they show is that you can train several BERTs, fine-tuning them with different random seeds. So the difference between these models is very small; it's just the weight initialization of the last layer. These models behave similarly on MNLI, with very similar performance, but when you test them on the adversarial HANS dataset they behave really differently. There is this huge variability: some of them are pretty good, well, none of them is really great, and some of them are really bad. This means that the in-domain test performance gives you no indication of how your model will behave in the real world, which is kind of bad.

Here are more examples of what they do on MNLI, with the various heuristics they use to design the dataset, and you can see you have more or less variability in the fine-tuned models. Some heuristics lead to a really huge variance, which means that you can't really know how your model will behave in the real world unless you're able to test it on real data, and some have a smaller effect.

Now it's really hard to investigate out-of-domain generalization. One way is to use this kind of heuristic. Another way is to try to build datasets ourselves, so we can control them. One really interesting field in this line of work is the work on compositionality. Compositionality is about investigating how your model is able to combine various parts of a sentence to build a meaning representation. This is very important because we think that in linguistics, composition is something important that we do. When I say "the blue dog is going out", you kind of gather "blue" and "dog"
together in a single meaning, and then you combine this with the rest of the sentence to build up the meaning. So there's a nice work called SCAN and PCFG SET, which was actually a really, really long but super interesting paper by Dieuwke Hupkes from Amsterdam University, and they built a huge dataset that replicates some natural language datasets. So they built a dataset in which you have to combine instructions together to generate an output, OK, and you have to combine the instructions compositionally so you can generate the right output. So you have, like, it's written "repeat" something, and that something has to be considered as a single entity, and then the repeat function is applied to it. And they were able to naturalize the dataset, so they were able to reproduce the depth and the length of a translation dataset, of a very big translation dataset, the WMT challenge. OK, so it's an artificial dataset, you generate these instructions yourself, but it really replicates natural language well. And with these artificial datasets you can actually study some out-of-domain generalization. For instance, in your training set you can remove some instructions, some words, so the model will never see them, or some ways to combine words, and you can investigate how the model will learn to do that. And there is one of the most fascinating experiments I saw last year, which is about overgeneralization. Overgeneralization is super fascinating. It's a bit like when you're a kid, you're learning language, and you make mistakes, but these mistakes are smart mistakes. OK, for instance, you will add "-ed" at the end of a verb in the past tense, but it's an irregular verb: instead of saying "I went", you say "I goed". OK, and this is called a smart mistake because it means you've learned the rule; you've just not learned the exception yet. And we really want our models to learn rules so they can generalize outside of the training domain. OK, so you
can investigate that by putting some irregular verbs in this artificial dataset. So here you can investigate that, and the nice thing about this paper is that they compared various architectures: they compared LSTMs, they compared ConvNets, and Transformers together. And what they see, which is really very important, is that LSTMs kind of struggle with this question of overgeneralization, and Transformers are really a lot better; ConvNets are somewhere in the middle. So this is this nice graph. At the top you have very few examples, very few exceptions, so it's kind of hard for the model to learn them. Here you see, during training, red means that the model is overgeneralizing, blue means that the model has learned the exception, and gray means that the model actually doesn't know what to do, so it's predicting random output, which is neither the rule nor the exception. And when you have only very few samples, very few exceptions, LSTMs just can't really get them; ConvNets manage to do that a little bit, but Transformers, they don't. When you have a lot of exceptions, the model learns just to memorize them. That's what we see in neural networks, OK, they are very good at memorizing, brute-force memorization. And when you are in the middle, you see something that's a bit similar to the way a human learns, which is that you start to have overgeneralization during your training, you have a peak when you actually overgeneralize everywhere, and then you learn that there are some exceptions. So it's very interesting, and it shows that these models are capable of some out-of-domain generalization somehow. Now, talking about in-domain and out-of-domain generalization poses the question of how you measure the distance between your two domains, and that's a very open question. There is a large body of work on domain adaptation that is trying to show that you can actually extract some features from your data
set and you can compute some similarity metrics on them, but it's definitely a very open question: how can you measure the distance, in a statistical sense, how can you measure the distance between two datasets, OK, so that you can know when you're not in-domain anymore. OK, so, talking about in-domain versus out-of-domain generalization and this question of sample efficiency, we saw that using these task-specific components was actually a problem, OK, because we have to fine-tune them on each task, and they are limiting how efficient we can be; they are increasing the number of samples we need to learn the target task that we have at the end. And this is actually related to the rise of NLG, so let me show you a little bit. OK, recently we've seen more and more text-to-text models. This was studied by this nice Salesforce paper by Bryan McCann, which was called the Natural Language Decathlon. It was a benchmark where you have ten tasks to tackle, and they were all cast in the same format, in the same framework: they were all cast as question answering tasks. So if you have translation, you would have a question, which is "translate this from English to German", and then you have a context, which is the English sentence. OK, when you have summarization, we actually formulate that as a textual input, which is "summarize this", and then you have the news example. And the model has to generate the output: it has to generate the translation, the German translation, it has to generate the summary. So it's not a classification task like we saw before, it's a generation task. GPT-2 is a big model that made a lot of PR, but it was also a very, very nice paper, called "Language Models are Unsupervised Multitask Learners", and there were a lot of zero-shot experiments in it. How do you do zero-shot with GPT-2? You do the same: you actually formulate your task
with a prompt, which is, for instance, "summarize"; or, for summarization, what they did is they put "TL;DR", too long; didn't read, and then they put the sentence to summarize, and the model is not actually trained for this. So, zero-shot, the model tries to generate some plausible completion, and the plausible completion will be a summary, and GPT-2 is quite good at that. A lot of tasks can be formulated like that, like the LAMBADA dataset, which is a very interesting task where you try to predict the last word, and the last word of the sentence is something that is not explicitly said in the beginning but is just implied. For instance, people are talking about giving birth, but it's not explicitly mentioned in the paragraph, and at the end you have to complete with the word "pregnancy". OK, so the model has to understand the underlying meaning of the sentence to be able to put in the right word. So this is a completion task; it can be formulated as text generation, where you generate the next word. And we've seen a rise of models like this, which are trained to generate words, and where we actually recast our usual classification or usual NLP tasks as text-to-text generation tasks. OK, and we've seen that in a lot of recent models. Facebook's BART model was pre-trained with text-to-text objectives, so it was pre-trained by giving it corrupted text, where you have randomly dropped tokens, randomly dropped words, or the text is shuffled. You can see all the objectives here, and the model is trying to regenerate the clean text from that. OK, so you can formulate this denoising objective as text-to-text generation: corrupted text to clean text generation. And they even trained a multilingual model called mBART on this. So we have both of these models now in transformers, so you can try them. And these models are trained in this framework, and the most famous one is the recent T5 from Google, which is mostly famous because for some time, for a short amount of time,
it was the biggest model, so the 11-billion-parameter model. And T5 is pre-trained and fine-tuned like this: it's pre-trained with a denoising objective, like the one we saw for BART, and it's fine-tuned in a text-to-text format. So, for instance, on GLUE tasks like MNLI, you have to predict entailment or contradiction, and you will formulate your task as a text input, and the model has to generate the word "entailment" or the word "contradiction". Why is this great? This is great because with this we don't really have to fine-tune any additional layer. OK, we don't actually add any layer to our model; we take the same architecture for pre-training and for fine-tuning. There is nothing to fine-tune or to train from scratch, which means that in theory we can do zero-shot, because no weights need to be fine-tuned on the target task. OK, the model is ready to be used on the target task. Now, usually this means we need to do target-task-informed pre-training. That's what is very interesting in the T5 paper, which you can read as well: during the pre-training, they prepare the model for these tasks by giving it some examples of the fine-tuning tasks as well, so the model knows that it will be asked to do some entailment-or-contradiction kind of question. But then you can do zero-shot. And actually, when you look at what Sam Bowman is saying about GLUE and SuperGLUE, its successor, it is actually really hard now to find datasets where we can have a good classification task, a good NLU task, where these models don't already reach human performance. And usually they reach human performance because they are taking advantage of this fine-tuning. OK, so in general I think we should really focus on zero-shot adaptation for transfer learning: zero-shot, or very-few-sample, efficient
adaptation. OK, I hope that's the takeaway of all this discussion. Now let's go back a little bit. We've talked a lot about the quantity of data, we've talked a lot about the size of the models, but all these models have some common problems behind this quantity of data, which is their lack of robustness. There are a few things, but for instance one is the lack of robustness when fine-tuning. So let's start by talking about that, and then we'll talk about the lack of robustness with regard to common sense. When you fine-tune these models, you can usually see some pattern like this. So this is a nice graph from Jason Phang's paper, called "Sentence Encoders on STILTs", and they fine-tune BERT while just varying the random seed, and they show that these models very easily fall into what we call local minima. So you can see this kind of behavior: sometimes the model works, and sometimes it just doesn't work at all. There was a follow-up recently, Jesse Dodge's paper, and it's also exploring how the performance of these models behaves when you just vary the random seed for fine-tuning, and they saw the same thing: the models are very sensitive to the random seed, and they very easily fall into local minima. What I call a local minimum is that the model has bad performance and is stuck in this minimum. OK, so how do we solve that? Usually we solve that with the very brute-force approach that you can see for RoBERTa, for instance, which is that you will fine-tune hundreds of models on various fine-tuning setups, you explore the full hyperparameter space, and just keep the best one. OK, we talked a little bit about that before. Now, that's one way to mitigate that. The other way is that we probably just need better regularization. OK, so the Mixout paper is very interesting. They show that when we fine-tune these models we usually use dropout, and in dropout we
replace some weights by zero. OK, well, when you do fine-tuning, is it good to actually have the model regularized toward zero? Maybe instead of replacing the weights with zero we should replace them with the pre-trained values, so we keep them a little closer to the pre-trained model. OK, and they show that it's actually some form of adaptive L2, and the models behave better with this regularization objective, as you can see on the graph here. Now, regularization can be a lot more complex, and all the work of Microsoft on the MT-DNN models, which were topping the GLUE leaderboard for some time, also covers all these various regularizations you can do. So you can also do regularization where you try to limit the variation in the weights during fine-tuning. There are a lot of ways you can regularize these models, but it's probably the way you should go. And then the last thing you can do is actually just to fine-tune a lot of these models and to ensemble them. You can fine-tune them with multi-task learning as well, in which you actually try to increase the dataset size by gathering several tasks together, and then you train several models that you ensemble, and if it's too big at the end, because you have several models, you can just distill them back into a single model. OK, which is nice, but that's very complex. And actually, when you look at the typical setup right now to get the SOTA on GLUE, which was sent to me the other day, it's pretty crazy. Look at this. You have to pre-train your model, so here, as we said, just use as much data as you can, as much compute as you can. Then you have to tune the fine-tuning hyperparameters a lot. OK, you do that adaptation step, so for some specific tasks of GLUE you will start by fine-tuning on MNLI, which is the biggest dataset of GLUE. So you do some data augmentation: you augment with unlabeled data, like we saw for the Winograd Schema
Challenge; you augment with additional labeled data wherever you can use it. OK, then you can use some tricks that are actually, normally, forbidden, but that everybody uses, like pairwise re-ranking, where you actually exchange information between examples, in QNLI and WNLI. And then you fine-tune as many models as you can, you take the five to ten best of all these models, you ensemble them, and you submit that as your result. So this is just crazy compute, and it's definitely overfitting a lot to GLUE, and it's a big problem. Why? Because of all this hyperparameter search. Now we know that if you take any kind of model, this was shown by Gábor Melis's paper for LSTMs last year, if you tune it well enough you can actually reach some very good performance, but you've used a huge compute budget to tune it. OK, so this was actually formalized in this nice paper by Jesse Dodge, which is called "Show Your Work", which was an ACL paper last year, which says that we should not just report the final evaluation metrics, but we should report what happened during hyperparameter search, because it gives information on how much compute you need to actually get this model to good performance. OK, we talked about that for datasets: we said that if your model needs, like, 1 million samples to get good performance, that should be indicated, and we should know that, so we can also select the more efficient model. And it's the same for hyperparameter search: if your model needs a crazy big hyperparameter search to reach good performance, we should know about that. So the "Show Your Work" paper says that you should give these curves, which show how your model was behaving during your whole hyperparameter search, and what the best one was. We see the same thing with standard splits. These datasets have a standard train/test split, which is nice to compare models, but it leads people
to overfit on these standard splits. OK, so the specific heuristics that work on the standard train split, people can overfit to them and almost encode them in the models, which is bad. So this Kyle Gorman paper, the Gorman and Bedrick paper at ACL, "We Need to Talk about Standard Splits", is also a very interesting read, and they advocate for random splits. I don't know if random splits are the solution; maybe the solution would be to have several standard splits. But definitely just a single standard split is bad, and people overfit, and

Original Description

Transfer Learning in Natural Language Processing (NLP): Open questions, current trends, limits, and future directions. Slides: https://tinyurl.com/FutureOfNLP A walk through interesting papers and research directions in late 2019/early-2020 on: - model size and computational efficiency, - out-of-domain generalization and model evaluation, - fine-tuning and sample efficiency, - common sense and inductive biases. by Thomas Wolf (Science lead at HuggingFace) HuggingFace on Twitter: https://twitter.com/huggingface Thomas Wolf on Twitter: https://twitter.com/Thom_Wolf

The video discusses the future of NLP, focusing on transfer learning, model size, and computational efficiency, as well as current trends, limits, and future directions in NLP research. Viewers can learn about the latest advancements in NLP and how to apply them to real-world problems.

Key Takeaways

Read and understand NLP research papers
Identify current trends and limits in NLP research
Design and conduct NLP research experiments
Analyze and interpret NLP research results
Fine-tune pre-trained NLP models for specific tasks
Improve model performance using fine-tuning techniques
