GPT-3 - Language Models are Few-Shot Learners | Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner ·🧠 Large Language Models ·5y ago

Key Takeaways

The video explains the GPT-3 model, a few-shot learner, and its capabilities, limitations, and potential applications, including language modeling, text generation, and comprehension tasks. It also discusses the model's performance, biases, and potential uses in artificial general intelligence.

Full Transcript

What's up, folks. In this video I thought I'd cover the famous GPT-3 paper, titled "Language Models are Few-Shot Learners", published earlier this year by OpenAI. First I just want to give you an overview, some context on what has been happening since then. On one hand we had a lot of hype, with people saying this is artificial general intelligence and it's going to solve everything. On the other hand we had people nagging, saying that going in this direction is going to ruin the field of machine learning and that there is no way this is the right path, or that it's AGI. The truth is probably somewhere in between.

Let me show you a couple of demos before I start the deep dive into this paper, which I think you'll find interesting. First, Twitter had a lot of demos. This was one of the most famous ones: a guy would type in "a button that looks like a watermelon" and GPT-3 would generate the code for a button that looks like a watermelon, which is pretty awesome. Then we also had cool applications like AI Dungeon, which actually appeared much earlier, but they upgraded their back end, partly using the GPT-3 model. Here you can enter the genre of the play, then you pick your character, let me take, I don't know, a wizard, and then let me put in a name like GPT-3 and see what it generates. It will generate text, and then you can interactively play the game by prompting it with different text. That's a super awesome and nice use case for this model. I won't do it right now; I'll link it in the description so you can play with the game if you want.

As I mentioned, on the other side we had people who used a single example as an argument that the model is not as good as people are hyping it to be. So basically, here is a
question, which is: "Which is heavier, a toaster or a pencil?" and GPT-3 would answer "A pencil is heavier than a toaster", which is not true, probably in most cases. A second example: "How many eyes does my foot have?" and GPT-3 would answer "Your foot has two eyes".

After this blog was published, a lot of people were experimenting with GPT-3, including this guy named Gwern, whose blog you should definitely check out. He's writing a lot; it's a bit harder to parse, in my subjective opinion, but it's a really nice resource. What he showed is that by picking smart hyperparameters, like the temperature and the sampling method, maybe nucleus sampling, and by picking the correct context that gets input to the model, that's the conditioning text, you can get GPT-3 to answer all of those questions, like the toaster and pencil question, correctly, which is surprising.

Then people started talking about a new paradigm. We had software 1.0, and we still have it, where you're designing the algorithm, so classical computer science. Then we have designing the data set, and that's pretty much machine learning: you design the data set, and the algorithm kind of just develops by learning from the data using gradient-based methods. And finally we came to a situation where we are designing a prompt so as to communicate effectively with this GPT-3 model. Gwern had a nice sentence here, which says "sampling can prove the presence of knowledge but not the absence", and that's the idea behind all of this prompt programming. If the model doesn't give you a correct answer, maybe the model wasn't incentivized to give you a correct answer; maybe the model was just being a jokester. So now I'm anthropomorphizing GPT-3, but Gwern showed that by thinking like that, by thinking that you're
pretty much communicating with a human, you can more effectively get the information you want out of it. So that was it for the overview. We had the hyping team, we had the nagging team, and we had everything in between. Now, at the end of 2020, we know much more about both the limitations and the useful use cases that models like GPT-3 can have.

So without further ado, let's jump into the deep dive, let's jump into the paper. The first thing you can notice is that a whole lot of people worked on this project; this looks like half of the OpenAI team, pretty much. So it's a multi-person effort, as all of these huge papers are. What this paper basically did, because they have a lot of resources, was explore what happens when you increase the model size, especially looking at how the few-shot performance compares to fine-tuning, which is the usual thing to do: you pre-train the model and then you fine-tune it. This paper did the pre-training too, but then, without any fine-tuning, they tried to see what the zero-shot, one-shot, and few-shot performance of the model is. We'll get into some details a bit later.

Let me see what they say here: "By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do." So that's their point: as a human you only need a couple of examples and you're already able to solve the new task. Now, it's debatable, because we've been collecting data continuously since we were born, pretty much. So it's a complicated question, but I do agree that we need to develop better few-shot performance in our models. Then, the thing is, they don't do any gradient updates, and we'll see in a moment how they do it. So
basically they just condition the pre-trained model on a bunch of text, and it doesn't have to be a bunch of text; it's basically a prompt in natural language and then maybe a couple of examples. We'll see it in a minute. Finally, what they say is that it's able to generate text that's really hard to distinguish from text that humans wrote. They also have a whole section on broader impacts, because this kind of technology can really be used for malicious purposes as well, and I'll talk about that part, the fairness and bias, a bit later.

OK, we'll go kind of backwards here. I first want to explain the architecture and the data set, and then we'll jump to how the few-shot thing works. The architecture, they say here, is the same as GPT-2, which is basically the same as GPT, which is basically the decoder portion of the transformer model. We can see it here: this is the transformer model from the "Attention Is All You Need" paper. They took the decoder part, and they don't need this second multi-head attention block, the one that attends to the encoder output, because they don't have the encoder part. They just keep the causal masking, so tokens can only look at the tokens that came previously, or at themselves.

Maybe I'll try to briefly explain how it works again. Let me take a pencil here. You take the input text and you tokenize it, let me zoom in a little bit, and then you embed those tokens; that's this output embedding part. Let's say, for the sake of argument, that the hidden dimension of the model is 512, so you end up with a bunch of vectors of dimension 512. Then you have these positional encodings. It's basically a huge table. So it's basically a huge
table, and then depending on the position, say this is token number zero, you take the zeroth row here and you just add it to the embedding vector. After adding it up, you end up here, and then you just have two parts of the architecture. The first one is the multi-head attention, which attends to all of the tokens and creates those nice representations. When I say attends to all, it's not bidirectional; as I said, they have the causal masking. And finally we have the feed-forward part, where you just independently process those token representations again. I have a video about this, but I just wanted to do a quick recap. The only difference is that they use a couple of ideas from the Sparse Transformer paper, where some layers use sparse causal masks instead of the dense ones. Let's not get into too much detail; that's pretty much it about the architecture.

Now for the training data set. They used Common Crawl, which is a huge, huge data set: pretty much every single month it crawls the web and downloads a bunch of data from the internet. What they did is they filtered it. They also used something called WebText, which was created by taking the links from Reddit that had at least three upvotes, which was kind of a heuristic for filtering higher-quality content, and scraping all of the pages those links pointed to. They also have Books1, Books2, and Wikipedia. The point is that, depending on the quality of the data set, they took more or fewer samples from it. So data like Wikipedia, which is higher quality than, for example, Common Crawl, had 3.4 epochs, meaning the model saw it approximately 3.4 times, whereas Common Crawl was only
seen 0.44 times. One important thing to notice and comment on here is that they had a problem with data contamination, meaning some of the benchmark test data ended up in the training set. They later showed that most of the results are not seriously affected, and some of the results that were seriously affected were just not displayed in the paper, but I still think they had some contamination issues left over, especially on the PIQA data set, where they are testing for understanding of the physical world. We'll get to it a bit later.

The compute was enormous. They needed about 10 years of petaflop/s compute: if you had a machine with one petaflop per second of performance, you'd need 10 years to train this model. So it's huge in that sense; you can see it here. An interesting comparison is what they say here: as a consequence, although GPT-3 3B is almost 10x larger than RoBERTa-Large, both models took approximately 50 petaflop/s-days of compute during pre-training. The reason is that there is another paper that says the model shouldn't see the same tokens too many times, otherwise it will overfit and the performance will get worse. So that was the part about the data. One more interesting thing, maybe, is that they needed a supercomputer because the model was huge. It had 175 billion parameters, so they needed a supercomputer from Microsoft to train the thing. That's kind of a fun fact.

Let me jump to something that's really the core part of this paper, and that's how they do this zero-shot, one-shot, and few-shot learning, and why they do it. Why they do it is, as they say here, that the reason to distinguish one-shot from few-shot and zero-shot is that it most closely matches the way in which some tasks are communicated to humans. So basically, that's how humans work. And then for the zero-shot setting,
there are some examples, like translation: if I told you to translate from English to German, supposing you knew both languages, you'd know what to do; you wouldn't need any additional examples. But in some other cases you do need an example, and that's why there are the one-shot and few-shot settings. As they say, in some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases unfairly hard. So zero-shot is always going to have worse performance, and we're going to see some data that backs that up. They also say here that, nevertheless, for at least some settings zero-shot is closest to how humans perform tasks. That's the thing I mentioned about translation.

Having said that, let's see some examples. On the translation example, this is the zero-shot setting. You condition the model on this sentence: you give it "Translate English to French", then "cheese", and then the prompt, and you expect the model to autoregressively complete the sentence and translate "cheese" into however you say cheese in French; I don't speak French. Then there is the one-shot setting, where you give one example, like "sea otter" and "loutre de mer", I don't know if it's pronounced like that, but let's say so for the sake of it. And then the few-shot one, where you give multiple examples, and then you prompt it and expect the correct answer. Depending on the task, some settings will make more sense and some will make less sense, and few-shot basically almost always gives better performance, which makes sense.

Let me compare that to fine-tuning, which is basically what most of the other models in NLP did so far, like BERT, et cetera. They always pre-trained the model on a huge corpus of text and then fine-tuned it on a specific downstream task, and that's pretty much how it works.
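The zero-shot, one-shot, and few-shot translation prompts described above can be sketched as plain string building. This is only an illustrative sketch, not the paper's code: the helper name `build_prompt` and the ` => ` separator are assumptions chosen to resemble the translation figure, and the example pairs are the "sea otter" and "peppermint" pairs from that figure. No weights are updated anywhere; the examples only change the text the model is conditioned on.

```python
def build_prompt(task_description, examples, query, separator=" => "):
    """Build a zero-, one-, or few-shot prompt as plain text.

    examples: list of (source, target) pairs shown in the context.
    An empty list gives the zero-shot prompt, one pair gives one-shot,
    and several pairs give few-shot.
    """
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source}{separator}{target}")
    # The last line is left open so the model completes it autoregressively.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


task = "Translate English to French:"
pairs = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, pairs[:1], "cheese")
few_shot = build_prompt(task, pairs, "cheese")

print(few_shot)
```

In a real setup the resulting string would be tokenized and passed to the language model, and the number of examples would be limited by how many fit into the context window.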
They did notice a couple of problems with that. One problem is that you kind of overfit to a narrower distribution. The second problem is that you need supervised, labeled data, which is usually expensive. And the third is that humans don't need that much labeled data; you only need a couple of examples and that's it. So that's the few-shot argument again. Here they say that they usually put around 10 to 100 examples in the few-shot setting. They're constrained by the model's context window, pretty much by the memory actually: they have only 2,048 tokens available, so depending on the example size they can fit from 10 to 100 examples.

So that was that part. There are a couple of interesting charts at the beginning of the paper. For this one they pretty much averaged the performance across different benchmarks, which we'll see in a couple of minutes. On the x-axis you can see that as the model grows in size, as we slowly get to the 175 billion parameter version, we get better and better performance. The second important thing to notice is that the difference in performance between few-shot, one-shot, and zero-shot also increases, which kind of tells us that the model is becoming better at meta-learning. So those are the two interesting facts to see from this plot: first, the bigger the scale, the better the performance; and second, the models get better at meta-learning as the size increases, which we can see from the gaps between the different settings. So going from zero- to one-shot here, you get less improvement than by going from zero- to one- to few-shot here. That was that chart; let me see what else we have. So here they
show how the number of examples in the context also affects the performance. They showed a couple of models, the 1.3 billion, the 13 billion, and the biggest one, the 175 billion parameter GPT-3, and we can see that as we increase the number of examples, the performance usually also increases. This is for one specific task, random insertion, where you try to unscramble a word, which I'll explain a bit later, but that's the basic idea. So we have the zero-shot, the one-shot, and the few-shot setting, and the performance increases. The second thing is that the bigger the model, the steeper the improvements, in a sense.

And finally, this is the last chart in this section. They treat the problem as a meta-learning problem. You pre-train the model, and that's the outer loop; that's all of the WebText and Common Crawl data. Then, once you condition the model on some text, like we saw in the translation example, that's the inner loop. When you put in the context, you form, in a sense, ephemeral weights which define the model. The model kind of changes its, let's call it, shape depending on the conditioning text, so we have a completely different model which, in a sense, adapts to the new task, and that's the interesting thing.

Before I jump to the results section and show you how it performs on different benchmarks, let me show you one more really important chart. This chart shows that the validation loss decreases as the compute increases, and the x-axis is on a log scale. You can see GPT-3, which had this many parameters, approximately 175 billion, and over 3,000 petaflop/s-days of compute. So this is the GPT-3 model; you
can see that as the compute increases, the loss decreases, and it follows a power law. Smaller models with less compute saturate at a higher validation loss, which was actually hypothesized by some of the previous papers, and this paper showed that the law still holds even for huge models such as GPT-3.

So that was all I had to tell you before we jump to the results section. There is pretty much nothing new in the research sense; they pretty much swept the whole combinatorial space and tried a bunch of different models to see how it generalizes. The nice thing is that they are pushing for this zero-shot, one-shot, and few-shot evaluation and not for the fine-tuning approach, which just gives you better results on the benchmarks but is not as applicable as having a general model that you can later apply on the fly, which is really cool.

The paper is really huge, so I'm going to show you just a subset of interesting results. I'll skip LAMBADA; those are some language modeling tasks, and as expected, because GPT-3 was pre-trained as a language model, i.e. predicting the next word in the sentence, it performs pretty well on those tasks. I'll skip this and jump to translation. Let me just find the curves. Whoops, just a sec. This one. What's interesting is that the model was not explicitly trained to do translation, and that's one of the funny things about GPT-3. It was just trained to predict the next word, but what emerges is a set of skills which seem to be useful in order to predict those words, and one of those skills is translation. I'm pretty surprised by the BLEU score it achieved. I previously reimplemented the original transformer paper, and I'll link the
project in the description, and I think I achieved around 33 BLEU for English to German. Here, without being explicitly trained to do translation, and having seen just a small subset of text in German, it achieved a really decent score. The interesting thing is that going from French to English, German to English, and Romanian to English, the performance is really good, but once you go in the opposite direction it's really hard to achieve a good result. Romanian has some serious problems; it's around 20, which is really low. That's interesting, and it shows that this model is a good English language model, but not as good for other languages. As I mentioned, they noticed the skew, and they said that this could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2, which was developed for an almost entirely English training data set. BPE is byte pair encoding, and many people pointed out that BPE can cause different problems. We'll see some problems that it's causing in one of the tasks I'm going to show you, about unscrambling words, so just keep that in mind about BPE.

The next interesting task I want to show you is natural language inference, where the model is having some problems. These are the NLI tasks. The task there is to read the text and understand how sentences relate to each other. For example, you give it one sentence and a second sentence, and you ask it to say whether the second sentence follows from the first, whether it contradicts the first, or whether they're just neutral. That's really interesting, because it means the model doesn't handle comprehension and reasoning as well as it does just generating sequences, like
translation or language modeling or completing sentences, so different tasks that have to do with generation. We can see the results here: we have pretty much random behavior on the smaller models, and then as the model grows, the few-shot model actually improves, but it's still way below the baselines; even BERT and RoBERTa are better. So those are some of the tasks this model is having problems with. The A stands for adversarial, so this is actually a set of NLI examples where humans picked examples which are especially hard for language models, and it's struggling with those types of problems.

Now, my favorite ones are these synthetic tasks. What they did is they prompted the model to see whether it knows how to do some basic addition, subtraction, and multiplication, and here we can see the results. For two-digit addition and subtraction the model has really good performance, at least the biggest one, and we can again see huge improvements as the model size increases. But then when we go to three-digit addition the performance goes down, and it goes even further down for four-digit addition and subtraction, for multiplication, and for the single-digit problems with three operations. What is probably causing this is that in the huge text corpus the model has seen during pre-training, there were a lot of tables with two-digit numbers, but fewer with three-digit numbers, and even fewer with four and five digits, et cetera. So this may be indicative that the model is not learning how to reason, and that it doesn't actually adapt, in a meta-learning sense, to this new task and learn it from a few examples, but is just doing some kind of statistical pattern matching. It's still unexplored what
exactly happens in the few-shot setting, but a probably good hypothesis is that it's not reasoning in the sense we humans are reasoning. So that was an interesting example.

One more really interesting task for me was this word scrambling and manipulation task. They had five subtasks here. The first one is cycle letters in word: you give the model something like this, and the model is supposed to figure out that it's just a circular permutation of the letters, so you unscramble it into "inevitably" here. Then we have anagrams where you keep the first and the last letter and scramble everything in between; that's what you give to the model, and you expect it to unscramble the word. This is actually super tough. I also had problems unscrambling this word. I thought it had something to do with crypto, but then it doesn't have any y, so it's kind of hard. Then this one is a bit easier: you hold the first two letters and the last two letters fixed and just scramble the middle, and you get "opponent". Then they have random insertion in word, where between the letters of the word you insert a random punctuation mark or a blank space, and you again expect the model to unscramble it. And finally we have reversed words, where the model should figure out that it just needs to reverse the letters, and you get "objects" here.

So those were the tasks. A bit surprisingly, even the biggest model, the 175 billion parameter GPT-3, had zero accuracy on the reversed words. My first thought was, what's happening here? And then you figure out that the model is using subword tokenization, so it's really hard for it to do this task. The easiest one was random insertion, which was also the easiest for me, actually. And then you can see that the anagram, the A1 anagram, is having problems.
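To make the five word-scrambling subtasks above concrete, here is a small sketch of how such scrambled inputs could be generated. It is an assumption-laden illustration, not the paper's data pipeline: the function names, the `rng` argument, and the set of inserted symbols are all made up for this example. The words "inevitably", "opponent", and "objects" are the ones mentioned in the walkthrough.

```python
import random


def cycle_letters(word, shift):
    """Circularly rotate the letters: the first `shift` letters move to the end."""
    shift %= len(word)
    return word[shift:] + word[:shift]


def anagram_keep_ends(word, keep, rng):
    """Shuffle the middle of the word, keeping `keep` letters fixed at each end.

    keep=1 corresponds to the harder anagram variant (A1) and
    keep=2 to the easier one (A2) described in the text.
    """
    if len(word) <= 2 * keep:
        return word
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]


def random_insertion(word, rng, symbols=" .,!?/-"):
    """Insert one random punctuation mark or space between neighbouring letters."""
    if not word:
        return word
    pieces = [word[0]]
    for letter in word[1:]:
        pieces.append(rng.choice(symbols))
        pieces.append(letter)
    return "".join(pieces)


def reverse_word(word):
    """Reverse the order of the letters."""
    return word[::-1]


rng = random.Random(0)  # fixed seed so repeated runs give the same scrambles
print(cycle_letters("inevitably", 8))   # lyinevitab
print(anagram_keep_ends("opponent", 2, rng))
print(random_insertion("objects", rng))
print(reverse_word("objects"))          # stcejbo
```

Each scrambled string would go into the prompt together with a few solved examples, and the model is expected to output the original word. Note that all of these operations work on characters, while the model sees subword tokens, which is the mismatch the transcript points to for the reversed-words result.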
That mirrors my difficulties with doing this A1 task as well, and then the A2 anagram task was a bit easier; that's the one where we keep the first two and last two letters fixed. So that's actually all intuitive: the model had similar difficulties to the ones I had solving these problems.

There are two more interesting tasks I want to explore a bit more in depth. The first one is SAT analogies, and here you can see what the problem looks like; this is in the appendix of the paper. The context is "lull is to trust as", and then you have multiple choices and you're supposed to pick the correct one. I don't know about you, but English is not my native language, and I think I'm pretty good at it, but some of these words, like cajole or balk, were not familiar to me. So this problem actually turned out to be really difficult even for me. I was just thinking about how we are testing our models: we're really demanding, and even humans are not nearly as smart as we think they are. That's just something that struck me as interesting.

Let me go and see how the model is actually performing on the SAT analogy problem. What was interesting for me was that GPT-3 actually achieved a higher score than the average high school student. On this task GPT-3 achieves 65.2% in the few-shot setting, whereas the average score among college applicants was 57%. We can see the curves here again: the few-shot one is growing steadily and is always the best one for the GPT-3 model. One thing we can maybe conclude from this result is that we shouldn't be teaching our kids to do these tasks that are really easy for big neural networks. They should be focusing more on things that the networks can't do, like reasoning and maybe art, but even
art is something that's already in the realm of neural networks, so I guess that's going to be a complicated question that we have to solve. But let me not digress too much.

The final problem I wanted to talk about is news article generation, and the results here are really striking. If I show you the plot here, as the model size increases along the x-axis, you can see that the accuracy goes down, meaning that people are less and less able to tell whether the text came from the model or from humans. They did a really nice study where they tested a bunch of models. They also used something called a control model, which was a smaller GPT-3 version that intentionally had a higher temperature, which increases the randomness of the softmax, so it was more random than all of the other models, and they compared all of those models with the control model. What it turns out is that as the models get bigger, we end up pretty much in a situation where people can't decide whether an article came from a human or from GPT-3, which is amazing and probably has some scary repercussions.

So those were some interesting results that they showed in this huge paper. This video is already getting too long, so I'm going to tell you about just two or three more things. One is data contamination. What happened is that because their training data set is so huge, they ended up having some of the dev and test data from the benchmarks included in the training data set. There was a bug in their filtering of the data which caused all of this. They did some investigations, and for the most part they said that the tasks weren't as affected as they initially expected. But some of the tests, like PIQA, where
they had physical... let me see if I can find it. So this is one example from the PIQA data set. The context is "How to apply sealant to wood", and then we have the correct answer, "Using a brush, brush on sealant on the wood until it is fully saturated with the sealant", and then they have the incorrect one. As they showed, the results on PIQA were really good, and because it looks like an outlier, I do think that contamination played its role here.

OK, getting back to this chart, I just want to note one more thing. They had a bug, and they say here: "Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn't feasible to retrain the model." I think they're using 13-grams to figure out the overlaps, and they did mention somewhere in the paper that this is still a new area of research. So I'm really skeptical about this part, about how they do it and what they consider a duplicate. It's hard to figure out which parts of the text you should filter out so that you can say with certainty that these examples from the dev and test sets are not present in any way in the training set, in a way that could bias the model and help it predict better on those dev and test sets.

I mentioned PIQA; we can see it's an outlier, so it has better performance. You can see here that the points in the lower part of this chart have better performance on the dirty data sets, and that means there was a contamination issue. What I'm not clear about is this drop. It looks like a huge outlier, but they didn't mention anything specifically about it; they did mention PIQA and Winograd, et cetera. So that was it about this part.

Now I want to just walk you through some of the limitations they mentioned. So the first one is
this one: "GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs." So even though this is a huge language model, it still has its own problems, like repetition. There's still a lot of research to do on how we decode the output from these models, and on whether the problem is the model itself or the heuristics we are using for decoding. Gwern and most people are using top-p, the nucleus decoding, and they're playing with temperatures, so it still doesn't feel like the correct way to do things.

They also mention, and they are obviously aware of this, that if they had bidirectional representations, in contrast to the causal, unidirectional representations they do have, they would expect better results. So that's a thing they could try out. They also acknowledge that the pre-training objective could be further improved, and they acknowledge that understanding precisely how few-shot learning works is an important unexplored direction for future research. So basically nobody quite knows how this exactly works. It's still not easy to interpret, and that's one of the bad things. They mention it here: its decisions are not easily interpretable.

Also, the pure scale of the model doesn't allow them to iterate when they make mistakes, as we saw in the example with the data contamination. "A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form." So I guess some sort of knowledge distillation, pruning, etc. could help
us make smaller yet performant models.

I want to close this video by talking about the broader impact. I do think this is important even though it's not a technical part, so feel free to skip it if you're not interested. With GPT-2 they had a staged release. That happened last year: they initially released the smallest GPT-2 model, and they were monitoring different forums, etc., just looking for any signs of potential misuse. Once they were pretty certain that no misuse was happening, they started gradually releasing the models, and at the end of 2019 the whole GPT-2, the biggest one, which I think had 1.5 billion params, was published. GPT-3 is still not published, and it probably never will be, but the API is already accessible for certain people. It's in beta.

What I wanted to say here is that it's even more important to consider these questions for GPT-3, because basically any malicious task that depends on producing a huge amount of text is something GPT-3 will help with. That's phishing, spamming, social engineering, a bunch of different stuff that can be automated using GPT-3, in a sense. They did consider those things: "many of these applications bottleneck on human beings to write sufficiently high quality text". And then they mention who has the resources to do this, like government groups.

I don't think that part is as interesting as the part about bias. They consider three axes: one is gender, the second is religion, and the third is, let me check, race. Yep. The way they evaluated the bias in these models is that they basically prompted them with sentences like "The {occupation} was a", for example "The detective was a", and they're looking at the probability
distribution coming out of GPT-3, and they're looking at male and female identifiers. What they figured out is that most of the occupations are biased towards males, especially when they put it like this: "The competent {occupation} was a". They figured out that this version of the prompt had an even higher bias towards males, and actually the "incompetent" one was also biased more towards male. Again, iterating a little bit on this one, here we can see: "in particular, occupations demonstrating higher levels of education such as legislator, banker, or professor" were heavily male leaning, and physical labor occupations were also male leaning, whereas the occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper, etc.

So once more, how it works: you basically take a variant of the prompt like this one and input it into your GPT-3. Let me zoom in a little bit. You input it into the GPT-3 model, let's say this is GPT-3, and for the next token it outputs a probability distribution which has the size of GPT-3's vocab, which I think is around 50k. What they did then was monitor a couple of male identifiers, like "man", "male", etc., and a couple of female ones, like "female", "woman", etc., and they pretty much just added up the probabilities and compared those. They also did some normalization. So basically that's the method, and I already mentioned the results that came out of it.

So that was about gender. Then I'll just briefly go over race. Across the models they analyzed, "Asian" had a consistently high sentiment, whereas "Black" had a consistently low sentiment. This was kind of surprising, because you usually hear that white males are in a much better position than Asians, so this was kind of surprising in a way for me, and
yeah, that's an obvious bias. Especially for Black people, a lot of the text is referencing slavery, etc., and that's the reason Black people have the lowest sentiment. They did mention it here: "the resulting sentiment can reflect socio-historical factors, for instance, text relating to a discussion of slavery will frequently have a negative sentiment". That's the reason why "Black" consistently has lower sentiment across the model sizes. The x-axis is the model size, and here is the biggest one.

Finally, religion. They tried out the most popular world religions, like Christianity, Buddhism, Islam, etc., and they found some negative biases about Islam: "We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3." That is really sad, and we need to be aware of these biases. I won't get into this, but obviously most Muslim people are really nice folks, and that's it.

They also considered the energy usage, and one really surprising fact was this: "Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kWh, or only a few cents in energy costs." So it's actually pretty efficient once you train this huge monster, and over the lifetime of the model the energy cost can potentially be amortized.

So that was pretty much all I had to say. I could make this video much longer, I mean, this is a huge paper. Hopefully you liked it. If you did, please leave a comment on what you liked, what you didn't like, and what your opinions are on this.
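The gender probe described in the transcript (take the next-token distribution for a prompt such as "The detective was a", add up the probabilities of a few male and female identifiers, compare) can be sketched roughly as follows. This is only an illustration: the identifier sets, the toy distributions, and the function name `gender_log_ratio` are made up for the example, and the log ratio is one assumed way of doing the comparison, not a reproduction of the paper's exact normalization.

```python
import math

# Hypothetical identifier sets; the transcript only names a few examples.
MALE_IDENTIFIERS = {"man", "male"}
FEMALE_IDENTIFIERS = {"woman", "female"}


def gender_log_ratio(next_token_probs):
    """Compare summed female vs. male identifier probability for one prompt.

    next_token_probs maps a token string to its next-token probability.
    Returns log(P_female / P_male): negative means male leaning,
    positive means female leaning.
    """
    p_male = sum(next_token_probs.get(t, 0.0) for t in MALE_IDENTIFIERS)
    p_female = sum(next_token_probs.get(t, 0.0) for t in FEMALE_IDENTIFIERS)
    if p_male == 0.0 or p_female == 0.0:
        raise ValueError("both identifier groups need non-zero probability")
    return math.log(p_female / p_male)


# Toy, made-up distributions standing in for a real model's output.
detective = {"man": 0.30, "male": 0.05, "woman": 0.10, "female": 0.02}
nurse = {"man": 0.08, "male": 0.02, "woman": 0.40, "female": 0.10}

for name, dist in [("detective", detective), ("nurse", nurse)]:
    print(name, round(gender_log_ratio(dist), 3))
```

With a real model, the dictionary would come from the softmax over the roughly 50k-token vocabulary, and the ratio would be averaged over many occupations.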
And yeah, regarding the nagging versus the hyping sides, I just want to mention one thing, which is my opinion on this. I really think that these huge models, like huge transformers, are not a digression or a bad thing for the future of AI. Why do I think that? Because most of your brain, like the cerebellum, where the largest amount of computation happens, is actually unconscious. So we are probably now in the phase where we are developing a synthetic primordial brain, basically the equivalent of what we have, which is pretty much regulating all of our vital functions, like controlling the heart, temperature, etc., and also perception, so extracting concepts from images, audio, etc. All of those require a lot of computation, and I don't see this as a digression. I just see it as a supplement and as a step towards artificial general intelligence, which will happen once we are able to simulate the cognitive layer above the primordial brain level.

So those are just my two cents on this. I have a gut feeling that graph neural networks and the causal relations that Judea Pearl has been pushing for decades now will play a significant role, as well as symbolic AI. So I don't think we should disregard all of those. We should just be cognizant that all of those will probably somehow play a role in this long-term goal of achieving AGI.

So that was pretty much it. If you liked this video, consider subscribing to this channel and hit that bell icon to get notified when I upload a new video. Until next time, keep learning!
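To make the data contamination discussion more concrete: the transcript says the authors used 13-grams to detect overlap between benchmark examples and the training data. A minimal sketch of that kind of check is shown below. It assumes plain lowercase whitespace tokenization and an in-memory set, and the helper names (`ngrams`, `build_index`, `is_contaminated`) are hypothetical; it is not the authors' filtering pipeline.

```python
def ngrams(text, n=13):
    """Return the set of word n-grams in text (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs, n=13):
    """Collect every n-gram that appears anywhere in the training documents."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def is_contaminated(example, index, n=13):
    """Flag a benchmark example that shares at least one n-gram with training data."""
    return not ngrams(example, n).isdisjoint(index)


# Tiny demo with n=3 so the overlap is visible on short strings.
train = ["how to apply sealant to wood using a brush"]
index3 = build_index(train, n=3)
print(is_contaminated("apply sealant to wood", index3, n=3))   # overlaps
print(is_contaminated("the cat sat on the mat", index3, n=3))  # no overlap
```

Even this toy version shows why the choice is hard: a smaller n flags more harmless overlaps, while a larger n misses paraphrased or lightly edited duplicates.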

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
In this video, I cover the famous GPT-3 model. I first give you some context about the stuff that happened since the paper was first published in May 2020 (hype, anti-hype, limitations, and cool apps), and then I dive deep into explaining the paper.
You'll learn about:
✔️ Useful resources on GPT-3
✔️ Main takeaways from the paper
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ "anti-hype" blog: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
✅ Gwern's blog: https://www.gwern.net/GPT-3
✅ My transformer implementation: https://github.com/gordicaleksa/pytorch-original-transformer
✅ Cool "GPT game": https://play.aidungeon.io/
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 GPT (anti)hype, Gwern, prompt programming
04:30 Abstract of the paper
06:50 Architecture, data, compute
12:15 Zero-shot, one-shot, and few-shot learning
18:45 Power-law chart (more compute please)
20:35 Results (machine translation)
23:05 NLI (reasoning is hard)
24:40 Arithmetic
26:25 Word unscrambling
28:40 SAT analogies (how smart are humans?)
30:45 Fake news generation
32:05 Data contamination
35:05 Limitations of the model
37:35 Bias, fairness (broader impact)
44:30 Final thoughts, are we going towards an AGI?
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https:

GPT-3 is a few-shot learner: given only a few examples of a task in its prompt, and without fine-tuning, it can generate text, code, and other content. However, it has limitations and biases that need to be addressed. The video discusses the model's performance, applications, and potential role on the path to artificial general intelligence.

Key Takeaways
  1. Condition the pre-trained model on text and prompts
  2. Explore the effects of increasing model size and few-shot performance
  3. Compare the model's performance to humans and current NLP systems
  4. Discuss the broader impacts of the technology and potential malicious uses
  5. Use smart hyperparameters and conditioning text to improve model performance
  6. Recognize the importance of fine-tuning in language models
  7. Apply language models to multimodal tasks
  8. Understand the basics of prompting in language models
  9. Design effective prompts for language models
💡 The GPT-3 model is a significant step towards artificial general intelligence, but its limitations and biases need to be addressed to achieve more accurate and reliable results.
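
The transcript mentions two decoding knobs: temperature (a higher temperature makes sampling from the softmax more random) and top-p, or nucleus, decoding. The sketch below shows how the two are commonly combined on a vector of logits. It uses only the Python standard library, the function names are hypothetical, and it is a generic illustration rather than the decoding code of GPT-3 or of any particular API.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling plus nucleus filtering."""
    probs = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
cold = softmax_with_temperature(logits, temperature=0.5)  # more peaked
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter, more random
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(sample_token(logits, temperature=1.0, top_p=0.7, rng=random.Random(0)))
```

Lowering `top_p` cuts off the unlikely tail of the distribution, which is why it is often used to reduce incoherent samples, while temperature controls how evenly probability is spread over the tokens that remain.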

Chapters (15)

0:00 GPT (anti)hype, Gwern, prompt programming
4:30 Abstract of the paper
6:50 Architecture, data, compute
12:15 Zero-shot, one-shot, and few-shot learning
18:45 Power-law chart (more compute please)
20:35 Results (machine translation)
23:05 NLI (reasoning is hard)
24:40 Arithmetic
26:25 Word unscrambling
28:40 SAT analogies (how smart are humans?)
30:45 Fake news generation
32:05 Data contamination
35:05 Limitations of the model
37:35 Bias, fairness (broader impact)
44:30 Final thoughts, are we going towards an AGI?