A Comprehensive Overview of Large Language Models - Latent Space Paper Club

Latent Space · Advanced · 📄 Research Papers Explained · 2y ago

Skills: Reading ML Papers 90% · Research Methods 80% · LLM Foundations 80% · LLM Engineering 70%

Key Takeaways

The video walks through the survey paper "A Comprehensive Overview of Large Language Models," starting with the history of attention before the Transformer and moving on to the post-Transformer, GPT era. Along the way it covers conditional language models, the encoder-decoder architecture, and attention mechanisms, with an emphasis on paper reading and research methods. The skills highlighted are LLM foundations, LLM engineering, and paper reading.

Full Transcript

All right, cool. Hey guys, thanks so much for coming by the paper club. As usual, this is a paper club we run out of Asia, where we go through one paper every week. Today we're recording it for the first time, and we hope you benefit from it. As usual, if you've got any questions, you can either let me know and I can invite you on stage, or you can drop them in the chat, which you can access by clicking the button at the top, the little message icon. You want to take it away, Brian?

Sure, thanks Ian. Today we'll be going through "A Comprehensive Overview of Large Language Models," but on top of that, what we also want to do is share the reason why attention actually came about before the Transformers paper. So we'll have a little bit of a history lesson on why it was developed, and then we'll go through the paper, talking about what has happened post the Transformers era, which is really when the GPT era started. As you can see, the link has two parts: I'll use the first part to talk about, I would say, pre-GPT, and then I'll use the second link to talk about the paper proper.

So let's begin. Essentially, what models have been trying to do recently is this idea of language modeling: given a previous sequence of words, which is your input or your prompt, you want to find the next word. In this case it can be question answering. It can be modeled by the probability of the next token given the sequence of tokens so far, so you can see the next token, at position t+1, conditioned on the sequence up to position t. And of course the token at t+1 is sampled from the vocabulary that you have, which is basically your subwords or tokens.

So why does this matter? I think for those of us doing NLP, beyond just
thinking about what's being generated in the sequence, it's useful to think about what kind of use case or what kind of task we're doing. I'd say this is very useful when it comes to thinking about the evaluation metrics for each of these tasks. So you can be...

Your screen just kind of cut out for me. Is it... okay, let me see. Oh wait, sorry, no. Okay, it works again. Sorry, my bad, it just disappeared. It works.

Yeah, no problem. So, things like machine translation, which we'll be talking about, question answering, summarization, and so on and so forth. Essentially, it's good to think about what task we are trying to attack when we are using the different models.

While we think about language models as predicting the next token, it's also useful to think, from a linguistic perspective, about what is being learned by these models. There's a list over here; I'll just go through a few that are useful. Things like facts, which is trivia: these are the ones where you can say the penalty for getting the prediction wrong is relatively higher, because if you output something that's false, then your language model is probably not truthful. Things like sentiment, which we have seen before. Things like reasoning: if you look at the sentence "Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ...", the idea is that the model needs some spatial understanding of the sentence. In this case Zuko is currently in the kitchen, so he left the kitchen. These are some of the things that, from a syntactic or linguistic perspective, we observe models learning in terms of patterns.

From language models, we move on to conditional language models. Essentially the idea is that we are trying to generate a target sequence in a
target sentence given some sequence in the source sentence. That is why you see over here that we are not just generating our y_t given y_1 to y_{t-1}, which is basically the sequence that has been generated by the model so far, but we also want to condition it on the source sentence. That is essentially what translation does: if you think about it, you give the model a source sentence, you pick the target language, and then you observe the model generate the sequence in the target language. So it's more than just language modeling; it's also conditional.

One of the key things we notice in conditional language modeling is that the first word in the source sentence doesn't necessarily correspond to the first word in the target sentence. As you can see, this one might be first word to first word, but from the second word onwards you start to see this sort of cross relationship, where maybe the second word over here corresponds to the third word, and the third word over here corresponds to the second. So we want to find a way to model this relationship.

This relationship has actually been studied before, in the idea of alignment. Let's say we've got the source sentence on the top and the target sentence on the left. If we've got a very linear, one-to-one relationship, a monotonic relationship, then we'll see white boxes running from the top left to the bottom right, indicating that the first word corresponds to the first word, the second word to the second word, and so on and so forth. But as you can see, just from English to French, words that are later in the sequence can correspond to words that are earlier, and vice versa. So that is how we can visualize
attention. So then the question is, how are we modeling it, or what does it look like from the encoder-decoder perspective? Let's look at this as an RNN. We say that the last hidden state in the encoder block contains all the information of the entire sentence, but there's an information bottleneck problem: if this is a longer sentence, the last hidden state might not contain information about the earlier tokens. Therefore there's this idea of attention: given that you've got the hidden states of all the input tokens, the decoder, during generation, will pay attention, or attend, to a weighted sum of all those hidden states. So if I've got something later in my target sequence that corresponds to a token earlier in my source sentence, then I'll see the attention weights giving more weight to that token's hidden state in the source sentence. Essentially that's the idea of attention as it was implemented in the encoder-decoder paradigm.

The problem is that when we calculate these individual hidden states, they have to be calculated sequentially. In this diagram, the second hidden state can only be calculated after the first hidden state has been output, and the third can only be calculated after the second. So the question is, can we break free from this dependency on the previous state? If we can, then we're able to run our forward pass, collect our gradients, and run backprop on the architecture concurrently across the whole sequence. Essentially that's the idea of your key-query-value attention,
and that essentially forms one of the building blocks of the Transformer architecture. From here, there are other components to the Transformer architecture beyond just our key-query-value attention. There's the idea of understanding the position of the text, which is basically adding position representations that you'll see in the paper later. There's adding some sort of nonlinearity to the calculation, and that's essentially just adding a feed-forward layer on top. The idea is that if you're only calculating the query-key-value part, you're always looking at linear combinations of your values, because you're just getting a weighted sum of the values calculated by attention. So we want to add a layer of nonlinearity, which is taken care of by the feed-forward network. And the last part is that in the decoding step, when you're generating tokens, you don't want to let the model see future tokens, and that's when attention masking comes into play. You'll see that in the decoder architecture later down the road.

Okay, a couple of things on top of the language modeling component for Transformers. One topic is subword models, which is where you have things like tokenization and byte pair encoding. What are we trying to solve here? If you look at the table at the bottom, we see words that exist outside the vocabulary. They can be a variation of an existing word, in this case adding many A's in the middle of "tasty" to indicate that something is very tasty; or misspellings of words, which are also very common in input; or novel words, like over here, where we understand that "Transformerify" might mean adding a Transformer block
into an existing architecture, but it's a word that we might not see in the existing dictionary. For these words, if you just use a traditional dictionary vocabulary, the index will be some sort of UNK token. What goes on with byte pair encoding is that it learns shorter combinations of letters that can sometimes represent prefixes or suffixes of a word, and then you're able to generate embeddings for them. So over here you've got "taa" and anything after that, "aaa" and anything after that, and "sty". This last one you've probably seen in other existing words, so there's an existing embedding associated with it, and therefore we're able to represent it. You can think of it as generating three tokens from this source sequence. Essentially that's the idea of subword models, with things like byte pair encoding, SentencePiece, WordPiece and so on; that's the problem they're trying to solve.

Okay, so three types of architectures. The key thing to note is that the Transformer block is essentially replacing the recurrent neural network blocks that we had previously. When we talk about recurrent neural networks, of course that includes things like LSTMs, GRUs, bidirectional models, multi-layer models; it encompasses all of that. What we have over here are the three dominant architectures. We've got encoder models, and examples of this are things like BERT, where you learn via masked language modeling, which has been covered before. We've got encoder-decoder models, which we saw earlier: an encoder maps your input sequence to a position in latent space, and from there you perform your autoregressive sampling of tokens to
form your target sequence, which is what we've seen in T5. And we've got decoder models, which I think all of us are familiar with: things like GPT-2 and GPT-3 are all there. You essentially learn the patterns of the language, and then you directly do your autoregressive sampling and decoding from there.

Okay, so that leads us to the paper we have over here, "A Comprehensive Overview of Large Language Models." If you take a look at this paper, it seems there have been multiple updates to it, and that signals to me that there are probably going to be more updates along the way. So I think what's useful, beyond looking at just the paper itself, is what I did, which was to try to understand the framework the authors were using to attack the body of knowledge, divide it up, and present it for a reader to understand. It's a very dense paper. It's got, I think, over 450 citations, so it's more of a pick-your-own-adventure, pick-your-own-journey kind of learning process, so that along the way you can build the foundational knowledge and then add layers on top of it. At the end of the day, we all know that new models are always being developed and announced, so going back to first principles and fundamentals is useful.

So let's go through the paper very quickly, starting from the top. We'll just talk about the last point over here, where we see that large language models, in particular things like GPT-3, are able to perform downstream tasks without specific fine-tuning. That's the first key point, because if we look at T5, we saw that its performance on downstream tasks, which can be translation, your GLUE tasks, your SQuAD tasks, only gets better once you fine-tune on that
particular task. You've seen the multiple experiments they did demonstrating that that's the better alternative. What GPT-3 demonstrated was that it could perform zero-shot transfer learning on these tasks. What does that mean? It means that if you just give it the prompt from the downstream task, GPT-3 is able to give the answer. That kind of changed things: we might not actually need to fine-tune for a particular task. Of course, later down the road we see that there are very specific ways of doing things like instruction tuning, but that was one of the big discoveries back then. On top of that, they were able to show things like reasoning, planning, and in-context learning. We'll see examples of this later when we get to things like chain-of-thought prompting. Given certain patterns in the prompt, when you ask for a task that follows a similar pattern, the models are able to answer.

The problem we see today is that the cost of training, or pre-training, them is relatively high, usually in the tens of millions. So the question is, can we get better at pre-training these models? Can we look at better architectures? Can we look at more efficient ways of fine-tuning parameters? Are there ways to represent these vectors at lower precision, in a state that uses less granularity? That's where things like architectures and quantization come into play.

The way I saw this paper was that we have the background, which talks about some of the key concepts; then the different types of LLMs and their particular use cases; the data sets that have been used
to train them, at least the public ones; what kind of evaluation tasks they're looking at, which is what we call evals; and the different types of applications for these LLMs in the commercial world. From there, we talk about what researchers are probably looking at going into, say, the next three months or the next year.

So let's look at some of the fundamentals. I'm going to start from the left side. We have already covered some of these topics from the paper: tokenization, attention mechanisms, the different types of activation functions. Those are things you can get a recap of from your traditional deep learning topics. Then of course there are the different types of architectures, which we covered earlier: encoder-only, encoder-decoder, decoder-only. Naturally each of them has its own associated way of doing attention masking, so that's this part over here.

We talked about the different types of pre-training objectives. Naturally, masked language modeling is what we see in your encoder-only models, and full language modeling is what we see in your decoder-only models. Masked language modeling, in this diagram, is where you feed the model this token and these tokens over here, and the model is expected to predict these targets, the ones that have been highlighted. So essentially it's like a fill-in-the-blank kind of problem. Whereas for full language modeling, you give the first token and the model is expected to predict the second, third, fourth, fifth token, and so on and so forth. There has also been research into something called prefix language modeling, where you feed the model one part of the sequence and ask it to generate the remaining part. What's useful here is that when they do
prefix language modeling, they use something called a causal mask with prefix, which means that for the input tokens, the model is able to see, or attend to, all the tokens in the input before it starts to generate output. And as the model generates the output, you still have that element of masked attention. So that's this part over here.

Then, if you look at the Transformer paper, which is covered over there, there are things like layer normalization, where you subtract the mean from the activations and divide by their standard deviation. Essentially what they were trying to achieve is numerical stability, so that when you do a forward pass and your backpropagation, you don't have numbers that go all over the place. So that's layer normalization.

Positional encoding is something we talked about earlier. In the original paper they had this idea of sinusoidal position representations. So how do you read this graph? As you go from left to right, as the index in the sequence increases, you're applying some sort of sinusoidal function such that every token in the sequence is augmented by a positional representation. So from left to right, all these vectors actually look different. But this way of encoding position is not learnable, because there's no gradient with which to update the positions. Therefore it has been changed to something as simple as just adding a learned position representation on top of the embeddings. And of course, if you look at the paper, there are also newer ways to do it, things like ALiBi, things like
RoPE. So that's the left-hand side. On the right-hand side over here, we're looking at newer ways of doing things, or ways that can help with training or implementation. There are the libraries that we're using: JAX, PyTorch, TensorFlow, amongst others. There's the idea of distributed training: can we use multiple GPUs to train our models so that we're able to learn the weights faster? Amongst others, there's data parallelism, where you duplicate your model across GPUs. Let's say I've got two GPUs: I duplicate my model in both GPUs and then I run separate batches on them. Say I've got a batch of, I don't know, 100,000. I split it into 50,000 and 50,000, run the first 50,000 through the model in the first GPU and the other 50,000 through the same model in the second GPU, calculate the gradients, average them, and then perform my weight update. That's what data parallelism is.

With tensor parallelism, the idea is that you calculate the matrix multiplication steps across multiple GPUs and then combine the results. As you can see, we know that for each row, the multiplication with a column can be done concurrently. So it splits things up such that the matrix on the left multiplies with only the first column block on one GPU and with the second column block on another, and then you combine them, or in this case concatenate the results together. That again helps us get the results from the forward pass a lot faster.

Okay, so that's that. Other kinds of tricks we're using are things like FlashAttention, which is a very smart way of utilizing memory. Instead of a series of steps that is very memory-intensive, where you load your query and key matrices, do the calculation, perform the softmax, and then get your results, they are doing
it in an iterative way, using very smart functions to calculate things like the softmax on the fly. So essentially it's an optimization of how you use your high-bandwidth memory and the on-chip memory in your GPU, because on your GPUs you've got very fast computation but relatively little memory. Just a little bit extra: this is one of those very common topics that people like to start off with as they go into things like your Mamba models.

So that's the first part. The second part of the background is how we adapt these models for specific tasks. There's transfer learning, which we've seen before, where we pre-train our T5-based model and then fine-tune on individual tasks. There's also instruction fine-tuning, where the model is given a series of instructions and outputs, and it fine-tunes its outputs based on that. As an example, let's say I ask GPT to "explain the moon landing to a six-year-old in a few sentences." Generally, if the model is only pre-trained, GPT continues the pattern like this: "explain the theory of gravity to a six-year-old," "explain the theory of relativity to a six-year-old," then "explain the Big Bang to a six-year-old," and "explain evolution to a six-year-old." That's how GPT-3 will output. But if we do some sort of instruction fine-tuning, where there is an emphasis on things like "six-year-old" and "in a few sentences," then this is the kind of output you can get. That's the kind of variation between different models that we see when we download them from open-source repositories like Hugging Face. So that's instruction fine-tuning.

Then there's something called alignment tuning, where you want to ensure that your model fulfills what people call
the three H's of model behavior: your models should be harmless, honest, and helpful. For harmlessness, let's say you ask the model, "How can I bake a cake with C?" If your model is not alignment-tuned, it might give the instructions. But if you do alignment fine-tuning to tell the model, "Hey, this is something you shouldn't give instructions for," then the model will learn from that. These are some of the methods we use to fine-tune our models so that they demonstrate a certain behavior.

How are we doing it? We can use techniques like reinforcement learning, where for each of the different outputs you have a certain reward, in this case just a scalar value, and you learn some sort of policy such that when the model outputs text based on this policy, you maximize the reward. The key thing here is that the policy has to be differentiable, so that when you get some output from the model and some reward, and sometimes your reward might not be good, or you're comparing rewards, you're able to get a loss from the reward and backpropagate it to update the weights in the policy. That's essentially what reinforcement learning is. Back when reinforcement learning was the hot thing, it was typically one course by itself, so this is just a very high-level, five-minute overview of it.

On top of that, one of the things you're probably more familiar with is prompting. We've got zero-shot prompting, where you just give a task and the model answers directly, but you also have things
like few-shot and chain-of-thought prompting, where you give the model some examples first and then ask the model to mimic the behavior of the examples above. That's what you have over here: in-context learning of translation on the right, and in-context learning of correcting spelling mistakes on the left. You can see that few-shot, say five shots or three shots, usually performs better than zero-shot or one-shot. And then the question, of course, is how you craft these prompts so that you get the results you want. That's essentially the idea of what people like to call prompt engineering.

Okay, so that's the part of the background that we wanted to cover. The next part over here is, I would say, a very brief list of some of the models that we have now. Keep in mind that the list of these models is updated every two or three weeks, so naturally, when the paper is updated in the future, you'll see additional models. Some of the high-level purposes these models are trying to achieve: there are the general-purpose ones, where you get your model to do all sorts of things. There are your multimodal ones, where you give the model an image and then ask it to decipher some fact or draw some conclusion from the image. There are also your video-related ones. There are some that are very specific to code generation, so here are some of them. Some are very specific to the finance domain, some to the science domain. And of course there are some that are very useful for chatbots. So this
is the list over here. There's a much more detailed list in the paper itself. Having said that, as mentioned, additional papers keep coming out, so there are also some missing models that were not mentioned. These are some of them. So it's good to understand that this is always an evolving list.

Okay, so what are some of the features we see in these models? You've got instruction tuning, which we talked about earlier. We've noticed that models have increasingly large context windows: context windows are now in the six figures, sometimes even the seven figures. There are also other ways in which LLMs can be used; I think a very popular one is RAG. And beyond general-purpose use, you can always fine-tune them for very specific purposes, or for purposes that are specific to your own corpus or your own knowledge base.

Then there are other topics for the reader to explore. Most of these topics, if you look at them, are about parameter-efficient fine-tuning. There's quantization, where instead of representing a number in 32 bits, I represent it in 8 bits or 4 bits and see if I can still maintain the model's accuracy. Generally the accuracy will go down, but you're able to get lighter, smaller models, which is actually very useful. There are multimodal LLMs, which we talked about earlier, that take in things like images and video as inputs. Adapter tuning is essentially when you add another layer on top of the output and then perform fine-tuning on it. There are more sophisticated ways to use it, where your adapter is used in two or more models, meaning the same adapter is used in, let's say, a
general model and also, let's say, a GPT model. I've seen that in the talk about embeddings and representation learning. Mixture of experts is something we've seen before: instead of having just one feed-forward layer over here, you're able to route tokens to different feed-forward layers, and then combine their outputs. So you're able to leverage different, I would say, vertical workflows in the model, where each of the verticals learns different aspects. That's essentially your mixture of experts.

Low-rank adaptation, or LoRA, looks very popular recently. Essentially, if you're able to reduce the number of parameters in your gradient updates, then you use less compute to get your fine-tuned models. The idea behind it is this: let's say, instead of calculating gradients for 64 parameters for an 8x8 matrix, you decompose the update to this matrix into an 8x2 and a 2x8 matrix, which is only 32 parameters. The key thing is that when you multiply this by this, you get back an 8x8 matrix, which is 64 weights. If you're able to decompose it with weights in a smaller dimension, and of course how small is a hyperparameter for you to tune, then the cost of fine-tuning will go down. So that's essentially what we're doing over here.

Yeah, so that's pretty much it for this segment. This next segment is about the data sets that can be used for training, at least the public ones that we see. These are things we've seen before: the Wikipedia data set, the C4 data set, Common Crawl, which are for your, I would say, more general-purpose models. And then of course you've got some data sets that can be used for very
task-specific models, for example code generation. You've got datasets that are used for instruction fine-tuning, and you've also got datasets that are used for alignment. Essentially, if you go to, say, TensorFlow Datasets or Hugging Face, you'll be able to download them and look at these datasets yourself. If you want to fine-tune a model for a specific use, these are useful templates or schemas that you can use to prepare your dataset for fine-tuning. So this one is for instruction tuning, and this one is for getting the model to display behavior that's more aligned to our use. Naturally, for this one I'm okay to share some examples, but for this one you can go ahead and click on the link to see the kind of examples that are there.

So let's say we've done our training or fine-tuning, and we've found a way to update our parameters in a more efficient way. The final part is of course evaluation. I'll cover, at a high level, two classes of model evaluations. You've got single-task evaluations; very popular ones are things like SQuAD, Story Cloze, MATH, and MNLI. These cover question answering, understanding the context of words where you're filling in the blanks, answering math questions (so mathematical reasoning), and, I believe, natural language inference, which is essentially whether the next sentence logically follows from the first sentence. There are also things like TruthfulQA, which validates whether the model outputs facts instead of, say, trivia that's not truthful. So these are some of, I would say, your single-task
evals, and then you've got your multitask evaluations: things like GLUE, MMLU, and SuperGLUE, and of course there are a couple more in the list. If we just take a look at GLUE, it is divided into multiple individual evaluations. You've got natural language inference; you've got whether a sentence makes sense or not, which is your CoLA; and you've got semantic similarity. So essentially that's what's going on over here. MMLU is one of the more popular benchmarks right now, and it covers a large number of knowledge-intensive tasks, as you can see over here. And of course there's SuperGLUE, the second generation of GLUE, which has questions that mimic human behavior more, or things that are a bit trickier for models to understand. So that is the part on evaluations.

On the different kinds of applications, I think we've seen many kinds. Beyond what's in the list, we also see things like music generation and video generation. Naturally, for each of them there are certain guardrails that need to be put in place. What are some examples? For a music generation model, it is important to ensure that when we submit lyrics for the model to output, these lyrics aren't under any copyright; otherwise there might be legal consequences. So, depending on the domain that you're in, you will be looking at models for that very specific domain.

Finally, the last part before we go into Q&A: what are some of the things that we see models exhibit? Biases are very common and stereotypes are very common, and I guess the reason is the training data that they see. If the training data exhibits a certain
behavior, naturally we see the model exhibiting this behavior. So that's one of the things that we want to be aware of. There are also things like models memorizing private content. Let's say I've got a GPT model and I type in a particular prompt, and this GPT model has seen some email, and then it outputs some sort of phone number that is supposed to be private. The idea is that this is the output from the model, and you can see there's some information here that might be private: you might have a phone number that's not supposed to be exposed to the public. Then maybe someone searches for the phone number, and there you might have an additional contact that you can use. So these are some of the things that we want to be aware of when it comes to the component about human alignment. On top of the three H's (being helpful, being harmless, and being honest), you also want to ensure that your models do not leak, or do not learn, certain private information. Generally there are teams behind all these ways of conducting adversarial attacks; you can call them white-hat attacks, or what people like to call red-teaming these models. They essentially try to generate adversarial prompts, or find ways such that the model will leak something, and if they're able to do so, they will fix it. I think there have been a few interesting articles about that recently.

So essentially that is the paper. It sounds like a firehose of information, so if there's any topic you want to deep dive into, feel free to take a look at the paper, or take a look at this, and go into the topics that you're interested in. If, let's say, I want to do something on parameter-efficient tuning, feel free to just go into
the segment. I've linked all the papers over here, and I've also linked some of the external sources that have been useful for me. So feel free to take this as a reference guide for yourself. With that I've come to the end, and I'm leaving about 10 more minutes for any Q&A. So, Ian?

Yeah, dude, thanks so much for giving such a detailed walkthrough. I think there was a question by Honan in the chat about parallelization: what exactly is the benefit of using a Transformer versus, I guess in this case, an RNN? Do you want to start with that, how the parallelization works?

Let me just take a look. Okay, let's look at this example over here, one second. So the idea is, if you think about the traditional RNN, let's say I've got a sequence of 10 tokens and I want to calculate the hidden state of the entire sequence, in this case the hidden state of the 10th token. There is a dependency on the ninth hidden state, and the ninth hidden state depends on the hidden state before it, and so on and so forth. Essentially that's what's going on over here: if I want to calculate the hidden state of the second token in the sequence, I need to calculate the first hidden state as an input. That goes back to either your RNNs or LSTMs, where the input to the hidden state is the hidden state of the previous token and also the input token. Because of this dependency, where future hidden states rely on the previous hidden states, there's no ability to parallelize from a sequence perspective, or from a wall-clock perspective. And therefore you see the first line:
forward and backward passes have O(sequence length), which means that however long your sequence is, you have to do that number of calculations. Does this make sense?

I think it makes sense to me. At least the way I like to think about it is this: let's say I had five sentences and they're not the same length. In order to get the final hidden state before I can start evaluating its predictions, I need to run, like, five passes, for each character in each sequence, or each token in this case. Whereas with a Transformer I can just pad everything to the same length and pass it through in one time step, so I can get everything out in one forward pass. At least that's my understanding of the parallelization.

Yeah, that makes sense, I agree. I would say for this diagram, we think of it during the training stage. Naturally, during the inference stage there's still this need to pass the hidden state of the current token back into the Transformer, the model, to get the next token.

For sure, for sure. I was thinking more about the training issue, but in terms of inference you incur an additional cost with the Transformer for each additional token that an RNN doesn't. Actually, one of the questions I had about the classification in this paper was prefix versus full language modeling. If you look at the example that they give in the text, they have this cute little example: if it's full language modeling, they give the word "May" and then you output "the Force be with you"; if it's prefix language modeling, it's "May the Force" and then the model has to predict "be with you". But both seem like the same thing, because my understanding of prefix language modeling was that we specify a specific token, for example bracket, "classify", bracket, like sentiment, sort of like T5, and the model learns that if it sees this
specific prefix, then it should perform differently. So that was why I was a bit confused by this specific paper.

That makes sense. I didn't look at that part of the paper in particular, so it's a little bit hard to comment on it. I understand when you say that this and this really don't show a lot of difference. What I can comment is that generally, in full language modeling (this is of course the encoder-decoder side of things, beyond the GPT stuff), you generate everything. In fact, in this case you might just start with a beginning-of-sentence token, take some hidden state, and then autoregressively sample from there. That is different from prefix language modeling, where you are given the beginning-of-sentence token, naturally, and then a series of tokens before you do your generation. And of course when you do your learning, you are learning based on that particular sequence of text, more than just the beginning-of-sentence token. I'm not very sure; for this one we've got to take a look at the paper to fully understand.

It was also the guy who was the author of the T5 paper, I believe.

Oh really, the guy who did this paper?

Yes, I think it's Colin. But go check.

Yeah, I think we can talk about this some other time. This was just something that confused me a good amount. I guess the other thing that surprised me was learned positional encodings, because when we covered the original Transformer paper, I think there was a section where they said, oh, we experimented with learned and frozen positional encodings. But it seems like, as you mentioned, newer papers are starting to use learned
positional encodings instead, and it's shown an increase in performance. I was wondering what sort of change, in your opinion, made this happen, if that makes sense.

To be very honest, I'm not very sure what the changes were that inspired it. Maybe the way I would comment is that once they are able to do so, they are able to efficiently represent an input with a much longer context window. So probably what happened was that there was innovation in that space. If, let's say, I've got 500 tokens or 1,000 tokens, there might be a limitation on how you model the positions, because the positions might all be clustered in one area. But I think once they figured out how to do so, that's when they opened up the window to longer contexts. So how they learn positional encodings might be one of the tricks that they use to get longer context windows. But again, I might be wrong; I didn't really go into the details of this part of the research.

Yeah, for sure. I was just wondering because that was something I was intrigued by. I think we're almost at time, and if anyone has any other questions you can drop them in the chat; if not, maybe we can just end it here. Okay, it seems like there are no more questions. So, moving on to next week's paper: I was thinking of doing the DeepSeek paper. That was one thing I'd like to propose, because I thought it was super interesting, and there are a whole bunch of ideas that they're experimenting with: always-on experts, randomly routed experts. So I thought it's a good paper. As usual, if anyone wants to present the paper for the upcoming week, happy to help you with it. I think you generally learn a lot
more when you actually do the paper. I learn at least 10 times more if I actually have to sit down and present the paper. So as usual I'll probably just drop a thread inside the Paper Club, and if you guys have any other papers that you'd like to suggest, you can add them to the thread and we can all vote on them. Do you have any papers in mind, Brian? Does anyone have any other papers that you want to read?

I'll take a look. There are some, I would say, very open-source models, so we'll see; maybe one day next month I can take a look at them.

Okay, cool, sounds good. Otherwise, thank you so much, guys, for tuning in to today's session. Really appreciate it, and looking forward to next week.

See ya. Thanks, everybody. See you guys. Bye-bye, have a good evening.
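
Note on the LoRA arithmetic from the talk: the 8x8 example (64 weights replaced by an 8x2 and a 2x8 factor) can be checked with a few lines of plain Python. This is only a minimal sketch of the parameter counting, not a training loop; the helper names `matmul` and `lora_param_counts` are made up for illustration, and real LoRA setups apply the same idea to much larger weight matrices while keeping the original weights frozen.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_counts(d_out, d_in, rank):
    """Return (full, low_rank) trainable-parameter counts for one weight update.

    full     = d_out * d_in              (update every entry directly)
    low_rank = d_out * rank + rank * d_in (update only the two small factors)
    """
    return d_out * d_in, d_out * rank + rank * d_in

random.seed(0)
d, r = 8, 2  # matrix size and rank from the talk's example; rank is the hyperparameter

# Hypothetical factor values; only their shapes matter for this illustration.
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # 8x2
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # 2x8

delta_W = matmul(B, A)  # the product is a full 8x8 update
full, low = lora_param_counts(d, d, r)

print(len(delta_W), len(delta_W[0]))  # shape of the reconstructed update
print(full, low)                      # trainable parameters: full vs. low-rank
```

With rank 2 the two factors hold 16 + 16 = 32 trainable values instead of 64. The saving grows quickly with matrix size, since the full count scales with d_out * d_in while the low-rank count scales with rank * (d_out + d_in).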

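Note on the parallelization question from the Q&A: the sketch below contrasts a toy recurrent update, where each hidden state needs the previous one, with a toy causal attention output, where each position is computed from the inputs alone. Everything here is an illustrative assumption (scalar "tokens", the weights `w_h` and `w_x`, the helper names `rnn_states` and `attention_at`); it is not the architecture of any particular model.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, states = 0.0, []
    for x in xs:  # step t needs h from step t-1, so this loop is inherently serial
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_at(xs, t):
    """Toy causal self-attention output at position t, using only the inputs."""
    scores = [xs[t] * xs[j] for j in range(t + 1)]  # attend to positions 0..t
    m = max(scores)                                 # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * xs[j] for j, w in enumerate(weights)) / total

xs = [0.1 * i for i in range(1, 11)]  # a sequence of 10 "tokens"

h = rnn_states(xs)  # 10 dependent steps, one after another

# No position depends on another position's output, so the order is irrelevant
# (on parallel hardware all positions could be computed at once).
forward = [attention_at(xs, t) for t in range(len(xs))]
backward = [attention_at(xs, t) for t in reversed(range(len(xs)))][::-1]

print(len(h), forward == backward)
```

As noted in the discussion, this independence is what helps during training, when the whole sequence is known; at inference time generation still proceeds one token at a time.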
Original Description

First recording of the Asian-timezones Paper Club! We meet once a week on discord to discuss a chosen paper. You can find our events here: https://lu.ma/ls We covered the paper "A Comprehensive Overview of Large Language Models" and walked through a high level overview of all the biggest developments and changes in the LLM Space over the past few years. Paper link: https://arxiv.org/abs/2307.06435 Paper abstract: Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.

The video discusses a comprehensive overview of large language models, covering topics such as the history of attention, post-Transformers era, and the GPT era, with a focus on paper reading and research methods. The discussion involves various concepts, including conditional language models, encoder-decoder architecture, and attention mechanisms. The video is suitable for advanced learners who want to deepen their understanding of LLMs and stay current with the latest research in the field.

Key Takeaways

Read and understand the paper on large language models
Analyze and discuss the content of the paper
Apply research methods to study large language models
Design and conduct experiments on LLMs
Understand the basics of large language models
Apply LLM foundations to real-world problems
Design and implement large language models
Optimize and fine-tune LLMs for specific tasks

💡 The video highlights the importance of understanding the latent space of large language models and its representation of the input text, which captures its semantic meaning and is used to improve the performance of LLMs in various tasks.
