[Paper Club] On Reasoning: Q-STaR and Friends!
Key Takeaways
The video discusses the Q-STaR paper and its friends. STaR proposes a bootstrapping mechanism that creates a rationale dataset from a few initial examples without needing to check the new rationales' correctness: it sets up a positive loop in which rationales that lead to correct answers are treated as better than rationales that lead to wrong answers. The paper also uses rationalization to accelerate and improve the bootstrapping process.
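To make the bootstrapping loop in the takeaway concrete, here is a minimal sketch of one way it could look. It is not the paper's code: `ToyModel`, `Example`, `star_iteration`, and `star` are hypothetical names, the "model" is a lookup table that only pretends to be fine-tuned, and restarting each round from the base model is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Example:
    question: str
    answer: str  # ground-truth answer, used only for filtering and hints


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model (not a real LM API)."""
    known: dict = field(default_factory=dict)

    def generate(self, question: str, hint: Optional[str] = None):
        """Return (rationale, answer). With a hint, reason backward from it."""
        if question in self.known:
            return self.known[question]
        if hint is not None:
            return (f"Working backward, the answer should be {hint}.", hint)
        return ("I am not sure.", "unknown")

    def finetune(self, data):
        """Pretend fine-tuning: memorize kept (question, rationale, answer) triples."""
        return ToyModel(known={q: (r, a) for q, r, a in data})


def star_iteration(model, dataset, use_rationalization=True):
    """One pass: generate, optionally rationalize failures, filter by ground truth."""
    kept = []
    for ex in dataset:
        rationale, answer = model.generate(ex.question)
        if answer != ex.answer and use_rationalization:
            # Rationalization: retry with the correct answer supplied as a hint.
            rationale, answer = model.generate(ex.question, hint=ex.answer)
        if answer == ex.answer:
            # The hint is not stored, so training looks as if no hint was given.
            kept.append((ex.question, rationale, ex.answer))
    return kept


def star(base_model, dataset, n_iterations=2):
    """Repeat the pass, fine-tuning on the kept rationales after each round."""
    model = base_model
    for _ in range(n_iterations):
        kept = star_iteration(model, dataset)
        # Assumption in this sketch: every round fine-tunes from the base model.
        model = base_model.finetune(kept)
    return model
```

Both paths (keep correct rationales, rationalize the failures) run inside a single loop over the dataset, which matches how the transcript describes the paper's pseudocode.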
Full Transcript
and I've lost many recordings to that. Okay, all right, I'll just go ahead. So STaR is a 2022 paper. I was surprised to see that it's actually basically just one guy's work. Eric now works at xAI, and this is his website if you want to go see it. He is responsible for the first two papers. We're doing STaR, we're doing Quiet-STaR, and we're doing V-STaR, mostly just because they have "star" in the name, but also because they seem to be the most mentioned by the people that were throwing around all these survey papers and stuff. Okay, so STaR. I believe, Eugene Yan, you've already covered this before, but I think this is the most foundational and the oldest, so I liked it the most, I think, and then I like V-STaR second and Quiet-STaR the least. So the general idea of STaR is that we have this bootstrapping cycle of creating a rationale for each answer. When a question is asked of a language model, it is trained to think of a rationale before answering. So it's basically a form of chain of thought: before you just spit out the answer, you think a little bit first. So it's very related to the chain-of-thought literature; I think it's very related to Orca as well. And the interesting thing is that they establish a positive loop. They generate a bunch of candidate rationales, basically, and the rationales that lead to correct answers are viewed to be better rationales than rationales that lead to wrong answers, and that leads to a fine-tuned language model. There's a second loop, where if the rationale leads to a wrong answer, they can generate a rationalization. These two words are pretty similar, rationale and rationalization; they're two different words as far as the paper is concerned. And the rationalization, if it does lead to a correct answer, then also gets fed back in, so that
wrong information is captured a little bit. We'll see later that there are actually ways to do this better that the original STaR paper missed. So, methodology: "propose a bootstrapping mechanism to iteratively generate a rationale dataset from a few initial examples with rationales, without needing to check new rationales' correctness." I really like these kinds of ideas, where you can bootstrap from a small set of data and you don't really need to check the new set of data, because that enables you to scale pretty massively. "We complement rationale generation with rationalization, where a model is tasked with justifying an answer and then fine-tuned as if it had come up with the rationale without any hint." So I think I have a slide on this, okay, because when I was doing the live stream, people were asking me: what is this, is rationalization the same thing as rationales? Basically it's not; it's kind of back to front. So it's very common for language models to get kind of stuck in cycles, where it just answers the same question; it gets kind of stuck in a loop. So to overcome this issue, they propose rationalization: "for each problem that the model fails to answer correctly, we generate a new rationale by providing the model with the correct answer. This lets the model reason backward." So, okay, you failed to take the correct path, right, and here we're doing only the positive fine-tuning. The next thing we do is we give a hint by giving it the answer, and then backwards-rationalizing what rationale would have led to the right answer, and then throwing that into the dataset for fine-tuning. "Rationalization accelerates and improves the bootstrapping process." So we have a little chart here showing STaR with and without rationalization on an addition problem of n digits, and showing that with rationalization it actually gets
training a lot faster than without. So this is a pretty effective idea, I think, and we'll see how to improve on that later. I copied this out mostly because this is basically very nice pseudocode. I don't like the way it's presented, but I think, for formalism, this makes sense for some people. I don't really have any other comments on this, apart from this: when I originally read the paper, when it was presented like this, first we have the positive bootstrapping, then we have the negative rationalization, it seemed like two loops. It seemed like the first loop is we'll fine-tune on correct answers and the second loop is we'll correct wrong answers. But the algorithm that they present in pseudocode does everything in one loop. So you split the code path: you generate rationales, maybe you rationalize, you filter rationales using ground truth. So they're performing both loops in one pass, which seems very efficient for some reason. I don't know, it's not exactly how I would do it, but this is probably more efficient. Okay, because it's a 2022 paper, it actually started with GPT-J 6B, so a super small model by modern standards. They had a few datasets: math, CommonsenseQA, and GSM8K. I don't really have any quarrels with this, I don't think. I wish the descriptions were better. They referenced Jason Wei's paper on chain of thought, but they didn't actually show the kind of few-shot examples that they were doing. So in some sense this is a badly written paper, in that it is going to be very hard to reproduce, because they did not show the full sample of what they did. They did have some transparency on some of the questions. So this is fun, because there are some audience participation points. All right, I'll just ask you the
question without showing the answer. All right, so here's the task that demonstrates the value of Q-STaR and the subtle nuances of what you're being asked to do, that you might take for granted. Okay, so here's a question with multiple choice and three possible answers. So I want whoever's listening or watching... oh, Jimmy, you're here. Hey Jimmy, sorry, I said you weren't active anymore; that was a lie. Okay, so: "On their hike they brought a filtering straw, they were worried about germs in the what?" Answer choices: make sick, doctor, water, stream, and mouth. So the correct answer is water, right? If they brought a filtering straw, they were worried about germs in the water. We as humans know this. Now the question is how to teach the machine to reason its way into understanding that water is the right choice. And you don't want to just get the right answer with the wrong rationale; you want to have the right rationale as well. So for example, answer A: "The answer must be something that can filter out germs. Filtering straws are used to filter out germs. Therefore, the answer is filtering straw (C)." This is wrong, right? They got the right answer, which is C, C is the right answer, but it's wrong reasoning, because when you say "therefore the answer is filtering straw," that is the wrong reason. B: "The answer must be something that would cause someone to bring a filtering straw on a hike. Filtering straws are used to filter water. Therefore, the answer is C." This is a good reasoning trace. The last answer: "Straw is", there's a typo here, "straw is something to use to drink water. Therefore, the answer is water (C)." Right. So which is the best? Right, so Eric says... Cosmin, yeah, the slides are in the Discord. Eric says the answers C and D overlap. They do overlap. I think the more intuitive thing, this is a very classic NLP entailment thing: what is
the more likely answer? The more likely answer is water, because there's no assumption of stream; stream is more specific than water. So when in doubt, pick the more generally probable answer. So anyway, the human-rated task is: what is the best actual answer that you want, if you're trying to train for a dataset that has a reasoning choice, right? Is it A1, A2, A3? And it's actually A2, right, because A3 jumps straight to the answer, and A1 has the right answer but for the wrong reasons, or it has flawed data. So this STaR paper actually used human raters to choose between answers that were correct, and I think that's an unusual way to use human raters. Usually you use human raters to choose correct answers from wrong answers, but here the human raters are being asked to evaluate reasoning and the quality of reasoning. Eugene Cheah says: is there an issue with using all three? What do you mean? In the context of training, like, why can't we just train on all three, as long as they are good enough? Well, this is bad. A1 and A3 are bad. A1 has faulty reasoning; A3 has not enough reasoning, right? So A2 has just enough logical flow. It's kind of super basic, probably too verbose, but you cannot argue with any of the steps. So if you're to fine-tune a reasoning model, A2 is the kind of dataset that you want, and the STaR authors employed human raters to find this. So, okay, I'll give you a little bit more on the details here, but when the human raters were given this, they were all randomized. So imagine just going through and picking A1, A2, A3, A1, A2, A3 for like 50,000 answers. It was very laborious, but it's kind of fun. Okay, this one is another one that's kind of fun. Again, I'll just run it through: the human always would have fun
making up questions for the AI overlords, he found the task quite what? Answer choices: do enjoy, eat cake, enjoy living, get laid, enjoyable. I think it's worthwhile going through these kinds of questions, you know, the key catchphrase: look at your data. And when you look at your data, you really understand how inane and mind-numbing, but also nuanced, some of these choices are, right? So what is the right choice? I had trouble practicing this. The human always would have fun making up questions for the AI overlords, he found the task... and this is also meta, because it's making up questions for the AI overlords. He found the task quite what? He found the task quite "do enjoy"? No, that's not grammatical. He found the task quite "eat cake"? No. He found the task quite "get laid"? You know, I said that D is the answer that I wish would happen: if I answer enough questions for the AIs, I'll get laid. But actually the answer is E; that's actually, I think, the most grammatically correct answer. So this is actually a grammar question rather than anything else. So again, A1: "The answer must be something that human would enjoy doing," blah blah blah, "therefore, the answer is enjoyable." So they all got the right answer, but they all took different paths to get there, right? The last answer: "Having fun is enjoyable. Therefore, the answer is enjoyable." And B: the answer must be something that the human found enjoyable, making it enjoyable. So you can see this is very laborious; everyone's kind of reading through it. And at the end of this whole thing, the big reveal is that this is chain-of-thought, un-fine-tuned GPT-J. So in the presented results, and the paper has a few dozen of these by the way, the first answer is always GPT-J, un-fine-tuned; the last answer is the human entry, a human answering it. And you can see, the human reasoning is... humans are really bad
at showing rationales. They always just jump straight to the answer, which is really funny. I'll show you a counterexample at the end where this is the opposite. And then B was the STaR answer. STaR, generally, was fine-tuned to show reasoning for any task very, very well, so I thought this was very impressive. I'll go for one more example in the reasoning domain. This is a math question. So we're jumping from simple logic, so this is super simple. I have to stress, we are so early in teaching language models to reason. If this is the pinnacle of reasoning, right, this is like not [ __ ] reasoning at all as far as my IQ test is concerned, but as far as GPT-J 6B is concerned, they're good on this; we cannot take this for granted. Okay, here's math reasoning using natural language, right: Natalia sold clips to 48 of her friends in April, then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Okay, does anyone want to solve this? Please feel free to think out loud and jump on the mic. I want to make this interactive; I don't want to make this like a lecture. Is it 72? All right, so 48 + 24, 72, right. Okay, next question: Betty is saving money for a new wallet, which costs $100. Betty only has half the money that she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as the parents. How much more money does Betty need to buy the wallet? Can someone do live chain of thought while solving this? Eric? Sure. So Betty needs $100. She only has half of the money she needs, so she has $50. Mhm. Her parents decided to give her $15 for that purpose, so in total she has $65. Her grandparents give her twice as much, so twice as much as 15 is $30, and 65 plus 30 is 95, so she needs five more dollars to reach 100 for the wallet. Perfect. So the answers are 72 and five, and both of you gave a little bit of chain of
thought. Would you be surprised that a language model can do that? So these are the generated answers of STaR, showing what Eugene said and what Eric said. I thought it pretty cool. I actually would not have put it... this is a 6B model? Yes. That's pretty good, actually, to do this level of articulation. It is also just that flexible natural language understanding of, whatever, just throw it in there, and it got it. And it was really just the fine-tuned rationalization, thinking step by step, but not just the lazy kind of thinking step by step; like, what do I need to know first, what do I need to know next, how they combine those pieces of information, how they copy, how they calculate. It's really good. The paper has quite a few dozen examples of this going on. This was after fine-tuning, right, or before? It's after fine-tuning. Well, that's actually kind of impressive. They have an n here on how many times it's fine-tuned. They don't specify the n; I looked for that number. I don't think it's that many; I think the max is like 20 or 30 iterations, not a ton of iterations, but the n is a flexible hyperparameter that they used. So this is a two-step problem; this is a three-step problem; there are problems going all the way up to like eight steps, which is pretty impressive. So that lets you generate a chart of human versus machine reasoning. So Eugene and Eric, in answering those two questions, produced two steps, and then we can also compare against the model-produced number of steps, and there's a correlation of 53 to 57%, in the sense that when you give them a GSM8K question, STaR tends to think very, very human-like. Obviously it could be a lot better than 53, but it's surprising that there's a general correlation at all. And I think basically this is a way of understanding reasoning in a structured format that I thought was
insightful, that I had not seen before. Because once you can do something like this, where I can give you a measurably harder problem, because if I give you an eight-step problem it's a harder problem than a two-step problem, if I can give you a measurably harder problem and I can roughly grade the calculator on its ability to get there, then I can improve it. So I thought that was pretty cool. So I think I'm about to finish this paper. There are some cases where the dataset was actually bad, or GSM8K was bad. Here's an example of a really stupidly confusing question: a van is delivering 180 bottles of drinks to a neighborhood; each bottle contains either cider, or beer, or a mixture of the two. Out of the 180 bottles, 40 contain only cider, 80 contain only beer, and the rest are a mixture of the two drinks. If the delivery man gives half the number of each bottle of drink to the first house, how many bottles does the first house get? So there's a lot of random context, but actually it's asking you to divide 180 by two. So the human gave this and STaR gave this. See, that's the human, like, whoa, okay, the human has read all the way to the end. So this is good out-of-domain generalization, in the sense that, we all know datasets have errors, so STaR improved on the human. It's good out-of-domain correction of bad data inside of the dataset. So it's kind of nice; STaR understood better than the human, which is really, really interesting. So, you know, I think the relevance here for o1 is that if we were to generate reasoning traces, we would have to do work like this, where the rationale would have to be exposed into step-by-step thinking, and we would have to grade it in a way that makes sense, right? So that's my TL;DR. Any questions on STaR? Yeah, I have a question about the... you said it was a dataset of 50,000, did I hear that right
earlier? Yeah, generated a dataset. It was literally like 1 + 1, 2 + 2, 3 + 3, 11 + 11, 111 + 111, you know, stuff like that. So synthetic data, it was all... Sure, but it's even shitty to call it synthetic data, because there are no LLMs involved. It's just math, you know; it's a for loop. I see. I mean, what's great about math is it's very cheap to generate, and we absolutely know the right answer. The math stuff lets us do things like this with a very, very high degree of certainty, right? We had a little debate in the Discord yesterday about n-digit summation. So this is about adding one-digit numbers together; it learns it very quickly. Adding two-digit numbers together takes a bit more time. Adding five-digit numbers takes the longest time, but it eventually learns it as well. Yeah. Do you feel that, I think right now STaR, and I think maybe V-STaR actually goes along this track, do you feel like this is limited, at least in the fine-tuning stage, to math and code? I know they use GSM8K, no, I know they use Q&A, but it's limited to where we have to have the correct answer, some correct answer, whereas with math and code you can infinitely generate as many as you want, sort of like what ReST-EM did. Right. What's the question? It's like, do you feel like this can generalize beyond solely that? I do think so. Like, maybe there are things like subjective answers, like maybe, yeah, this is not math or code, you know. Yeah, that's the thing; this is the one thing whereby the answer is very objective. Okay, yeah, it's very objective. I wonder if this could generalize to relevance. Maybe if I'm searching for an iPhone, should I be showing an iPhone, or showing an iPhone case, or a new iPhone, or showing an iPhone that you cannot order but it says pre-order? It's things like that, where it's a little bit more general. I just wonder how that will work,
but maybe there's no answer to this as well. Work left to future readers, I'm sure. But this is very impressive for a 2022 paper. It is, because it is obviously, you know, something that we need to do. Okay, well, there are some questions; I think people have been talking in the chat. Alex says he would be super interested in seeing how the rationale traces end up in an analysis like in Scaling Monosemanticity, if there's a particular inner function that empowers using a language-defined world model for rationalization. Yeah, let us know when you do. Eugene Cheah, what is this? Oh, from BlinkDL as well. Yes, yeah, it's a small sub-1B model that's able to do math operations. It's along the same idea of, like, we generate the basic math operations and we just train it, and it works, with this giant humongous chain of thought for the multiplication and summation. The crazy thing that we did was that we inverted the numbers during the calculation, and it seems to work better. What is inverted? So instead of, you know, 1,200 being 1, 2, 0, 0, it does a chain of 0, 0, 2, 1. Oh, I mean, that I can back-rationalize, because when we do addition we do it from right to left, right? Correct. And we generate from... wait, we generate from the first digit to the last digit. I'm mixing up my right and left right now. Yeah. Okay, interesting. Yeah, but I think the highlight is how small the model is. Yeah, I mean, so... yeah, sorry. Does it do natural language questions, or does it only do...? No, this is just a pure math model. You know, I think here you're basically just testing universal function approximation, and we know it does that, so, you know, that's all I got. What is the tokenizer for RWKV? Do you tokenize each number separately? Oh, this is a character encoding; it was a toy model that we experimented on. Right, right, that makes sense. Okay, yeah, of
course, of course it'll do it. Nice, nice proof. Anything else? So it already works at this scale, right? It's just like adding that layer, then you're able to do the rationalization, then everything will be a chain of thought. That is my point of view on why even 8Bs can actually do decent chain-of-thought math. Yeah, yeah. Someone is talking about the medical domain. I guess, you know, it remains to be seen; somebody needs to try it. But I think you can basically just take the methodology from here, about rating the answers and all that, and feeding it into the fine-tune, and it'll probably work. Aditya says postfix notation plus inversion sounds smart for reasoning traces. Okay, yep, agreed. Andre says: can this kind of fine-tune be applied to any general model, like Llama 3? Yeah, of course, absolutely. This is a method that is general, and I would be very surprised if they did not use this for o1. Okay, moving on. Quiet-STaR. I am about to [ __ ] horribly on this paper, because it was a waste of time. So this is the same author as the STaR paper, this guy, two years after the original STaR paper, and basically he's trying to extend it. He's criticizing himself and saying, like, we inferred rationales and learned from those that lead to the correct answer; this is a highly constrained setting; ideally, a language model could instead learn to infer unstated rationales in arbitrary text. So he starts to have this idea of internal rationales and external rationales. All of these rationales are externalized, in the sense that you can see the chain of thought that's going on in here. Now he basically read the pause token paper and wanted to apply it to STaR. So: we present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token. This is crazy; this is ColBERT-level crazy, of like, why you just throw at every
single token... what would happen there? So the problem with generating chain of thought at each token, obviously, is that it costs a lot. The LM doesn't know how to do anything with those internal thoughts as well, and you also need to look ahead a little bit more than just the next token. So they have a parallel sampling algorithm. I don't super 1,000% get it, but they have a really nice graphic, which I'm going to show now. So given a text with token, token, token, token, token, he's just really trying to show that you predict, you know, in a very sort of speculative-decoding way, in parallel, but then you add a start-of-thought token, you generate a bunch of parallel thoughts, maybe a bunch of tokens in each of these things, only up to 12 tokens by the way, you end the thought process, and then you cut out whatever doesn't work, and then you generate the next set of tokens. This GIF is all there is. I wish the animation was better, but this is all he has, and he has a bit more of predictions here. So basically it's like you have... let me see if I can get you this better sentence. He has this kind of graphic, which is no help at all, but basically it's kind of like token, token, token, and you can generate thoughts for each token but then also continue the other tokens in process as well. I feel like there's something to this idea, but it is very hard to communicate. But I think the way that I would explain it is you have to read the pause token paper first, this one. So maybe I'll rearrange this slightly; I'll just say you have to read this one, then you go over here. Okay, so let me get there first before I go too hard on this. I do like the way that he introduces his papers
though, because it really helps you focus on what he thinks is novel about his paper, right? So with the STaR paper, he offered these four things; he said these are the four things. For me, I personally highlight the first two, because the last two are just evals. Here he's highlighting six things. I think I would highlight maybe the first three, maybe the first four, as relevant, and honestly I think three is the main idea. So he's basically saying Quiet-STaR generalizes STaR to learn reasoning from diverse unstructured text data; to our knowledge, this is the first work explicitly training LMs to reason generally from text, rather than on curated reasoning tasks or collections of reasoning tasks. So this is very relevant to, what's that guy, whoever asked about generalizing beyond math. I think Eugene asked about it. Everyone wants that, generalizing beyond math, right? This is a way to do it. I'm not sure it is scalable or usable, but it's a way. The second one is parallel sampling; it's the graphic I already showed you. It's a parallelization technique, nothing more. Third, we introduce custom meta-tokens; this is what we'll dive into next. Fourth, we apply a mixing head to mix the next-token prediction from the thought into the current next-token prediction. So it's a little bit like, I don't know, speculative chain-of-thought decoding or whatever. Fifth, a non-myopic loss, including multiple tokens ahead. So there's a look-ahead effect, which we'll cover later. And then six, there's a bit of evals. Okay, so is everyone familiar with the pause-before-you-think paper? "Two is the main idea; without two you pay high cost, with two it comes for free." RJ says two is the main idea. I think it's the main idea if you have the full GPU and it doesn't take out the full GPU, I guess. I don't know, because, like, I understand autoregressive sampling, like you're batching
everything, right? RJ, is that what you're saying, that the efficiency comes from batching? My take is sort of... and, you know, I agree it was hard to understand, but I thought that the attention mechanism, where you already have these tokens in the... I mean, I don't think it's even related to the batching per se, because you already have the whole sequence for that one piece of text, and you're just masking some of it out with the attention mask. So I think they're using the portions in the unused attention mask to... Yeah, this diagram. I think so. I think they're taking advantage of the areas that would have been masked out by the attention mask and doing inference in that region, and therefore it comes for free. That was my... On the same hardware? Yeah, yeah. Okay. I mean, that's cool, I guess. In normal inference economics, I don't know if, like, if I was an API provider, I'd be able to do this for my customers. Yeah, that's unclear to me too, I guess. I would want to dig in, but my question would be: if what I'm saying is actually the case, then why isn't everyone doing this, right? Because it seems kind of like a little bit of complexity and a lot of benefit, or, you know, maybe not huge but at least marginal benefit. I mean, correct, it's not worth it. Yeah, that's... I guess, yeah. I thought Claude is doing the thinking-token thing. It is simulating thinking tokens; I don't think anyone actually believes that it is actually doing thinking tokens. Yes, that would be my stance. Why not? Could you explain a bit what they would do instead? They are prompted. Does anyone have the Claude Artifacts system prompt? There we go. They are prompted to include antThinking tags, and then the UI manually removes the antThinking tags. So this is not a thinking token; this is a prompt. Yeah, but I think it's just a question of the tokenizer, because
if the open bracket, antThinking, close bracket is a token itself, then it is a thinking token, technically. So, sure, but thinking tokens in the context of this research, both Quiet-STaR and the actual thinking tokens paper, treat their thinking tokens very differently. They are never emitted. So that would be my two cents on that. I understand; I don't know, it may be a distinction without a difference. Yeah, I suspect it might be that case. I'll try researching down this; I have a separate chain on this. Sure, great. I would also recommend people to... there's a backspace token paper. The backspace token paper is not called the backspace token paper, but anyway. There was kind of one observation in the wild on 4o, where, Yann LeCun always says, you know, autoregressive LMs, once they start on the wrong path, they will continue down a wrong path, and ChatGPT actually, for the first time, displayed its ability to self-correct in the middle of its own thinking, and that's kind of cool. So this generated a little bit of discussion as well. We don't know if it's able to backspace or search, or maybe it just got lucky. But there was an interview with John Schulman where he was mentioning models correcting themselves, I think on Dwarkesh, yes, and he mentioned that, if I remember correctly, they just put like 30 examples, I think in pre-training, where you would have some discussion like "I'm solving this problem, oh, I'm wrong," and then fixing it. And he said that just having a few of these examples actually allowed the model to learn this ability to kind of, okay, double-check their meaning. There's also some theoretical work by a researcher at Meta looking into the deep internals of Transformers, and he says the models kind of get in some state where they kind of know they're wrong, but if you don't train them
to explicitly say they're wrong, then they keep going. I'll post it in Discord. The whole talk by Zan is really interesting, but in this particular part, it seems like you can fix some of the facts that are wrong by just allowing them to say, "Hey, I'm changing my mind, let me explore this other path." I thought it was relevant to your point here.

Thanks. Yeah, I'd be interested in the second one. I don't have a link for that, but if you find it, drop it in the Discord, I guess.

Okay, I've got to plow along because we've got nine minutes left. What else can I say about this? There are three stages: think, talk, and learn. I tried to turn this really dense algorithm into pseudocode that is a bit more accessible. There's a lot of parallel thinking, there's a lot of adding the thinking tokens, and then there's a bit of the mixing and updating model params with teacher forcing. Thinking we already talked about. Talking is an MLP, a three-layer MLP. I don't think it's super insightful, apart from the fact that you don't want the thought token to always influence the next state; you want to introduce a mixing layer in the middle.

So on this, when they say mixing with and without thoughts, what does it mean? Is it to mix in the original output without the thought, and also mix in the thought plus the expected output?

Yeah, with and without thoughts. I think the demonstrated thoughts are very, very small. This is not a truncation just for the graphic; this is actually the thought. They show that it's very short, 16 tokens, right? Like eight to, is it 12 or 16? Oh, 24, sorry. So the amount of thinking ahead is actually not a lot. Okay.
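A minimal sketch of the "mixing with and without thoughts" step discussed above, assuming the mixing head produces one scalar score per position that is squashed into a weight and used to interpolate between the next-token logits computed without the thought and with it. The function names and the toy logits are hypothetical; in the paper the score would come from the small MLP mentioned above, which is not modeled here.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw gate score into a mixing weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_logits(base_logits, thought_logits, gate_score):
    """Interpolate next-token logits computed without and with a thought.

    gate_score is a stand-in (assumption) for the output of the mixing MLP.
    """
    w = sigmoid(gate_score)
    return [(1.0 - w) * b + w * t for b, t in zip(base_logits, thought_logits)]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary: the thought shifts probability toward token 1.
base = [2.0, 0.5, -1.0]
thought = [0.0, 3.0, -1.0]

for gate in (-50.0, 0.0, 50.0):
    probs = softmax(mix_logits(base, thought, gate))
    print(gate, [round(p, 3) for p in probs])
```

With a very negative gate the prediction stays at the base model's distribution, which is one way to read the point that the thought should not always influence the next state.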
Okay, so, yeah, I don't know if this one makes sense. They had a bit in the paper talking about how they're trying to do thinking ahead on every single token, but they also recognize that it's probably not useful to think on every single token, and you probably want to trim it. So I think the MLP is just for filtering out.

Yeah, what is curious to me is why they had to mix it. Why not just use the one that has the thought alone? I guess maybe the one that has the thought alone is too far from distribution, and therefore they had to mix it, something similar to what you would do with KL divergence. That intuition I couldn't get when I was reading the paper yesterday, but maybe I'll work through it again. And thanks to your suggestion on the pause-before-you-think paper, there was something I was missing.

Obviously, I feel like he just read this paper and was like, "I also want this too, but I'll do the STaR version."

I see, I see.

So, I mean, it is nice. It produced some really nice charts where you can extend the thinking ahead and you see that accuracy improves. So having a nice tunable hyperparameter for thinking ahead lets you tune up your accuracy, which is kind of cool. Obviously the cost will increase. He did this on Mistral 7B, with OpenWebMath and C4. So people were asking why everyone isn't doing this: well, the improvement isn't that much. It's like 10% on CommonsenseQA and 5% on GSM8K. Cool. I don't know if I'm supposed to be impressed by that.

Okay, so here's base Mistral 7B. Here are some examples. I've got to move on to V-STaR. Base Mistral 7B takes this question from GSM8K: Janet's ducks lay 16 eggs, minus 3, minus 4, so that's nine, and then she sells the remainder at $2 per egg. How much in dollars does she make? So base Mistral 7B gives the wrong answer, because she's
supposed to take 9 * 2 and give us 18. Instead, it gives us 12 * 2, where the 12 is hallucinated, so it gives us 24. Whereas Quiet-STaR breaks it out step by step with all these reasoning chains and gives us the correct answer at the end. There are a lot of examples of this, where Quiet-STaR, trained with a lot more thinking ahead, maybe reduces hallucination, and that is the entire source of advantage for Quiet-STaR.

Okay, so I think the lift is not that impressive, and I think the ability to deploy in production is not that impressive. Sorry, I'm jumping around, but I'm trying to look for the numbers. I think with the performance numbers in the end, the juice is not worth the squeeze, as a famous member of the paper club has said. It's kind of theoretically cool, but we probably need something better than this.

So V-STaR, a February paper done by a Mila PhD student, unrelated to the other guy, takes STaR in a different direction, which I like a lot. It makes the same criticism of STaR: that it's not gaining enough information from the incorrect solutions, so it's potentially neglecting valuable information. So V-STaR utilizes both correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges the correctness of model-generated solutions. You could even call this an LLM-as-a-judge. The verifier is used at inference time to select one solution among many candidate solutions. The diagram is [ __ ] so I made it small. But the improvement is so much more worth it than Quiet-STaR that I think we should just look at V-STaR instead. V-STaR is demonstrated to improve across a lot of these angles. I wish that it had a
better diagram. I didn't have time to make a mock diagram of this, but basically, it's training a verifier to judge between model solutions. Let me just, where's the paper? I'm pretty bullish on training verifiers as a part of your process, and then using that as an artifact to run in production. Where can I show this? So they were comparing against all these other guys, V-STaR versus all the others, and I really like that you can just apply it versus majority voting and basically destroy it. V-STaR is able to scale with the number of candidates k, because in the training process you're already training the verifier, and that verifier you can use separately from the raw model itself, which is kind of cool. So you can basically pick out the right answers.

Let me show some examples. Oh yeah, this is what I was trying to offer. Again, this kind of paper is not really great, but here's an example of the kind of verifier they would train. Here's a GSM8K question, and here are two candidate answers that were generated by STaR. V-STaR adds a verifier on top of that, which is trained to detect the right answer from these. So it would get a verifier score, and it would do something like this: you take a question, you have a solution, and it'll pick among a list of candidate solutions. Majority voting would pick the most common solution rather than the most correct solution, and V-STaR uses DPO to pick the most correct solution. I hope I explained that correctly.

So I guess the question is, how do they even train V?

What do you mean?

What was the input label?

The input label is correct solution and wrong solution.

Yeah, but how would they distinguish?

Here's the algorithm, at least. Label
correctness.

I see.

A little bit more readable than the other guy. I'll be going into this. I like this just because you want to maximize information from your dataset, your information gain, and it was obvious that the original STaR paper threw away a lot of incorrect stuff. The original STaR paper's insight was to do rationalizations, but here we're actually training that into a verifier model, which can get better over time but then also be deployed. I like that idea, that we can deploy this and not have to rely on the hope that we can fine-tune it into the base model.

But now you have two models, right?

Now you have two models, yeah. This ties in a lot with Let's Verify Step by Step from OpenAI. So this is where I see us going, from STaR to Quiet-STaR into V-STaR.

The verifier verifies the entire thing? Because Let's Verify Step by Step verifies sub-parts, right?

Yeah, it's a process reward model. It verifies parts along the way.

So is V-STaR process reward, or is it full?

That's a question. I don't think so. I don't think it's a process reward, but I think we could use it to create a process reward as well. But this is a relatively simple paper. It only talks about label correctness at the end, so this is an outcome reward model, not process reward.

Thank you.

So I don't know, there's a body of literature that is coming together, and it's like, oh, we'll probably use some combination of all these things.

That makes sense.

I think we ran out of time. Okay, I'll stop here, stop the recording here, and we can open up for other Q&As or whatever. Oops, no, no, no.
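The verifier discussion above can be sketched as follows, under stated assumptions: sampled solutions are labeled by whether their final answer matches the gold answer, correct and incorrect solutions are paired up as (chosen, rejected) examples of the kind a DPO trainer consumes, and at inference time the highest-scoring candidate is picked instead of the most common one. All names, the pairing of every correct solution with every incorrect one, and the hard-coded scores are hypothetical; a real verifier score would come from the trained model.

```python
from collections import Counter
from itertools import product


def label_correctness(solutions, gold_answer):
    """Split sampled solutions by whether the final answer matches the gold one."""
    correct = [s for s in solutions if s["answer"] == gold_answer]
    incorrect = [s for s in solutions if s["answer"] != gold_answer]
    return correct, incorrect


def build_preference_pairs(solutions, gold_answer):
    """Pair each correct solution (chosen) with each incorrect one (rejected)."""
    correct, incorrect = label_correctness(solutions, gold_answer)
    return [(c["text"], w["text"]) for c, w in product(correct, incorrect)]


def majority_vote(candidates):
    """Return the most common final answer among the candidates."""
    return Counter(c["answer"] for c in candidates).most_common(1)[0][0]


def best_of_k(candidates, score_fn):
    """Return the final answer of the candidate the verifier scores highest."""
    return max(candidates, key=score_fn)["answer"]


# Toy candidates for the egg question; the "score" values are made up.
candidates = [
    {"text": "12 * 2 = 24", "answer": "24", "score": 0.20},
    {"text": "16 - 4 = 12, 12 * 2 = 24", "answer": "24", "score": 0.30},
    {"text": "16 - 3 - 4 = 9, 9 * 2 = 18", "answer": "18", "score": 0.90},
]

pairs = build_preference_pairs(candidates, gold_answer="18")
print(len(pairs), "preference pairs")
print("majority vote:", majority_vote(candidates))
print("verifier pick:", best_of_k(candidates, lambda c: c["score"]))
```

Because the label only checks the final answer, this corresponds to the outcome-reward setup described above rather than a process reward model.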
Original Description
Following the Strawberry launch, we'll survey a few related papers rumored to be relevant:
- STaR: Bootstrapping Reasoning With Reasoning (https://arxiv.org/abs/2203.14465)
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (https://arxiv.org/abs/2403.09629)
- V-STaR: Training Verifiers for Self-Taught Reasoners (https://arxiv.org/abs/2402.06457)
Join the LS paper club every wednesday: https://lu.ma/ls