The Origin and Future of RLHF: the secret ingredient for ChatGPT - with Nathan Lambert

Latent Space · Advanced ·🎮 Reinforcement Learning ·2y ago
The origins of Reinforcement Learning from Human Feedback (RLHF), sociology's influence on it, the tension between human and synthetic data, and emerging research in the field.

Full notes and writeup: https://www.latent.space/p/rlhf-201

Timestamps
[00:00:00] Introductions and background on the lecture origins
[00:05:17] History of RL and its applications
[00:10:09] Intellectual history of RLHF
[00:13:47] RLHF for decision-making and pre-deep RL vs deep RL
[00:20:19] Initial papers and intuitions around RLHF
[00:27:57] The three phases of RLHF
[00:31:09] Overfitting issues
[00:34:47] How preferences get defined
[00:40:35] Ballpark on LLaMA2 costs
[00:42:50] Synthetic data for training
[00:47:25] Technical deep dive in the RLHF process
[00:54:34] Rejection / best-of-N sampling
[00:57:49] Constitutional AI
[01:04:13] DPO
[01:08:54] What's the Allen Institute for AI?
[01:13:43] Benchmarks and model comparisons

What You'll Learn

The video discusses the origins and future of Reinforcement Learning from Human Feedback (RLHF), its application in language models, and the tension between human and synthetic data. It features Nathan Lambert, a PhD in Robotics and model-based reinforcement learning, who shares his expertise on RLHF and its role in developing chatbots like ChatGPT. The video covers various topics, including the history of RLHF, its techniques, and its applications in language models, as well as the challenges a

Full Transcript

Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host swyx, founder of Smol AI.

Hey, and today we have Dr. Nathan Lambert in the house. Welcome.

Thanks, guys.

You didn't have to come too far. You got your PhD at Berkeley, and it seems like you've lived there most of your time in recent years. You worked on robotics and model-based reinforcement learning for your PhD, and you also interned at FAIR and DeepMind. You bootstrapped the RLHF team at Hugging Face, and you recently joined the Allen Institute as a research scientist. So that's your quick bio. What should people know about you that maybe is not super obvious from your LinkedIn?

I stay sane through various insane ultra-endurance sport activities that I do.

What's an ultra-endurance sport activity?

Long-distance trail running or gravel biking.

Nice, nice.

I try to unplug sometimes, although it's harder these days.

Well, the Bay Area is just really good for that stuff, right?

Oh yeah, you can't beat it. I have a trailhead about 1.2 miles from my house, which is pretty unmatchable in any other urban area.

Yeah, pretty excellent. You also have an incredible blog, Interconnects, which I'm a fan of, and I also recently discovered you have a new podcast, The Retort.

Yeah, we do. I've been writing for a while, and I feel like I've finally started to write things that are understandable and fun, after a few years lost in the wilderness. If you ask some of my friends that I made read the earlier blogs, they're like, oh, this is, yikes. But it's coming along. The podcast is with my friend Tom, and we just riff on what's actually happening in AI. We don't really do news recaps, but talk about what it all means and have a more critical perspective on the things that really are kind of funny but still very serious happening in the
world of machine learning.

Awesome. For people who are new to your work, what would you highlight as your greatest hits so far, on Interconnects at least?

The ones that are most popular are timely and/or opinion pieces. The first real breakout piece was in April, when I just wrote down the thing that everyone in AI was feeling, which is that we're all feeling stressed, that we're going to get scooped, and that we're overworked. That's what it feels like behind the curtain to work in AI. A similar one, which we might talk about later, was about my recent job search, which wasn't the first time I wrote a job search post. People always love that stuff. It's easy for me to do in a way that's very on brand, and it's very helpful. I understand that until you've done it, it's hard to share this information. The other popular ones are on various model training techniques or fine-tuning. There's an early one on RLHF. This stuff is all just what I write when I figure it out in my brain, so I wrote an article called "How RLHF actually works," which is just the intuitions I had put together in the summer about RLHF, and that did pretty well. And then I opportunistically wrote about Q*, which I hate that you have to do, but it is pretty funny. From a literature perspective, OpenAI publishes work that is very related to mathematical reasoning, so you just poke around a little at what they've already published and it seems pretty reasonable. But we don't know. They probably just got a moderate bump on one of their benchmarks and then everyone lost their minds. It doesn't really matter.

This is why Sam Altman was fired. I don't know. Anyway, we're here to talk about RLHF 101. You did a presentation, and I think you expressed some desire to re-record it, and that's why I reached out on Twitter saying, why not re-record it with us, and then we
can ask questions and talk about it.

Yeah, sounds good. I try to do it every six or 12 months, that's my estimated cadence, just to refine the ways that I say things. People will see that we don't know that much more, but we have a bit of a better way of saying what we don't know.

Awesome. We can dive right in. I don't know if there are any other topics that we want to lay out as groundwork.

No, you have some awesome slides. For people listening on the podcast only, we're going to have the slides in our show notes, and then we're going to have a YouTube version where we run through everything together.

Sounds good. So, to start, I'm skipping a lot of the "what is a language model" stuff. Everyone knows that at this point. I think the quote from the Llama 2 paper is a great tidbit on RLHF becoming a real deal. There was some uncertainty earlier in the year about whether or not RLHF was really going to be important. I think it was not that surprising that it is. With recent models still using it, the signs were there. But the Llama 2 paper essentially reads like a bunch of NLP researchers who were skeptical and surprised. The quote from the paper was: "Meanwhile, reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness." You don't really know exactly what costs and time Meta is looking at, because they have a huge team and a pretty good amount of money to release these Llama models, but this is just the kind of thing that we're seeing now. I think any major company that wasn't doing RLHF is now realizing they have to have a team around this. At the same time, we don't have a lot of that in the open and research communities at the same scale. I think seeing that converge would be great, but it's still very early days. The other thing
on the slide is some of Anthropic's work. Everyone knows Anthropic is kind of the masters of this, and they have some of their own techniques that we're going to talk about later on. But that's kind of where we start.

Can we do just a one-second RL diversion? You come from a robotics background, where RL used to be, or maybe still is, state of the art. And now you're seeing a lot of LLM plus RL. So you have Jim Fan's Eureka, and you have Imbue, which we had on the podcast. They started with RL, and now they're doing RL plus LLMs. Any thoughts there on how we got here, and maybe how the pendulum will keep swinging?

I really think RL is a framing of viewing the world through trial-and-error learning and feedback, one that's focused on thinking about decision-making and inputs in the world and how inputs have reactions. In that, a lot of people come from a lot of different backgrounds, whether it's physics, electrical engineering, or mechanical engineering. There are obviously computer scientists, but compared to other fields of CS, I do think it's a much more diverse background of people. My background was in electrical engineering and doing robotics and things like that. It really just changes the worldview. I think that reinforcement learning as it was back then, so to say, is really different. You're looking at these toy problems and the numbers are totally different, and everyone went kind of zero to one at scaling these things up. But people like Jim Fan and others saw this transition in the Decision Transformer and related papers, when people were trying to use transformers to do decision-making for things like offline RL, and I think that was kind of the early days. But then once language models were so proven, everyone is using this tool for their research. I think in the long run it will still settle out, and RL will still be a
field that people work on, just because of these fundamental things that I talked about. It's viewing the whole problem formulation differently than predicting text, so there needs to be that separation. And the view of RL in language models is pretty contrived already, so it's not like we're doing real RL. I think the last slide that I have here is about a way to make RLHF more like what people would think of as RL, so actually running things over time. It's a weird lineage of tools that happened to get us to where we are, so that's why the name takes up so much space, but it could have gone a lot of different ways.

Cool. We made it one slide before going on a tangent.

Yeah, I mean, it's kind of related.

So we have a history of RL.

Yeah. To give the context, this paper really started because I've had this more diverse background than some computer scientists, and I was trying to understand what the difference between a cost function, a reward function, and a preference function would be. Without going into all of the details, costs are normally things that control theorists would work with in these kinds of closed domains, and reinforcement learning has always worked with rewards, which is central to the formulation that we'll see. Then the idea was, okay, we are now at preferences, and at each step along the way there are different assumptions that you're making. We'll get into these, and those assumptions are built on other fields of work. So what the slide says is that RLHF, while directly building on tools from RL in language models, is really implicitly impacted by and built on theories and philosophies spanning tons of human history. I think we cite Aristotle in this paper, which is fun. It's going way back, BC, like 2,300 years old or something like that. So that's the reason to do this. I think we kind of list some things in the paper about
summarizing what different presumptions of RLHF could be. I think going through these is actually kind of fun, because they're a grab bag of things that you'll see return throughout this podcast. The core thing of RLHF, in order to be a believer in this, is that RL actually works. If you have a reward function, you can optimize it in some way and get a different performance out of it, and you can do this at scale, and you can do this in really complex environments. I don't know how to do that in all the domains. I don't know how to exactly make ChatGPT. So that kind of overshadows everything. And then you go from something kind of obvious like that to the von Neumann-Morgenstern utility theorem, which is essentially an economic theory that says you can weight different probabilities of different people. It's a theoretical piece of work that is the foundation of utilitarianism, and trying to quantify preferences is crucial to doing any sort of RLHF. If you look into any of these, there's way more you could go into if you're interested. This is kind of grabbing a few random things. Kind of similar to that is the Bradley-Terry model, which is the fancy name for the pairwise preferences that everyone is doing. And then there are all the things that Anthropic and OpenAI figured out that you can do, which is that you can aggregate preferences from a bunch of different people and different sources, and then when you actually do RLHF, you extract things from that data, and then you train a model that works somehow. There are a lot of complex links there that we don't understand, but if you want to be a believer in doing this at scale, these are the sorts of things that you have to accept as preconditions for doing RLHF.

Yeah, you have a nice chart of the intellectual history of RLHF
that we'll send people to refer to, either in your paper or in the YouTube video for this podcast. But I like the other slide that you have on the presumptions that you need to have for RLHF to work. You already mentioned some of those. Which one do you think is underappreciated? This is the first time I've come across the VNM utility theorem.

Yeah, I know. This is what you get from working with people like Tom, my co-host on the podcast The Retort, who is a sociologist by training, and philosophers [unclear] into this. Essentially, there are even economic theories where there's debate about whether or not preferences exist at all, and there are different types of math you can use depending on whether or not you can actually model preferences at all. It's pretty obvious that RLHF is built on the math that assumes you can actually model any human preference, but this is the sort of thing that's been debated for a long time. All the work that's here is what people hear about in their AI classes, like Jeremy Bentham and hedonic calculus. These are the sort of work where people assume that preferences can be measured. When you look at this, I kind of go on a rant and say that in RLHF, calling things a preference model is a little annoying, because there's no inductive bias of what a preference is. If you were to learn a robotic system and you learned a dynamics model, hopefully that actually mirrors the world in some way, in terms of the dynamics. But with a preference model, I don't know what ChatGPT encodes as any sort of preference, or what I would want it to be in a fair way. Anthropic has done more work on trying to write these things down, but even if you look at Claude's constitution, that doesn't mean the model believes these things. It's just
trained to prioritize these things. That's kind of what the later points are about, looking at what RLHF is doing and whether it's actually a repeatable process in the data and in the training. That's just unknown, and we have a long way to go before we understand what this is, and the link between preference data and any notion of writing down a specific value.

Did this connection between sociology work and computer science work already exist, or is it a recent cross-contamination? When we had Tri Dao on the podcast, he said FlashAttention came to be because at Hazy Research they have so much overlap between systems engineers and deep learning engineers. Is it the same in this field?

I've gone to a couple of workshops where the populations of people who you'd want to include in this are present. I think the reason why it's not really talked about is just because the RLHF techniques that people use were built in labs like OpenAI and DeepMind, where there are some of these people. These places do a pretty good job of trying to get these people in the door when you compare them to normal startups, but they're not bringing in academics from economics or social theory. There's just too much. The criticism of this paper that this is based on is, oh, you're missing these things in RL, or this decade of RL, and it's like, well, it would literally be bigger than the Sutton and Barto book if you were to include everyone. So it's really hard to include everyone in a principled manner when you're designing this. It's just a good way to understand and improve the communication of what RLHF is and what a good reward model for society is. It really probably comes down to what an individual wants, and it'll probably motivate models to move more in that direction. And just a little bit about the communication, which is a recurring theme of my work, is that I just
get frustrated when people say things that don't really make sense, especially when it's going to manipulate individuals' values or manipulate the general view of AI or anything like this. So that's kind of why RLHF is so interesting. It's very vague in what it's actually doing, while the problem specification is very general.

So, reinforcement learning. I kind of mentioned this, it's a trial-and-error type of system. The diagram in the slides is really this classic thing where you have an agent interacting with an environment. The agent has some input to the environment, which is called the action. The environment returns a state and a reward, and that repeats over time. The agent learns based on the states and rewards that it's seeing, and it should learn a policy that makes the rewards go up. That seems pretty simple.

Then if you try to mentally map what this looks like in language, which is slide seven, the language models don't make this easy. With a language model, it's very hard to define what an environment is. If the language model is a policy and it's generating, the environment should be a human. But setting up the infrastructure to take tens of thousands of prompts, generate completions, show them to a human, collect the human responses, and then shove that into your training architecture is very far away from working. So we don't really have an environment. We just have a reward model that returns a reward, and the state doesn't really exist when you look at it like an RL problem. What happens is the state is a prompt, then you do a completion, and then you throw it away and grab a new prompt. Whereas as an RL researcher, you would think of this as: you take a state, you get some completion from it, then you look at what that is and keep iterating on it. All of that isn't here, which is why you'll hear RLHF
referred to as a bandits problem, which is kind of like you choose one action and then you watch the dynamics play out. There are many more debates that you can have on this. If you get the right RL people in the room, the question of whether this is even RL comes up when you zoom into what RLHF is doing.

Does this change as you think about chain-of-thought reasoning and things like that? Does the state become part of the chain that you're going through?

There's work that I mention on one slide called process reward models, which essentially rewards each step in the chain-of-thought reasoning. It doesn't really give you the interaction part, but it does make it a little more fine-grained, where you can think about it as at least having many states from your initial state. That formulation, I don't think people have fully settled on. I think there's a bunch of great work out there. Even OpenAI is releasing a lot of this, and "Let's Verify Step by Step" is their pretty great paper on the matter. I think in the next year that'll probably get made more concrete by the community, on whether you can easily draw out if chain-of-thought reasoning is more like RL.

RLHF for decision-making. You have a slide here that compares pre-deep RL versus deep RL.

Yeah, this is getting into the history of things, which is showing that the work that people are using now really came from well outside of NLP, and it came before deep learning was big. The step from this paper, TAMER, which is from 2008, with some names that are still really relevant in human-centric RL, Bradley Knox and Peter Stone: if you have an agent take an action, you would just have a human give a score from zero to one as a reward, rather than having a reward function. Then, with that classifier, you can do something with a policy that learns to take actions to maximize that reward. It's a pretty simple setup. It works in simple domains. And then the
reason why this is interesting is you compare it to the paper that everyone knows, which is the Paul Christiano et al. "Deep Reinforcement Learning from Human Preferences" paper, where they showed that by learning from human preferences you can solve the basic RL tasks of the time, so various control problems in simulation. This human preferences approach had higher rewards in some environments than if you just threw RL at the environment that returned a reward. The preferences thing was that you took two trajectories, so in this case complete trajectories of the agent, and the human labeled which one is better. You can see how this comes to be like the pairwise preferences that are used today, which we'll talk about. There's also a really interesting nugget, which is that the trajectory that the humans were labeling over has a lot more information than the RL algorithm would see if you just had one state, which is kind of why people think the performance in this paper was so strong. But I still think it's surprising that there isn't more RL work of this style happening now. This paper was in 2017, so it's six years later, and I haven't seen things that are exactly similar. But it's a great paper for understanding where the stuff that's happening now came from, and that's what the next few slides go into.

Just on the Christiano paper, you mentioned the performance being strong. What results should I have in mind when I think about that paper?

It's mostly, if you think about an RL learning curve, where on the x-axis you have environment interactions and on the y-axis you have performance, you can think about different ablation studies between algorithms. I think they use A2C, which I don't even remember what that stands for, as their baseline. But if you do the human preference version on a bunch of environments, with the
human preference labels, the agent was able to learn faster than if it just learned from the signal from the environment. That's happening because the reward model has more information than the agent would. But the fact that it can do better was pretty surprising to me, because RL algorithms are pretty sensitive.

Okay. One thing I do want to establish as a baseline for our listeners: we are updating all the weights, right? In some sense, the next-token prediction task of training a language model is a form of reinforcement learning, except that it's not from human feedback. It's just self-supervised learning from a general corpus. There's one distinction which I love, which is that you can actually give negative feedback, whereas in a general pre-training situation you cannot. And maybe the order of magnitude of feedback, like the Likert scale that you're going to talk about in future slides, actually just gives more signal than a typical training process would in a language model setting.

Yeah, I don't think I'm the right person to comment exactly, but you can make the analogy that reinforcement learning is self-supervised learning as well. There are a lot of things that point to that. On whether or not it's a richer signal, I think that could be seen in the results. It's a good thing for people to look into more. As reinforcement learning is so much less compute, it is a richer signal in terms of its impact. If they could do what RLHF is doing at pre-training, they would, but they don't know how to have that effect in a stable manner. Otherwise everyone would do it.

So on a practical basis, as someone fine-tuning models, I have often wished for negative fine-tuning, which pretty much
doesn't exist in OpenAI land, and it's not the default setup.

How does this work in diffusion models and stuff? Because you can give negative prompts to something like Stable Diffusion or whatever.

That's for guidance. That's for CLIP guidance.

Is that just from how they prompt it, then? I'm just wondering if we could do something similar.

It's a tangent.

Right. Anyway, I do want to spell that out for people in case they haven't made the connection between RLHF and the rest of the training process. They might have some familiarity with it.

These coming slides really dig into this. There's this 2018 paper, a position paper from a bunch of the same authors as the Christiano paper and the OpenAI work that everyone knows. They write a position paper on what a preference reward model could do to solve alignment for agents, and it's based on two assumptions. The first assumption is that we can learn user intentions to a sufficiently high accuracy. That one doesn't land with me, because I don't know what that means. But the second one is pretty telling in the context of RLHF, which is that for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior. And this is the whole thing. We can compare two poems that the model generates, and it can be viewed as liking a positive example, or it can be viewed as really disliking a negative example. That's what I think a lot of people are doing in the harm space: a harmful response from a language model, whether or not you agree with the company's definition of harms, is a really bad negative example, and they downweight it by preferring something more benign in the RLHF process, among other ways of dealing with safety. So this is a good way of saying that this kind of comparison and
positive or negative example is core to all of the RLHF work that has continued.

Maybe I'll try to put a more colloquial restatement of this. People often say, I don't know what I want, but I'll know it when I see it. This is that, expressed in reinforcement learning.

It is. That's what everyone's doing in the preference modeling stage that we'll get to. And you can see there are more papers. This is really just to have all the links for people to go deeper. There's a Ziegler et al. paper in 2019, which shows that you can do this RLHF process on language models. This familiar diagram starts to emerge in 2019. It's just to show that this goes really far back. I think we can breeze through some of these. Then 2020 is the first OpenAI experiment that I think caught people's eyes, which was this Learning to Summarize experiment. It has this three-step process that we'll go into more when I go into the main concepts, but this is the first time you see this diagram that they reused with InstructGPT and reused with ChatGPT. As for the types of examples they would have, I don't think I need to read these exactly, but one that I have read a whole bunch of times is that they took these prompts from Reddit that were like "explain like I'm five" or "get career advice," and people really pour their heart and soul into these. So these are multi-paragraph pieces of writing, and then they essentially do comparisons between a vanilla language model, I think it was, what was the timeline, either GPT-2 or GPT-3? I always get the exact...

GPT-3 was early 2020, so that's about right.

Yeah, so this is probably done with GPT-2. It doesn't really matter. But the language model does normal things when you do few-shot, which is that it repeats itself and doesn't have nice text. This was the first time where the language model would generate pretty nice text as an output. It was restricted to the summarization domain, but I
think this is where I wish I had been paying more attention, because I would see the paper, but I didn't know to read language model outputs and understand the qualitative sense of the models very well then. If you look at the plots in the papers, Learning to Summarize and InstructGPT have incredibly pretty plots, with nicely separated lines and error bars, and they're like, supervised fine-tuning works, the RL step works. But if you were early to see how different the language written by these models was, I think you could have been early to things like ChatGPT and to knowing RLHF would matter. Now I think the good people know to chat with language models, but not even everyone does this. People are still looking at numbers. I think OpenAI probably figured out when they were doing this how important that could be, and then they had years to chisel away at it, and that's why they're doing so well now.

I mean, arguably, it's well known that ChatGPT was kind of an accident, that they didn't think it would be that big of a deal.

So maybe they didn't, but they were getting the proxy that they needed.

I've heard off the record from other labs that it was in the air. If OpenAI didn't do it, someone else would have done it. So you've mentioned a couple of other papers that are very seminal to this period, and I love how you say "way back when" in referring to 2019.

It feels like it in my life.

So how much should people understand the relationship between RLHF, instruction tuning, PPO, KL divergence, anything like that? How would you construct the level of knowledge that people should dive into? What should people know at the high level, and then if people want to dive in deeper, where do they go? Is instruction tuning important here, or is that part of the overall process toward modern RLHF?

I
think for most people, instruction tuning is probably still more important in their day-to-day life. Instruction tuning works very well. You can write samples by hand that make sense, you can get the model to learn from them, and you can do this with very low compute. It's easy to do, almost in no-code solutions at this point, and the loss function is really straightforward. Then, if you're interested in RLHF, you can learn about it from a different perspective, which is how the instruction tuning distribution makes it easier for your RLHF model to learn. There are a lot of details, like whether your preference data is close to your instruction model or not and whether that matters, but that's really at the RLHF stage. So I think it's nice to segment and understand what your level of investment and goals are. Instruction tuning can still do most of what you want to do. If you want to think about RLHF, at least before DPO had really taken off at all, you'd want to have a team of at least five people if you were really thinking about doing RLHF. I think DPO makes it a little bit easier, but that's still really limited to one dataset that everyone's using at this point. Everyone's using this UltraFeedback dataset, and it boosts AlpacaEval, MT-Bench, TruthfulQA, and the qualitative model a bit. We don't really know why. It might just be that dataset combined with the method. But you've got to be ready for a bumpy ride if you want to try to do RLHF. I don't really recommend most startups do it unless it's going to provide them a clear competitive advantage in their niche, because you're not going to make your model ChatGPT-like, better than OpenAI or anything like that. You've got to accept that there's some exploration there, and you might get a vein of benefit in your specific domain, but I'm still
like, oh, be careful going into the RLHF can of worms. You probably don't need to.

Okay. So there's a bit of a time skip in what you mentioned. DPO is like a couple months old, so we'll leave that towards the end. But I think the main result that most people talk about at this stage (we're talking about September 2020 and then going into, I guess, maybe last year) was Vicuna, as one of the more interesting applications of instruction tuning that pushed LLaMA 1 from, let's say, a GPT-3-ish model to a GPT-3.5 model in pure open source with not a lot of resources. I mean, they said something like they used under $100 to make this.

Yeah, instruction tuning can really go a long way. I think the claims of ChatGPT level are long overblown in most of the things in open source. That's not to say it didn't matter: Vicuna was a huge step, and it's just kind of showing that instruction tuning with the right data will completely change what it feels like to talk with your model, from text completion to actually chatting back and forth, multi-turn. Yeah, instruction tuning can be multi-turn. Just having a little bit of data that's a couple of turns can go a really long way. I think that was the story of the whole first part of the year: people will be surprised by how far you can take instruction tuning on a small model. I think the thing people see now is that the small models don't really handle nuance as well, and they can be more repetitive even if they have really good instruction tuning. But if you take that kind of 7 to 70 billion parameter jump, the instruction tuning at the bigger model brings robustness, and little things make more sense. That's still just with instruction tuning and scale more than anything else.

Yeah, excellent. Shall we go to the technical overview?

Yeah, this is kind of where we go through my own version of this three-phase process.
You talk about instruction tuning, which we've talked about a lot. It's funny, because of all these things, instruction tuning has the fewest slides even though it's the most practical thing for most people. We could save for later the debate over whether the big labs still do instruction tuning, but that's a coming wave for people. Then there's preference data and training, and then what reinforcement learning optimization actually means. We talk about these sequentially because you really have to be able to do each of them to be able to do the next one. You need to have a model that's chatty, or helpful, or instruction following; every company has their own word that they like to assign to what instructions mean. Once you have that, you can collect preference data and do some sort of optimization.

When you say word, you mean like angle-bracket INST, or do you mean something else?

Oh, I don't even know what INST means. I'm just saying they use the adjective that they like. I think Anthropic has one too; "steerable" is another one, just the way they describe it.

Yeah.

So, instruction tuning, we've covered most of this. It's really about adapting your model to specific needs. It makes models that were only okay extremely comprehensible. A lot of the time it's where you start to get things like chat templates. So if you want to do system prompts, if you want to ask your model to act like a pirate (that's one of the ones I always do, which is always funny), or whatever you like, act like a chef, anything, this is where those types of things that people really know in language models start to get applied. So it's good as a kind of starting point, because this chat template is used in RLHF and all of these things down the line. A basic pointer, and once you see this with instruction tuning you really know it, is that you take things like Stack Overflow where you have a question
and an answer, you format that data really nicely, and you push it through the model. The model then kind of knows what to do when somebody asks a question. There are surely trickier things that people do, but I still think the vast majority of it is question-answer: please explain this topic to me, generate this thing for me. That hasn't changed that much this year. I think people have just gotten better at scaling up the data that they need.

This is where this talk will take a whole left turn into more technical detail land. I put in a slide with the RLHF objective, which I think is good for people to know. I've started going back to this more, just to understand what is trying to happen here and what type of math people could do. I think that's because of this algorithm we've mentioned, it's in the air, direct preference optimization. But everything kind of comes from an equation of trying to learn a policy that maximizes the reward. The reward is some learned metric, and a lot can be said about what the reward should be. That's subject to some constraint, and the most popular constraint is a KL constraint, which is just a distributional distance. Essentially, in language models that means that if you have a completion from your instruction or RLHF model, you can compare that completion to a base model. Looking at the log probs from the model, which are essentially how likely each token is, you can get a rough calculation of the distance between these two models, just as a scalar number. As for what that actually looks like in code, you can look at it: it'll be a sum of log probs that you get right from the model. It'll look much simpler than it sounds. It's just there to make the optimization stay on track. It's a guardrail that makes sure it doesn't overfit to the RLHF data. Because we have so little data in RLHF, overfitting is really something that could happen. I think it'll fit to specific features that labelers like to
see, that the model likes to generate: punctuation, weird tokens like calculator tokens. It could overfit to anything if it's in the data a lot and happens to be in a specific format, and the KL constraint prevents that. There's not that much documented work on that, but there are a lot of people who know that if you take it away, it just doesn't work at all. So it is important, but I think it's something that people don't focus on too much. This objective, as I said, is just: you optimize the reward (the reward is where the human part of this comes in, and we'll talk about that next), subject to a constraint: don't change the model too much. The real questions are how you implement the reward, and then how you make the reward go up in a meaningful way.

So, a preference model. The task is kind of to design a human reward. The key equation that most of this work is based on right now is something called a Bradley-Terry model, which is a pairwise preference model where you compare two completions and you say which one you like better. I'll show an interface that Anthropic uses here. The Bradley-Terry model is really a fancy probability between two selections. What's happening in the math is that you're looking at the probability that the chosen completion, the one you like better, is actually the better completion over the rejected completion. What these preference models do is assume this probability is correlated to reward. So if you just sample from this probability, it'll give you a scalar, and then you use that reward later on to signify which piece of text is better. I don't know, I'm kind of inclined to breeze through the math stuff, because otherwise it's going to be not as good to listen to.

No, no, I think people want to hear it. I think there are a lot of higher-level explanations out there.

Yeah. So the real thing is you need to assign a scalar
reward of how good a response is, and that's not necessarily that easy to understand. If we go back to one of the first works I mentioned, this TAMER thing for decision-making, people tried that with language models: if you have a prompt and a completion and you just have someone rate it from 0 to 10, could you then train a reward model on all these completions and 0-to-10 ratings and see if you can get ChatGPT with that? The answer is really kind of no. A lot of people tried that and it didn't really work, and that's why they tried this pairwise preference thing, and it happened to work. This Bradley-Terry model comes from the '50s. It's from these fields that I was mentioning earlier, and it's wild how much of this happens. This screenshot I have in the slides is from the DPO paper, I think it might be the appendix, but it's still really around in the literature of what people are doing for RLHF.

Yeah, so it's a fun one to know. I'll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing, you have a preference for a different thing. And actually, coming from economics (you mentioned economics earlier), there's a theorem, or a name for this, called Arrow's impossibility theorem, which I'm sure you've come across.

Yeah, it's one of the many kinds of things we throw around in the paper.

Right. Do we just ignore it?

Yeah, just aggregate.

Okay.

Yeah, I think the reason this really is done on a deep level is that you're not actually trying to model any contestable preference in this. You're not trying to go into things that are controversial or anything. Really, the notion of preference is trying to stay around
correctness and style rather than any meaningful notion of preference, because otherwise these companies really don't want to do this at all. I think that's just how it is. And if you look at what people actually do, I have a bunch of slides on the feedback interface. I think they all publish this; it's always in the appendices of every paper. Yeah, it's pretty interesting. There's something later on in this talk, but it's good to mention here: when you're doing this preference collection, you write out a very long document of instructions for the people who are collecting this data. It's like: this is the hierarchy of what we want to prioritize, something like factuality, helpfulness, honesty. These are all different things, and every company will rank them in different ways and provide extensive examples: if you see these two answers, you should select this one, and why, and all of this stuff. My head-scratcher is: why don't we check whether the models actually do the things that we tell the data annotators to collect? I think it's because it's hard to make that attribution, and it's hard to test if a model is honest and so on. It would just be nice, as a researcher, to understand the causal mechanisms, or whether our goals are met. But at a simple level, what it boils down to (I have a lot more images than I need) is that you're having a conversation with an AI, something like ChatGPT, you get shown two responses, or more in some papers, and then you have to choose which one is better. Something you'll hear a lot in this space is something called a Likert scale. Likert is a name; it's a name for probably some researcher in economics, decision theory, something. Essentially, it's a type of scale where, if you have integers from like 1 to 8, the middle numbers will represent something close to a tie, the smallest numbers will represent one model
being way better than the other, and the biggest numbers will mean the other model is better. So in the case of 1 to 8, if you're comparing models A and B, you return a one if you really liked option A, you return an eight if you really liked B, and a four or five if they were close. There are other ways to collect this data. This one's become really popular. We played with it a bit at Hugging Face, and it's hard to use. Filling out this preference data is really hard: you read multiple paragraphs. It's not for me. Some people really like it, I hear. I can't imagine sitting there and reading AI-generated text and having to do that for my job. But a lot of these early papers in RLHF have good examples of what was done. The one I have here is from Anthropic's collection demo, because it was from slides that I did with Anthropic, but you can look these up in the various papers. It looks like ChatGPT with two responses, and then you have an option to say which one is better. It's nothing crazy. The infrastructure is almost exactly the same, but they just log which one you think is better. I think places like Scale are also really big in this, where a lot of the labeler companies will help control who's doing it, how many samples you have, whether you have multiple people go over the same sample, and what happens if there's disagreement. I don't really think this disagreement data is used for anything, but it's good to know what the distribution of prompts is, who's doing it, how many samples you have. Controlling the workforce, all of this, is very hard. A last thing to add is that a lot of these companies do collect optional metadata. I think the Anthropic example shows a rating of how good the prompt or the conversation was, from good to bad, because things matter. There's kind of a quadrant in preference data in my mind, which is: you're comparing a good answer to a good answer, which is really interesting signal,
and then there's the option where you're comparing a bad answer to a bad answer, which you don't want to train your model on. We did this at Hugging Face, and with our data we were like, we don't know if we can use this, because a lot of it was just bad answer to bad answer, since we were rushing to try to do this real contract. And then there's also good answer to bad answer, which I think is probably pretty reasonable to include: you just prefer the good one and move on with your life. Yeah, those are very different scenarios. I think the OpenAIs of the world are all in good answer versus good answer and have learned to eliminate everything else. But when people try to do this in open source, it's probably like what Open Assistant saw: there are just a lot of bad answers in your preference data, and you're like, what do I do with this? Metadata flags can help. I threw in the InstructGPT metadata on slide 28. You can see how much they collect here: everything from the model failing to actually complete the task, to hallucinations, different types of offensive or dangerous content, moral judgment, expressing opinion. I don't know exactly if they're doing this now, but you can kind of see why doing RLHF at scale and prioritizing a lot of different endpoints would be hard, because these are all things that I'd be interested in if I were scaling up a big team to do RLHF: what is going into the preference data and what happens? You do an experiment where you're like, okay, we remove all the data where they said the model hallucinates and then retrain everything. What does that do?

Yeah, so hallucination is big, but some of these other metadata categories, and I've seen this in a lot of papers, are like: does it contain sexual content, does it express a moral judgment, does it denigrate a protected class, that kind of stuff. Very binary. Should people try to adjust for this at the RLHF layer, or should they
put it in a pipeline where they have a classifier as a separate model that grades the model output?

Do you mean for training or for deployment?

Deployment.

I do think that people are doing it at deployment. I think we've seen safety and other things in the RLHF pipeline. Llama 2 is famous for having these helpfulness and safety reward models. Deep in the Gemini report is something about Gemini having like four things, which is helpfulness, factuality, maybe safety, maybe something else. But places like Anthropic and ChatGPT and Bard most surely have a classifier after, which is: is this text good, is this text bad? That's not that surprising, I think, because you could use a hundred-times-smaller language model and do much better at filtering than RLHF. But I do think it's still so deeply intertwined with the motivation of RLHF being for safety that some of these categories still persist. I think that's something that'll kind of settle out.

I'm just wondering if it's worth collecting this data for the RLHF purpose if you're not going to use it anyway, because you're just going to use a separate model.

Yeah, I don't think OpenAI will collect all of this anymore, but I think from a research perspective it's very insightful to know. It's also expensive. Essentially, your preference data scales with how many minutes it takes for you to do each task, and with every button it scales pretty linearly. So it's not cheap stuff.

Since you mentioned expensiveness, I think you may have joined one of our Spaces back when Llama 2 was released. We had an estimate from you that was something on the order of: Llama 2 cost $3 to $6 million to train GPU-wise, and then it was something like $20 to $30 million in preference data. Is that something that's still in the ballpark? I don't need precise numbers.

I think it's still ballpark. I know that the 20 million was off by a factor of four, because I was converting from
a prompt number to a total data point. So essentially, when

The video teaches the origins and applications of Reinforcement Learning from Human Feedback (RLHF) in language models, and how it can be used to improve model performance and develop chatbots like ChatGPT. It also discusses the challenges and limitations of using human feedback in RLHF.
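
The transcript describes the RLHF objective as maximizing a learned reward subject to a KL constraint that keeps the tuned model close to a reference model, and notes that in code this comes down to a sum of log probs. The sketch below illustrates that bookkeeping in plain Python for a single sampled completion. It is a rough illustration under stated assumptions, not the implementation from the talk or from any library: the function names, the `beta` weight, and all numbers are hypothetical.

```python
def sequence_kl_estimate(policy_logprobs, reference_logprobs):
    """Sum of per-token log-prob differences for one sampled completion.

    Both lists hold log p(token | context) for the SAME tokens: once under
    the model being tuned and once under the frozen reference model.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty (beta is an assumed weight)."""
    kl = sequence_kl_estimate(policy_logprobs, reference_logprobs)
    return reward - beta * kl


# Hypothetical log probs for a three-token completion.
policy_lp = [-0.2, -0.5, -0.1]     # tuned model finds its own sample likely
reference_lp = [-0.9, -1.1, -0.4]  # reference model finds it less likely

kl = sequence_kl_estimate(policy_lp, reference_lp)
score = penalized_reward(1.5, policy_lp, reference_lp, beta=0.1)
print(round(kl, 3), round(score, 3))
```

The further the tuned model drifts from the reference on its own samples, the larger the summed difference becomes and the more reward it gives up, which is the guardrail role the speaker attributes to the KL term.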

Key Takeaways
  1. Build a language model with RLHF
  2. Collect preference data from human feedback
  3. Fine-tune the model with instruction tuning
  4. Optimize the model with direct preference optimization
  5. Evaluate the model's performance with metrics like helpfulness and safety
💡 RLHF is a powerful technique for improving language model performance, but it requires careful consideration of human feedback and preference modeling to achieve optimal results.
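
To make the preference-modeling step concrete, here is a minimal sketch of the pairwise (Bradley-Terry) setup discussed in the transcript: an annotator rates two completions on a 1-to-8 scale, the rating is turned into a chosen/rejected pair, and the probability that the chosen completion wins is computed from two scalar rewards. The function names, the rule that splits the scale in the middle, and the reward values are assumptions made for illustration; real pipelines may treat near-tie ratings differently.

```python
import math


def preferred_from_rating(rating, scale_max=8):
    """Map a 1..scale_max comparison rating to the preferred option.

    Assumed convention (from the 1-to-8 example in the transcript):
    low numbers favor option A, high numbers favor option B.
    """
    if not 1 <= rating <= scale_max:
        raise ValueError("rating outside the scale")
    return "A" if rating <= scale_max // 2 else "B"


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen completion is the better one."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))


def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the annotator's choice; lower is better."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical scalar rewards a reward model might assign to two completions.
rewards = {"A": 2.0, "B": 0.5}

chosen = preferred_from_rating(2)  # annotator strongly preferred A
rejected = "B" if chosen == "A" else "A"

p = preference_probability(rewards[chosen], rewards[rejected])
loss = pairwise_loss(rewards[chosen], rewards[rejected])
print(chosen, round(p, 3), round(loss, 3))
```

Training a reward model on such pairs would push the chosen reward above the rejected one by minimizing this loss; only the difference between the two rewards matters, which is why equal rewards give a probability of one half.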
