Catherine Olsson - Mechanistic Interpretability: Getting Started
Key Takeaways
The video discusses mechanistic interpretability, a subfield of interpretability focused on understanding the internal workings of neural networks. Catherine Olsson of Anthropic shares her insights and experience on getting started in the field, including using tools such as PySvelte (built on Svelte, a library similar to React) for interactive visualizations.
Full Transcript
Thank you so much for being here. My name is Madeline, and I'm happy to welcome you to this Cohere For AI community talk. Cohere For AI is a research lab looking to change how research is done, where, and by whom, and one of the ways we do this is by supporting an international community of machine learning researchers around the world. It's my honor today to introduce two of our community members, Jen and Jonas, who have volunteered and taken it upon themselves to organize this wonderful community series. So to get things started, I'll pass things off to Jen to welcome today's speaker.

Great, thank you, Madeline. It is my absolute pleasure to introduce Catherine Olsson. Catherine has had a great career in machine learning: she worked at Google Brain, OpenAI, and the Open Philanthropy Project, and now she is at Anthropic, whose mission is reliable, interpretable, and steerable AI systems. Catherine also worked on the Dota project while at OpenAI, which, as I hope many of you know, was able to beat the best Dota players in the world. Her personal interests are building high-performing, reliable systems by actually understanding how neural nets do what they do. And with that, I'll pass it on to Catherine.

Great, thanks so much, everyone, for having me. I'm going to dive in in just a moment on getting started in mechanistic interpretability, but before I do that, I would love it if everyone in the chat could say just a quick sentence about yourself. I like knowing who the audience is, so please jump in and type a little something about where you're at or what you're interested in. While you're doing that, I'll say a little more about what we're going to cover today. I'm going to talk about how to get started in mechanistic interpretability. You might not have heard of this specific subfield of interpretability, so the first thing I will do is warm us up: I'm going to explain what this subfield is and how you
can get started in it. Okay, so I'm going to start jumping in. Feel free to keep dropping information about yourselves there; I'll see that flowing through.

The overall view of what I'm going to talk about is as follows. First, I want to convince you that mechanistic interpretability is an exciting field, that there's lots of low-hanging fruit, and that it's approachable to generalists, so to folks like yourselves. I think there are also some risks in doing this kind of work: there are a lot of gotchas, and it's a little less legible. So I want to sell both the pros and the cons. But I think if you're interested in doing independent ML research, this is a good area to start in. It's growing rapidly and there's a lot to do, so I'll give a bit of an overview from that perspective.

So first of all, what is mechanistic interpretability? I call this an audience poll, but I'm going to again try to get a little participation here. If you're able to get a notebook or pull up some notes, like iCloud Notes or something, to think in... Can we do group shares here? Not quite. So we're not going to do this as a group share; we're going to do a silent write. In order to understand what mechanistic interpretability is, I want to introduce the different types of machine learning interpretability. I think a given approach to interpreting ML is characterized by these three questions: who is doing the interpreting, why do they want to interpret the model, and what type of interpretation will achieve that goal? So I want you all to do a silent write for a handful of minutes: brainstorm three different use cases, three different sets of answers to these three questions. An example might be: I'm a farmer using machine learning as a product for my apple picking, and I want to understand why it's picking the unripe apples,
and so I want an explanation. That could be one example; try to come up with a couple more. I'll step back for a few minutes.

Okay, I think that's enough time to get some juices flowing and wrap your heads around this question. I'll give one example of something I would not call mechanistic interpretability, and then I'll explain what I think counts. One example that's actually pretty common in the literature: the people interpreting are some kind of end user. Maybe it's doctors who are using some kind of prediction product or system to automatically diagnose liver cancers or something, and they want to look at the diagnosis and ask, "Do these outputs of the machine make sense to me, the practitioner, the expert, the end user?" So they want to know, in a particular case, why the model made that judgment. This is not what I'm talking about. This line of work is fantastic and important, but it's a completely different use case from the type of work I'm going to be talking about, so I want to distinguish the two. When people say "interpretability" they can mean something like this, which is valuable, and it's a different endeavor.

I'm talking about a case where the people interpreting are researchers. These are folks like me; I've also put here my colleague Nelson Elhage, who has done a bunch of other fantastic work. So the end users are highly skilled people. They're you, yourself, with your own eyeballs. In this field it's not about making the model accessible to some end user or layperson or anything like that. The reason we're interpreting is that we want a scholarly understanding of what the model learned to do. Our understanding is the thing we publish; it's not some explanation shown in the UI for a user. And the type of explanation we're seeking is a complete mechanism of the behavior: not a one-sentence summary, not a quick saliency map, but a full account of how the network did what it did. I'll pause here if
there are any questions about this framework. Okay, cool, so let's jump in. We're talking about researchers understanding what's really, really going on. Let's do a quick quiz to test your understanding. Read this abstract and judge: do you think this is mechanistic interpretability? I'll give you a couple of seconds and then... I wonder if you can do a hand-raise poll. Jonas, is that going to work? Can you count hands? Okay, I think it's just going to show one person. Yes, one person. Jonas, let's pause for a second, give people a moment to read the abstract, and then we'll try to do the poll. So read for a moment.

All right. I wonder if someone brave wants to raise your hand and give your view on whether this is or isn't mechanistic interpretability, and why. Max?

Yeah, hi, can you hear me?

Hear you, gotcha.

Nice. So basically, going by what you said before, I did not think this was mechanistic interpretability, as this seems to be a way to make your model perform better by changing it. Also, I'm not sure, but you sort of implied that you need to have a whole, for lack of a better term, reasoning chain within the neural network, and I did not see anything along those lines here.

This is a fantastic answer, exactly. I think your first cue is, as you said, that the purpose of the explanation is to improve the model, and that's different from... yes, here we go. Okay, so this is exactly right; good job, Max. In this case the people doing the interpreting are ML engineers trying to get a system that works in practice, and they want better accuracy. They're looking at the model's weights on a given feature to try to adjust that. Let me go back to this text. Right in the middle, it talks about using explanation methods to increase the predictive accuracy, so that's the purpose of this kind of work. And then it's saying, when the model has incorrectly assigned importance
to some features. So it's helping you find overweighted features, but it's not a complete chain of how that feature is getting weighted. So, right, this is interpretability, and I'm sure it helps a bunch of people, but it's different from what we're talking about. Okay, so let's go back to what we're comparing against: we want researchers getting a scholarly understanding and a complete mechanism.

Here's another one; let's try it. I'll give you time to read. Okay, anyone want to give this one a shot?

Yes, can you hear me?

Gotcha.

Yes, I think this is mechanistic interpretability, because we're trying to understand how the model is understanding the curves, what the model is doing with the curves.

Yes, exactly, great. The main thing is the kind of understanding we're coming up with. Right in the middle again, it asks how they are built from earlier neurons, so it's a full explanation of their construction. I also added the second paragraph from the intro because it says the audience is the interpretability community. The audience is the researchers generating a scholarly understanding, and it highlights that there are some disagreements in this scholarly community about how the network is built. So those are a couple of cues you can use here to determine that this is researchers doing the understanding, communicating to other researchers, and investigating a full end-to-end mechanism. Great. So yes: the interpretability community, understanding how curves are detected, and how neurons are built from other neurons. This is what we're talking about today.

Now I'm going to make a pitch for why I think this is a good place to start if you want to do independent research. I'll first try to convince you that it's exciting. One thing I'll say about the style of slides I'm using, and I hope I'll be able to share them later, is that I'm just going to link out to a bunch of stuff that's already out
there on the web, in part to empower you to look if you want to go deeper. Most of the stuff that I'm presenting is already out there.

So, exciting stuff so far. I'm going to plug the team that I've been working on. We post our work on this website, transformer-circuits.pub, and we have a couple of papers out where we're studying transformer language models. Quick show of hands: who here has a loose familiarity with large language models, generative language models? Six... ten people? Okay, so we've got about half. Great, thank you, everyone. So a little under half of folks. If you don't, I think you might find the details I'm talking about a little hard to follow. That's okay; you can listen for the general principles of the type of work I'm talking about. I'll be a little brief about the cool stuff that we've found, but let me go thing by thing and explain why I think each of these has been exciting.

In this original paper that we came out with, "A Mathematical Framework for Transformer Circuits" (I'll show you the contents on the side here), we've basically been able to reverse engineer zero-, one-, and two-layer transformers almost completely. That was a lot of fun, and we have mathematical concepts that we use to explain and describe the different components of these tiny transformers, which do generalize to larger transformers. So this was a big update for us: it's possible to get a handle on the full calculation within a tiny model. And then we built up from there.

This next one is work that I've done. Let's see, I probably should have gone to the tweet threads rather than the paper itself for these. This work, which was also an incredible amount of fun, was looking at a sudden, abrupt improvement early in transformer training in how well the models can use early tokens to predict late tokens. And in fact the very cool thing about this work is that
there's this little bump that you can see any time someone publishes a transformer loss curve. It's amazing: every single loss curve out there in the world has this little bump if it's from training a transformer language model. So we dove into what causes this, and at least in small models we have a handle on exactly what is forming, which is a particular kind of attention head that causes this bump. Thank you to whoever posted that tweet thread. So this was a real delight, and in a similar vein to that curve detectors paper, you can see in the table of contents over here that we have six different arguments for how we come to conclude that this mechanism is behind this observed behavior. There was a hand; I don't know if that went away, but put it back up if you want to ask something. So I found this very exciting to work on, where we can go all the way from a little bump in a loss curve to a complete mechanism of what kind of attention head is causing it.

Then I'll briefly gesture at some other work we've been doing. These last two on the page are about how we can make the neurons correspond to concepts more cleanly. Softmax linear units is one change to models that we can make that seems to increase the number of distinct, interpretable concepts on neurons. And then this toy models paper: everyone loves this paper, for good reason. It's a delight, it's well explained, and it works on really tiny toy models. These are not giant transformers; you don't have to understand more than a simple autoencoder to get through it. We were trying to understand how and why networks pack multiple features onto a neuron. Why isn't it always that you have the dog head neuron and the car wheel neuron? How come you sometimes have neurons that do dog heads and car wheels? What's up with that? I think this toy model has given us a fantastic language and framework for
thinking about that, one that allows us to think more clearly about larger models as well. So this has also been very exciting, and I think overall, moving from the vision world to the language world, there have been a lot of new and exciting insights here as well. So that's my quick pitch that exciting stuff is going on.

There's also a lot of low-hanging fruit, which I think makes this a particularly appealing area. For this barely explored stuff, I'm again going to go back to our group and point to our papers. Any of our papers has an open questions section. Often, especially if you've been coached by an advisor on how to write a paper, they say "put open questions here" and you just put random crap that you don't care about, like "maybe someone could explore this in some other application, I guess." These open questions are not random crap. We really feel in our heart of hearts that if someone made progress on these questions, it would be a real advance.

So let me start here: is there a statistical test for catching superposition? I didn't define that because I'm going pretty quickly, but that's the phenomenon where the network is packing two features, like the car wheels and the dog heads, into one neuron. Can we catch that in an automated way? Right now we just have to look at the neuron. If you could come up with a way to detect this automatically, that would be a game changer for us. And there are other questions, like how realistic are these toy models? You could take one of the toy models and then maybe a few-layer transformer and do some kind of compare and contrast; that would also be incredibly valuable. How important are these neurons? If you can come up with some way of judging whether these feature-combining neurons are contributing more or less than the simpler single-feature neurons, that would also be amazing. I'm not
going to go through this whole list. It goes on and on, and all of our papers have a list like this. The work that I'm showing here is from my team at Anthropic, which is just one small team, a small handful of people, so we're trying to go area by area, make a contribution, and then leave a lot of threads for people to follow. I would love it if people would follow up on these. Like I said, there's a lot of low-hanging fruit here.

And then I think the core point I want to make for this audience is that this work is approachable to generalists. I'm very fortunate that my colleague Neel, who's currently an independent researcher in the UK, just put out a tweet thread about this. So I'll do a little compare and contrast where I'll show previous work and explain the tools you need, but I'm going to make use of this thread. He posted it just this morning, which saved me having to add some more slides here.

Chris Olah, who's the manager of the team that I'm on at Anthropic, has put out a lot of great work on Distill in the past, especially on vision. My complaint is that the past work is too fancy; I think it looks a little unapproachable. Now, it's kind of an annoying complaint to have, because if you read a paper like this one, "The Building Blocks of Interpretability," I think it's a great explanatory tool. But it's like a textbook. In a textbook, the images have been laid out as pedagogically crisply as possible, but hours of work have gone into making that a teaching tool and an enticing visualization. So you look at a graph like this, where they're discovering which channels are contributing to a given judgment, and you can see, oh, that dog eye channel, which is active on the dog eye, has 1.19 net evidence for the dog class. You can hover over this and see, and it's a good learning experience. But if you're doing the work, you
are not going to be working in this environment. This is not what your lab bench looks and feels like; this is the published product. So this might give you the impression that you need to be a data visualization expert. And in fact, if we look at the authors here, Shan Carter, who's a researcher I really admire, worked on that paper we were just on, and, oh, his past work is doing interactive graphics for the New York Times. This is a top-tier graphics expert. You do not need to be a top-tier graphics expert, and your work is just as scientifically valid if it looks terrible.

So here's a realistic picture of what working with these models is like. I can work in these web interfaces that we've thrown together, where I just have different dimensions or different neurons, and I'm hovering over them, seeing where they activate, and getting a feel for it. It is an interactive web visualization, but it's a much more approachable, down-to-earth context. So what we do in mechanistic interpretability does involve building interactive tools that we can use to understand models, but they can be janky, and that still gets the job done. I want to make the pitch that poor quality in terms of glossiness doesn't mean poor quality in terms of utility for getting your work done.

As I said, Neel has this guide to what it takes to get started. I think it's fantastic, and I also want to do a quick audience poll on some of these. I'll skim through it; you can see this is a short blog post. He's describing what skills you need in order to contribute. I'll read through the skills, and then I'll do a quick poll: if you know one of these skills, you can sign up and say, "I would be happy to mentor, teach, or explain this to other people in this community or on this call." If people want help with the skills, we can do a little bit of peer connection. So I'll read through it, and then I
think Jonas or Madeline is going to run a little polling exercise. So Neel says people often get intimidated, so here's a guide to the essential skills. The core skills that Neel's pointing out: math-wise, linear algebra. You need to understand what a basis is, and that a vector space doesn't necessarily have a canonical basis. If these words don't make sense to you yet, there are a couple of channels that can get you up to speed on this. Then the real basics of probability; it doesn't go very deep at all. And enough calculus to get what backprop is, nothing more.

Coding: I think if you can get Python and NumPy working well enough that you can then get to fast.ai-level machine learning, that's good enough. You really don't need to know how to train really cutting-edge models. In fact, I'm terrible at training models, and I hate it, and yet somehow I've had a career in machine learning, because mostly I analyze models that exist. So you don't need to be that good at training models; so long as you understand them as artifacts, you can do a good job. And PyTorch, basically.

The current cutting edge that I work on is understanding transformer language models, so I think this is the section that might be most new, especially as I saw maybe only half the hands went up for familiarity with this. I think there are some okay explanations of how transformers work. It is important to notice that you can ignore the encoder-versus-decoder framing, because modern transformers like GPT-2 are decoder-only, so there is not this complicated double stack that everyone explains. So that's it. There's a bonus section, and then there's the end, but this list is just not that long. Skills-wise, I want to convey that in order to get started looking at neurons and understanding them mathematically, it doesn't take that much. Jonas, are you going to run this little polling exercise?

Yes, we can do that. Great, I
wonder if the way to do this is to invite people to comment in the chat with their information if they'd be willing to share their skills with others. Also, just an invitation to everyone on this call to join our community; I think the Discord server that we're a part of would also be another wonderful place to connect with one another and share skills in this area.

That's great, cool. So I'm going to go skill by skill; raise your hand if you feel comfortable with people asking you questions about this. Let's start with any of these math skills: linear algebra, probability, calculus. If you know any of these and feel like someone else on this call could reach out, raise your hand. Quick question from Max: what's SVD? That's the singular value decomposition here. Okay, cool, so we have a handful of people who can answer maths questions. Amazing. Let's jump to coding, Python and NumPy: if you feel like you could help people with that, please raise your hand. Amazing. Let's do these first two machine learning bullets, so a rough grounding in machine learning and PyTorch basics. I'll let hands toggle for a while. Amazing, a wonderful set of hands. And then transformers. Because this one's a little unusual, if you are able to answer questions about transformers, please put your name in the chat, or any contact details you're comfortable with. Okay, amazing. So yes, there are wonderful people on Discord, and I think this is recorded, so if you want to come back and look at that list of hands, or otherwise work out who that was, that'd be great. You have people in this community who can be a resource to you on this; you don't need to be contacting experts on the topic. So, resources available. Amazing, thank you, Max, and everyone who's showing up with offers of help.

This is all to say that the set of prerequisites is just not that hard. And then I'll tweet it out later today: I threw together this very poor-quality tutorial
about PySvelte, but better to release than nothing. PySvelte is a library that we've released at Anthropic. I'll click through to the GitHub. It allows you to move your data between Python and JavaScript. I'll highlight this one paragraph that explains basically what it is. The vast majority of deep learning research is done in Python, and there are sophisticated libraries, great. However, if you want to visualize stuff, the web standards, HTML, JavaScript, and CSS, are the right tools for putting things on a page, having clickable buttons, and so on. So how do we deal with this? We don't want to try to use JavaScript to train models, and for interactive, clickable visualizations, Python is a bit of a tricky place to do that. So what do we do? We attempt to bridge the ecosystems. We have this library that allows you to take your data in Python and throw it over to an HTML web page. Svelte is a library like React, but much simpler to use, so you can use this library to get your data from Python to a visualization like that. I won't go into too much more detail about this, but I'll release it and say: if you want to learn how to use this library in order to get interactive visualizations of your own running, I will release this little tutorial. The library is already out there on GitHub.

Okay, so here's another realistic example of someone's workflow. Again, Neel has been tweeting up a storm, and I'm very grateful because it makes my job easier. Neel says: I've been learning basic web dev recently. I've been pleasantly surprised by how useful being able to make tacky interactive visualizations is for mechanistic interpretability. Here's a shitty tool I hacked together to explore a GPT-2 neuron that activates on text. I'll let you watch this animation for a second. So he's made this text box where he can type, and then it's clearly polling in the background a
server where it's getting activations from GPT-2, a pretty small model, and then coloring the tokens by their activation on this particular neuron. So he's doing a deep dive on a specific neuron, what it activates in response to and how it behaves, and he's just handwritten this little tool for himself. What I want you to take away from this is that these are simple JavaScript components: text boxes, colors, background colors. You do need to know a little bit of HTML and CSS to do this, but it's just not that much, and it doesn't need to be as beautiful as a Distill paper. Sorry, I've got a little bit of sun in my face. So yeah, this is not as beautiful as a Distill paper, but it's perfectly usable. I think this is very approachable stuff, and I'm grateful to Neel, my colleague, for banging the drum on approachable tools for doing this kind of work.

Bumping back over here, one question you might also have is: okay, that's skills, that's tools, what about compute? How many GPUs do I need to do this work? I want to bump us over to this work that I did. Oh, that's interesting, I mistyped something; let's figure that out. Here we go. Okay, so this is work where doing it took a lot of GPUs, but any given model that I'm studying is quite small. Let me go to these ablation studies. This is very GPU heavy: each of these lines is a model that I've trained, and I've measured something about it at every single step of training, and I have ablated heads. Each of these is a single model, so I've gone over and over again with lots and lots of different models. This kind of multi-scan ablation work is going to take a large number of GPUs, but you don't have to do work like this. One thing I'll mention: look at this, this is a single one-layer and two-layer model. You could build this kind of visualization if you had just a single GPU on Colab available to
you. You could train a one- or two-layer transformer with no problem. So even in this large behemoth of a paper, where I did scan after scan and sweep after sweep, I was just filling our cluster with truly hundreds and thousands of tiny, tiny transformers. But any given tiny transformer you can run on one GPU, and you can inspect these neurons and these properties. I'll also say, for the toy models work, those toy models are even smaller in some cases and can also be done with very minimal resources.

Another thing is that anything that runs on a GPU, you can study, so you don't necessarily need to pay for the resources to train the thing. I was just showing you that both the toy models work and the induction heads work are helped if you can train your own tiny model over several time steps, but you can also study already-trained models. Then you just need the ability to run a language model at inference time, which is much cheaper, because you don't need to spin it up and teach it everything it doesn't know. So for inference-time work, any model where you can access the neurons and their activations is enough for you to start studying it. That's another pitch that I want to make here. So compute-wise, I think there are plenty of questions that are accessible with a small amount of compute. This is another reason I'm really trying to pitch this group on this kind of work.

So, a couple of difficulties if you're starting on this kind of work as an independent researcher. As I say here, there are gotchas: there's stuff that might seem mathematically intriguing or intuitive to jump into, and it turns out to actually be a confused or not well-founded way to think about it. Sorry, I'm still a little distracted by all these light beams streaming at me. Okay, there we go. So this can be tricky, and I think if you're doing work like this, it's helpful to start getting feedback early so you avoid this kind of thing.
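[Editor's sketch] To make the "janky tools are enough" point concrete: once you can read one neuron's activation for each token from a model at inference time, coloring the tokens takes only a few lines. The following is a minimal sketch in plain Python, not the PySvelte API and not the tool shown in the talk. The function name `color_tokens`, the red color scale, and the token and activation values are all invented for illustration; real activations would come from a model such as GPT-2.

```python
import html

def color_tokens(tokens, activations):
    """Return an HTML string where each token's background is shaded by its activation."""
    if len(tokens) != len(activations):
        raise ValueError("need exactly one activation per token")
    # Largest positive activation, used to scale shading into the range [0, 1].
    peak = max((a for a in activations if a > 0), default=0.0)
    spans = []
    for token, act in zip(tokens, activations):
        # Zero or negative activations are left unshaded in this sketch.
        strength = act / peak if peak > 0 and act > 0 else 0.0
        style = f"background-color: rgba(255, 0, 0, {strength:.2f})"
        # Escape the token so characters like "<" display literally in the browser.
        spans.append(
            f'<span style="{style}" title="{act:.3f}">{html.escape(token)}</span>'
        )
    return "".join(spans)

# Hypothetical tokens and activations, made up for this example only.
tokens = ["The", " cat", " sat", " <b>"]
activations = [0.0, 2.0, 0.5, -1.0]
page = color_tokens(tokens, activations)
print(page)
```

Writing the returned string into an `.html` file and opening it in a browser gives a static version of the token-coloring view; hovering over a token shows the raw activation through the `title` attribute. An interactive version would add a text box and a small server that reruns the model, as in the tool described above.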
Here's one gotcha. If we talk about studying neurons, okay, here I've cropped out a little chunk of The Illustrated Transformer, and you might ask, where in here are there neurons I can study? We do study them in the boxes labeled "feed forward." That's where you have what we end up calling neurons. These are vectors where each entry of the vector has a nonlinearity after it; it's like it has a firing function or an activation function. Each entry in the vector has its own activation function applied, and that's why you can study each entry one at a time: they're independently operated on, and they have what we call a privileged basis. So that's why we study neurons in these feed-forward layers.

Now, we've had people approach us and say, why don't you study neurons in the residual stream? This is kind of a poor diagram of it, I'm sorry I couldn't draw a better one, but you can see these dashed lines that are circling around. I think a transformer is often well thought of as having this residual stream that's constantly flowing through. People ask, why don't you study those dimensions? The first objection would be that it's not a privileged basis. It's just a big vector, and the elements of the vector don't individually mean anything. So that's a first-pass no. I've dropped this link here where we explain in one of our papers what a privileged basis is. I'll read this out: a privileged basis occurs when some aspect of a model architecture encourages the neural network features to align with the basis dimensions, the individual entries of the vector. For example, blah blah blah. Some types of interpretability only make sense for activations with a privileged basis. It doesn't make sense to look at the neurons in the residual stream. Okay, so this is a first pass, where people ask us this and we say no, it doesn't make sense. Okay, it
gets more complicated. OK, so actually, sorry, before I say it gets more complicated, I just want to throw this diagram in here. This is just a two-dimensional vector space, and these are two different features; maybe it's the cat feature and the dog feature or something. These features are represented as vectors. They're orthogonal vectors, but they don't line up with the x-coordinate and the y-coordinate. So this is what I mean when I say it doesn't line up with the basis vectors. OK, so again, thank you, Neel, for tweeting all this stuff, which makes it very easy for me to show off. Then there's this question of, ah, but maybe there is a privileged basis in the residual stream because of the way floating-point arithmetic works. There's this post that he links out to, and this other question; you might start working on this and ask, is there or isn't there a privileged basis in the residual stream? I just want to stop here and say this is sort of the cutting edge of our understanding of whether you can or can't think of those neurons as meaningful, and if you can, it's not because of the math of the Transformer, it's because of leaky abstractions. So I just want to back out and say, OK, there are gotchas here. This is complicated, and if you want to do work that is mathematically well grounded, it can be a little tricky to make sure that the concepts you're thinking of, your ideas, are perfectly clear and crisp. So if you want to work on this kind of stuff, I suggest tweeting about it, or writing little one-pagers of your understanding and showing them around and getting people to look them over, saying, am I thinking about this right? Here's a few paragraphs of my thinking. Get people to check, so you don't go weeks and months down a rabbit hole of some research idea before you check with other people in your community that you are
thinking clearly about it and that you have the right assumptions about the basic concepts you're using. So that's gotchas. The other concern you might have is just, is this stuff legible as normal research? If you're trying to build legible credibility, the kind you'd get with a machine learning PhD, I think there are pros and cons of trying to do that with this work. So you might ask, well, what can I publish if I do this kind of work? There are formats, right: arXiv papers; plenty of folks have published arXiv papers on mechanistic interpretability. There's now a long history of interactive websites, which I think are a fantastic way, even if they're a little janky, to get people's hands on the stuff that you found. There's a lot of Twitter threads; again, Neel just published some, like, here's a cool thing that I found. There's not a lot of journal- and conference-published work out of this community. OK, why is that? Here's why. The kinds of findings range from "I found a cool neuron", and let me just put a gold star on this and say this is a wonderful finding. If you just find a cool neuron, it's the equivalent of being a biologist and you walk out in the field and you're like, I found this bug that might be a different species. It's not the same as a full field survey to fully characterize the new species and prove that it's a new species, but it's a great starting point. It's vital. And so if we had more folks who were just looking through neurons in models, being like, what is this, this is like an alphabetized-list neuron, and then you throw that on your blog or you tweet that... I mean, tag me if you do. I love this. It's wonderful. It's a meaningful, real scientific contribution; it's something people can follow up on to confirm. So, OK, but "I found a cool neuron" is not going to get you published in NeurIPS, I'm so sorry. So the scale and scope of what is an entire paper-worthy
contribution is larger than the scale and scope of what I would find cool to read on your blog or Twitter, and what I would love to see. And then there's sort of getting into, here's an explanation of a mechanism: here's how a curve detector works, here's how induction heads work, here's what they do. There's no state of the art on anything. Even if you write a long, textbook-chapter-like explanation of the mechanism you found, I think it's a little unclear which conferences and which reviewers will understand this as a contribution, if they are used to seeing contributions like, here's my new technique and here's how the numbers went up, or here's some performance evaluation. These don't come with performance evaluations, and so I think it's a little unclear which reviewers are going to find that in scope or be able to understand it. So that's part of why this community has been developing somewhat outside of conventional conferences and academia, for better or for worse. I've been glad to see more people submitting arXiv papers, which at least starts to get things into a format that could then be submitted to journals and conferences. But I want to pose a real warning: for all that it's great and approachable, and just finding cool neurons is a valid research contribution, it also might not give you something with the stamp of approval that you're looking for from the conventional authorities. Although I will just reassert that getting a community of your peers to look at what you're doing is very important. So even though I'm describing interactive websites and Twitter threads, let me see if I can just bump over... we do get all of our stuff peer reviewed pre-publication. If you go to the bottom of any of our papers, sorry, it's a long, long, long paper... oh, we put at the bottom comments and replications that we've solicited from our research community. So we send it out to folks,
we get comments in, and we get people to replicate our key findings, just to make sure that we are confident in what we're putting out before we put it out there. This step is really important. When you put out a blog post or something like this, it's important to say, here's what I did, here's who I had look at it. Put in your acknowledgments who looked at it, solicit any comments that you can, and put it out there in a space where people can chime in. I think this sort of scholarly community discussion and validation is important. So even if we're not doing this through a publication venue that's assigning us three different reviewers and scores, we are nonetheless getting this reviewed by peers. I think that's key, and especially if you're an independent researcher, finding your other channels to get feedback is really important. So, yeah, and you can also check out... I think many of these replications come with their own code, or they might be accessible on publicly available models, so if you want to take some of our work as a jumping-off point, this can be a good starting point. So, let's see, I think I'm pretty much done. I just want to review overall where we've come. I hope I've convinced you what this field is, that it's exciting, and that there's a lot left to do, genuine open questions. The skills needed are approachable for folks with a basic technical background. There are gotchas, and there's this legibility problem, where it's maybe tricky to get a published paper that's in a format other people would expect. I think nonetheless there's a growing community, many of whom are on Twitter, who are happy to chime in and support you. So if that's appealing to you, I think the approachability and the excitement are a good selling point as well. So that's me on getting started in mechanistic interpretability. Thanks for showing up. Awesome. If you have any questions, you can put them in the Q&A.
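The privileged-basis point from the talk can be illustrated with a small numerical sketch. This is not code from the talk or from any Anthropic paper; the vector `x`, the matrix `W`, and the 45-degree rotation `Q` are hypothetical values chosen only to make the arithmetic easy to follow. It shows that a purely linear read of a vector gives the same answer in a rotated basis (so no basis is singled out), while an elementwise ReLU does not (so the neuron basis is special).

```python
import numpy as np

# Hypothetical 2-D example: a 45-degree rotation of the coordinate axes.
c = np.sqrt(2.0) / 2.0
Q = np.array([[c, -c],
              [c,  c]])          # orthogonal: Q.T @ Q is the identity

x = np.array([1.0, -1.0])        # stand-in for a residual-stream vector
W = np.array([[2.0, 0.0],
              [1.0, 3.0]])       # stand-in for a linear map reading from it

# Linear case: rotate the vector and compensate in the weights.
# The output is unchanged, so the coordinates of x carry no special meaning.
out_original = W @ x
out_rotated = (W @ Q.T) @ (Q @ x)
linear_is_basis_free = np.allclose(out_original, out_rotated)


def relu(v):
    """Elementwise ReLU: acts on each coordinate independently."""
    return np.maximum(v, 0.0)


# Nonlinear case: apply ReLU in the rotated basis, then rotate back.
# The result differs from ReLU in the original basis, so the basis matters.
relu_original = relu(x)
relu_in_rotated_basis = Q.T @ relu(Q @ x)
relu_is_basis_free = np.allclose(relu_original, relu_in_rotated_basis)

print("linear read is basis-free:", linear_is_basis_free)
print("ReLU is basis-free:", relu_is_basis_free)
```

Under these assumptions, the elementwise activation is what ties meaning to individual coordinates in the feed-forward layers, whereas a vector that is only read and written linearly has no such tie. The floating-point caveat mentioned in the talk is outside the scope of this sketch.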
Currently I don't see any questions, but I was just wondering: earlier on you showed that there are some open questions. If someone wanted to, let's say, address any of them, what's the best way for them to reach out about those questions? Great, yeah, feel free to email me. If I put it in the chat here, is that going to work? I'm Cathedral anthropic.com. I will try and find whoever on our team is best placed to help you out, although I'll say we're busy, so you might just get one back-and-forth from us with a quick pointer. You probably will not get an extended mentorship relationship, but we are happy to point you to resources or people. This is part of why I suggested trying to get mentorship on the individual sub-skills that are part of this. So if you're having trouble building a visualization, find someone who knows JavaScript; you don't need someone who's an expert in these particular questions. And then I would suggest, once you have early findings or a direction you're going, I think Twitter is a fantastic place to go, where there's enough people out there who will be able to read through what you're saying. I think also, if you have blogs or newsletters, those kinds of things work well, although you don't need to have that. But I would suggest, yes, Twitter is a great, constantly open community forum. [Music] Yeah, I mean, and again, if you have even just a one-pager with your findings, if you have done a rigorous analysis, even if it's very small scale, and you're confident in your findings, I think putting one-pagers on arXiv is undervalued. It doesn't need to be conference-type to be a valid contribution. So consider also putting your one-pagers on arXiv; like I said, if they are rigorous and you feel confident in them, I think that's a good thing to do as well. Right, yes, that's fantastic. If you have a question, you can just raise your hand. Maybe I can also
ask for now, and let's give people a moment. Let's see if something comes up. Yeah, two new questions in the Q&A. Jonas, are you able to pull those up? All right, I've got these. OK, so let's see, I've got one. So, from three: how domain-agnostic is the research being done in this space? Has there been any work done in this direction on transformers exposed to vision and maybe audio? I don't know of any mechanistic interpretability work on Transformers that's not for text. Just none yet. So, speaking of low-hanging fruit, I think if you were to find any basic Transformer on vision and audio and just see, does it even have induction heads, I think that would be a fantastic and very approachable question. OK, Max: my intuition is that adding vectors, such as after an attention softmax layer, loses information and hence wouldn't be super interpretable. Is this intuition that adding is worse than multiplying matrices correct, or am I on the wrong track? Max, can you just chime in verbally, because I'm not understanding how adding vectors loses information, and maybe we can get a little higher bandwidth here. Sure. Hey, so my intuition is that combining magnitudes of vectors kind of just sticks you in some area of the whole embedding space, as opposed to more gracefully combining them. And also, when you compute gradients through things, it makes more sense in my mind, when you are multiplying them, to see those gradients flow backwards. So those are my questions. It's like, wait, wouldn't adding vectors then lose you a lot of stuff? Specifically, I felt like it would be related to this interpretability, where if you add them together you would also lose some backprop. But again, I'm not quite sure if this is a correctly posed question at all. Yeah, I think I'm going to try to answer you both impressionistically and literally, and see if that'll land for you. Great. So, literally, right, the
attention... so, if you just have some pile of vectors and then you add some other pile of vectors, in some sense you're right, you have lost information. But you haven't necessarily deleted the magnitude along the original direction; you've just additionally contributed another direction. So it might be hard to then decompose, if those were not orthogonal, what fraction is one or the other. So I think theoretically that's true: you've lost a truly unique decomposition that you can check out. That's going to be true, of course, any time you have non-linearities, so all of your neural network is full of places where you're losing information in some sense. I think in practice it seems like often the residual stream keeps what was already there and then piles on additional stuff, so the projection along the vectors that were already present largely remains present, and then other projections are also added. Although, impressionistically, it seems like some layers are doing something like deliberately removing or deliberately deleting information. That makes sense, because in a Transformer each layer is moving from more like sensing and interpreting the original context so far, through to outputting the next token, right? And so as that representation transforms, it needs to become more and more narrow towards, what do I say next, what's the next output? And I think that narrowing is necessarily, through the layers, going to drop information that's not action-relevant for the Transformer. Yeah, that makes sense. You can analyze it at different layers and then see how this representation evolves. We do have... again, I'm the expert on the induction heads paper because I was pushing that one forward, and we do have a sort of analysis of how induction heads form, like in what layers they form. And there's kind of these middle
layers where it's doing these more complicated things, and those kinds of heads are not present towards the end. I think it remains to be seen; it could be a fun little analysis to ask, the representation output by the previous-token head or the induction head, does it get deleted by the end, or does it just sit there and it just doesn't matter? I think these are interesting questions as to whether, in practice, the information gets deleted. Yeah, I think that's pragmatically an open question. Nice, thank you so much. Yeah... it makes a lot of sense to me that you end up computing an error that you take gradients backwards through, so eventually you need to narrow down towards that idea, and you'll lose information. Thank you so much. Yeah, of course. More questions, please; these are good ones. OK, here's what I've got: is interpretability of high-dimensional data possible, considering that we only understand four total dimensions? I think you mean, like, three dimensions and time. Wonderful, wonderful question. So there's two things we can do to get a handle on high-dimensional data. One is, as I've repeated over and over and won't dwell on, if you have a privileged basis, such that if you have 300 dimensions you think the individual dimensions matter, you can just, with your eyeballs, go dimension by dimension. This works in small to medium-sized Transformers. In a small Transformer I can and have just looked at every attention head and perhaps even every neuron. That will run out once you have a 60-layer Transformer with 30,000 neurons per layer; that's going to stop working. So then what do you do? So, yeah, I guess all that to say, something like 56 dimensions is still tractable if there is a privileged basis, because you can look at them one at a time. If you have more than that, or it's not a privileged basis, then you need some other approach. That's almost always
going to be dimensionality reduction. I think until you've tried it, don't rule out that a regular PCA, or honestly we often use NMF, which is non-negative matrix factorization... a basic PCA or NMF can easily give you broadly what's going on in this vector space. Another thing that's helpful is a UMAP. So let me just type these; I'm saying PCA and NMF and UMAP. These are built into scikit-learn; you don't need to write them yourself. There are tools to do it. If you have a bunch of vectors in a vector space and you want to see roughly how they are clustering, I think a UMAP is a good way to do that. These can get you basically oriented, but I think if you want to go further you need a little bit of mathematical rigor to pose what question you're trying to ask and see how you can thread that question through a high-dimensional space and get an answer. There's another question. Yes, let's do it. Yes, great, Haley's is the last question. Thanks, Haley. OK: I'm curious what intuitions you have on bridging the gap between toy models and toy data settings and realistic-sized LLMs trained on scraped data. How would you advise applying theories or characteristics you've seen in small models to large ones to see if your analyses hold? Especially curious how to think of the layer norm here for large Transformers. OK, layer norm sucks; I'll come back to that. So I think the best we've done is, if you look at the induction heads paper, we say we've done a rigorous analysis on the small Transformers and we have suggestive evidence of how it scales up to large Transformers. I think that's going to be the paradigm for a little while at least, where you have statistical correlations that seem to line up, and you can say this thing that we can basically prove is going on in the small Transformers looks roughly the same. There's, I think, always going to be some of that gap, but as we see more work like this, and I would love to see more
work similar to that induction heads paper that's doing those correlational and sug
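As a rough illustration of the dimensionality-reduction suggestion in the Q&A above, here is a minimal sketch using scikit-learn's `PCA` and `NMF`. The activation matrix is synthetic and every name in it (`activations`, `feature_strengths`, `feature_directions`) is an assumption made for the example, not data or code from the talk. UMAP is left out because, as far as I know, it is provided by the separate `umap-learn` package rather than by scikit-learn itself.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for recorded activations: 200 samples x 50 neurons,
# generated from 3 hidden non-negative "features" plus a little noise.
n_samples, n_neurons, n_features = 200, 50, 3
feature_strengths = rng.random((n_samples, n_features))   # how active each feature is
feature_directions = rng.random((n_features, n_neurons))  # what each feature looks like
activations = feature_strengths @ feature_directions
activations += 0.01 * rng.random((n_samples, n_neurons))  # noise stays >= 0, as NMF requires

# PCA: orthogonal directions of greatest variance (components may be negative).
pca = PCA(n_components=5)
pca_coords = pca.fit_transform(activations)
explained = pca.explained_variance_ratio_

# NMF: non-negative parts that add up to the data, often easier to read
# when the activations themselves are non-negative (e.g. after a ReLU).
nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
nmf_coords = nmf.fit_transform(activations)
nmf_parts = nmf.components_

print("variance explained by first 3 PCA components:", explained[:3].sum())
print("NMF factor shapes:", nmf_coords.shape, nmf_parts.shape)
```

With real activations the number of components is a judgment call; plotting `explained` is one common way to get oriented before asking a sharper question of the space.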
Original Description
The Cohere For AI community was honoured to welcome Catherine Olsson to discuss the process of getting started in mechanistic interpretability.
Here is a link to Catherine's slides: https://docs.google.com/presentation/d/1BNY1xaJLBfMzcgrY_zjqtUAu1QXlzkbbhOlV7XVVlC4/edit?usp=sharing
Links referenced throughout the talk:
Transformer Circuits Thread: https://transformer-circuits.pub/
Twitter thread from Anthropic: https://twitter.com/anthropicai/status/1541469936354136064?lang=en
Twitter thread from Neel Nanda: https://twitter.com/NeelNanda5/status/1584648065759465472
A Barebones Guide to Mechanistic Interpretability Prerequisites: https://www.neelnanda.io/mechanistic-interpretability/prereqs
Slido: https://www.slido.com/
Learn more about Cohere For AI and our Community at https://cohere.for.ai/
This session is brought to you by the Cohere For AI Open Science Community - a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. Thank you to our Community Leads for organizing and hosting this event.
If you’re interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker.
Join the Cohere For AI Open Science Community to see a full list of upcoming events: https://tinyurl.com/C4AICommunityApp.