🔬Generating Molecules, Not Just Models

Latent Space · Advanced · 🤖 AI Agents & Automation · 4mo ago

Skills: Agent Foundations 90% · LLM Engineering 80% · Tool Use & Function Calling 70% · ML Maths Basics 60%

Key Takeaways

This video discusses the advancements in protein structure prediction and molecular interaction modeling, from AlphaFold2 to the broader landscape of protein design and drug development, using tools like AlphaFold, AlphaFold 2, AlphaFold 3, and Multiple Sequence Alignment (MSA).

Full Transcript

Actually, we only trained the big model once. That's how much compute we had. We could only train it once. And so, while the model was training, we were finding bugs left and right, a lot of them ones that I wrote.
>> And I remember us doing surgery in the middle: stopping the run, making the fix, relaunching. We never actually went back to the start. We just kept training it with the bug fixes along the way, which was
>> impossible to reproduce now. Yeah.
Yeah. No, that model has gone through such a curriculum that it's learned some weird stuff. But somehow, by a miracle, it worked out.
>> It's a pleasure to have with us today Gabriele Corso and Jeremy Wohlwend. They recently founded Boltz, a company trying to democratize state-of-the-art structure prediction and biology and bring it to the masses. They were both recent PhD grads from MIT and have been working on all sorts of foundational papers in generative biology. Anyway, pleasure to have you here. Thanks for coming.
>> Thank you.
>> Thank you.
>> Thank you.
>> I guess we're maybe, what, 6 years post AlphaFold 2 right now, which was kind of a big moment.
>> Is that right? Yeah.
>> I think, was it 2021?
>> So yeah, going on 5 years.
>> 5 years. Yeah. So maybe for the audience, let's go back to that moment in time and explain: what was this big moment and why was it interesting? Why was everyone so excited? And I think you two were probably quite excited, so why were you personally excited?
>> I would start with why that was interesting from a scientific standpoint. So maybe first, an introduction for the ones in the audience who are not structural biologists.
So the idea of structural biology is that we want to try to understand how proteins and other molecules take shape inside our cells and how they interact. Structural biology is this beautiful discipline where we are somehow able to understand these minuscule structures in atomic detail using incredibly complex methods like X-ray crystallography. And the dream of computational biology has always been: can we understand the structures without having to resolve this crystal, shoot X-rays, and so on? And so AlphaFold was a real breakthrough in this problem of protein folding, which is trying to understand the structure of a single protein. To me it was exciting across many dimensions. One, I was a computer scientist. I was working a lot on machine learning, and I saw the impact that work somewhat similar to what I was doing could have on a longstanding scientific problem. And second, from a more personal side, seeing the structures coming out of these models, where you see this beautiful creation of life, was very inspiring to me. So that was one of the things that led me to start working on structural biology, and in particular with machine learning.
>> Were you a structural biologist before AlphaFold came out? I mean, you did machine learning, but it was not in structural biology, so that actually shifted your career quite dramatically.
>> Yeah, very dramatically.
I was working on some pretty theoretical, methodological things, and I was starting to see some of the challenges in doing somewhat theoretical or methodological work, and seeing the potential impact of doing excellent applied work. AlphaFold was really a machine learning breakthrough, but in applied machine learning, and so that led me to want to start working in applied ML.
Our group at the time was working a lot on small molecules already, and I think AlphaFold is what triggered this shift to working on biologics. At the time I think it opened as many questions as it answered, in a sense. The immediate follow-ups were: okay, can we do this on other things than proteins? Can we do interactions of small molecules with proteins, nucleic acids with proteins? Can we model more complex protein systems? And very rapidly after AlphaFold, people realized that machine learning could really target this problem very differently than previous methodologies.
>> Clarification: what does small molecule mean? What does protein mean? What are the terms that you just mentioned?
>> Yeah, maybe we can start with protein. Protein is maybe the most fundamental one. It's what gets decoded out of our DNA.
>> It's essentially a sequence of amino acids. Each amino acid you can consider as what we call a small molecule, and there are 20 of them, at least in the human body. Any composition of these 20 amino acids in a sequence creates a different form of a protein. And so obviously there are a very large number of those sequences that you can create.
Small molecules are, following the name, typically considered to have a much smaller number of atoms. And the atoms that compose them are also generally a bit more diverse. Amino acids have this composition and it's always the same. With small molecules, there's a larger set of possible atoms that we also have to consider, which also makes the problem pretty challenging. And then we have nucleic acids, so DNA and RNA, which are also very interesting to model the structure for, and those are a little bit more similar to proteins. They're composed of four nucleic acids and you form sequences from them, and any codon, which is three nucleic acids, translates into a specific amino acid. So yeah, different forms of molecules; at the end of the day, just a bunch of atoms that are bonded together that we try to understand the interactions of.
>> Going back to the AlphaFold 2 moment, I remember this very well. I was at NeurIPS when I guess the results of this famous competition came out. So do you want to talk about CASP, what it is, and why it was so interesting and exciting?
>> Yeah. So it's every couple of years, and the goal has always been to find protein structures that are a little bit different from what's known. CASP over the years has put in a lot of effort to gather structures from academic groups and even industry groups to try to create a test set that would be difficult for different methods. And CASP14 was when AlphaFold 2 really blew everything out of the water. The improvement was so large over the previous method and also over the previous competitions.
And now CASP continues. We've had CASP15, we have CASP16, and what's happened now is that it's really expanding to all these other modalities like I was mentioning: protein, small molecule, nucleic acid. But the goal remains to really challenge the models: how well do these models generalize? And we've seen in some of the latest CASP competitions that, while we've become really, really good at proteins, basically monomeric proteins, other modalities remain pretty difficult. So it's really essential in the field that there are these efforts to gather benchmarks that are challenging. It keeps us in line about what the models can do or not.
>> Yeah.
>> It's interesting you say that. In some sense, at CASP14 a problem was solved, and pretty comprehensively, right? But at the same time it was really only the beginning. So can you explain what was the specific problem you would argue was solved, and then what is remaining, which is probably quite open?
>> I think we'll steer away from the term "solved," because we have many friends in the community who get pretty upset at that word, and I think fairly so. But the problem that a lot of progress was made on was the ability to predict the structure of single-chain proteins. Proteins can be composed of many chains, and single-chain proteins are just a single sequence of amino acids. One of the reasons that we've been able to make such progress is also that we take a lot of hints from evolution. The way the models work is that they decode a lot of hints that come from evolutionary landscapes.
So if you have some protein in an animal and you go find the similar protein across different organisms, you might find different mutations in them. And as it turns out, if you take a lot of these sequences together and you analyze them, you see that some positions in the sequence tend to evolve at the same time as other positions in the sequence, so there's this correlation between different positions. And it turns out that that is typically a hint that these two positions are close in three dimensions. So part of the breakthrough has been our ability to decode that very, very effectively. But what it implies also is that in the absence of that co-evolutionary landscape the models don't quite perform as well. So I think when that information is available, maybe one could say the problem is somewhat solved from the perspective of structure prediction.
>> When it isn't, it's much more challenging. And I think it's also worth differentiating: sometimes we confound structure prediction and folding a little bit. Folding is the more complex process of actually understanding how it goes from this disordered state into a structured state, and that I don't think we've made that much progress on. But the idea of going straight to the answer, we've become pretty good at.
So there's this protein that is just a long chain and it folds up. Yeah. And so we're good at getting from that long chain, in whatever form it was originally, to
>> the thing, but we don't know how it necessarily gets to that state, and there might be intermediate states that it's in sometimes that we're not aware of.
>> That's right. And that relates also to our general ability to model the different... you know, proteins are not static. They move, they take different shapes based on their energy states.
And I think we are also not that good at understanding the different states that the protein can be in, and at what frequency, what probability.
>> So I think the two problems are quite related in some ways. So yeah, still a lot to solve. But I think it was very surprising at the time that, even with these evolutionary hints, we were able to make such dramatic progress.
>> So I want to ask why the intermediate states matter, but first I want to understand: why do we care what proteins are shaped like?
Yeah, I mean, the proteins are kind of the machines of our body. The way that all the processes we have in our cells work is typically through proteins, sometimes other molecules, sort of intermediate interactions, and through those interactions we have all sorts of cell functions. And so when we try to understand a lot of biology, how our body works, how disease works, we often try to boil it down to: okay, what is going right in the case of our normal biological function, and what is going wrong in the case of the disease state? And we boil it down to proteins and other molecules and their interactions. And so when we try predicting the structure of proteins, it's critical to have an understanding of those interactions. It's a bit like seeing the difference between having a list of parts that you would put in a car and seeing the car in its final form. Seeing the car really helps you understand what it does. Yeah.
>> On the other hand, going to your question of why we care about how the protein folds, or how the car is made, to some extent it's that sometimes something goes wrong. There are cases of proteins misfolding in some diseases and so on. If we don't understand this folding process, we don't really know how to intervene.
>> Okay. And so proteins, when they're in the body, are they typically in that folded state, or are they kind of just doing whatever until they're in a location where they need to interact with something?
That's a great question, and it really depends on the protein. It depends basically on the stability of the protein. There are some proteins that are very stable, and so once they are produced from the ribosome, they fold into this shape and then more or less keep that shape with minor variations.
>> The ribosome is the part of the cell that actually translates and turns DNA to RNA to proteins.
>> RNA to proteins, that final part of RNA to proteins.
>> And so once they come out they're pretty stable. But then on the other hand there are some that, for example, have multiple states that they switch between depending on their environment. Biology has really figured out some incredible machines. There are machines, proteins, where depending on whether, for example, another molecule is present or not, they will take different shapes, and that different shape will give it a different function.
And so we have these so-called fold-switching proteins that take multiple shapes, and we have some proteins that are completely disordered. These disordered proteins are actually pretty important in many diseases, and those are some of the ones that we have the least understanding of.
>> There's this nice line in, I think it's in the AlphaFold 2 manuscript, where they discuss why we're even hopeful that we can target the problem in the first place. And there's this notion that, well, for proteins that fold, the folding process is almost instantaneous, which is a strong signal that we might be able to predict this very constrained thing that the protein does so quickly. And of course that's not the case for all proteins, and there are a lot of really interesting mechanisms in the cells, but I remember reading that and I thought, yeah, that's a somewhat insightful point.
>> I think one of the interesting things about the protein folding problem is that it used to be studied, and this is part of the reason why people thought it was impossible, as a classical example of an NP problem. There are so many different types of shapes that this amino acid chain could take, and this grows combinatorially with the size of the sequence. And so there used to be a lot of more theoretical computer science thinking about and studying protein folding as an NP problem. So it was very surprising also from that perspective, seeing machine learning... So clearly there is some signal in those sequences, through evolution but also through other things that we as humans are probably not really able to understand, but that these models have learned. Yeah.
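To put a number on "grows combinatorially with the size of the sequence," here is a minimal sketch. It assumes, purely for illustration, that every residue independently picks one of three local states; the figure of three and the function name are hypothetical and not taken from the conversation.

```python
def count_conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Count chain conformations if each residue independently picks one local state.

    states_per_residue=3 is an illustrative assumption, not a measured value.
    """
    return states_per_residue ** n_residues


for n in (10, 50, 100):
    total = count_conformations(n)
    # len(str(total)) gives the number of decimal digits in the count.
    print(f"{n} residues: a {len(str(total))}-digit number of conformations")
```

Even under this crude assumption, exhaustive search is out of reach for a protein of ordinary length, which is why hints that narrow the search matter so much.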
So, Andrew White, we were talking to him a few weeks ago, and he said that he was following the development of this and that there were actually ASICs that were developed just to solve this problem, and that there were many, many millions of computational hours spent trying to solve this problem before AlphaFold. And just to be clear, one thing that you mentioned was that there's this co-evolution of mutations and that you see this again and again in different species. So explain: why does that give us a good hint that they're close to each other?
>> Yeah. Think of it this way: if I have some amino acid that mutates, it's going to impact everything around it in three dimensions. And so it's almost like the protein, through several probably random mutations in evolution, ends up figuring out that this other amino acid needs to change as well for the structure to be conserved. So the whole principle is that the structure is probably largely conserved, because there's this function associated with it. And so it's really different positions compensating for each other.
>> I see. So those hints in aggregate give us a lot of information about what is close to each other, and then you can start to look at what kinds of folds are possible given the structure, and then what the end state is, and therefore you can make a lot of inferences about what the actual total shape is.
>> Yeah, that's right. It's almost like you have this big three-dimensional valley where you're trying to find these low-energy states, and there's so much to search through that it's almost overwhelming.
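The co-evolution signal described above can be illustrated with a classical statistic: mutual information between two columns of an alignment of related sequences. This is only a toy sketch. The four-sequence alignment is made up, and modern structure predictors learn from the alignment directly rather than computing this score.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    counts_i = Counter(column(alignment, i))
    counts_j = Counter(column(alignment, j))
    counts_ij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


# Hypothetical toy alignment: positions 1 and 2 always mutate together
# (K with L, R with I), while position 3 varies independently of position 1.
toy_alignment = ["AKLE", "AKLD", "ARIE", "ARID"]

print("columns 1 and 2:", mutual_information(toy_alignment, 1, 2))
print("columns 1 and 3:", mutual_information(toy_alignment, 1, 3))
```

A high score for a pair of columns is the kind of hint that the two positions may be in contact; a score near zero says the alignment carries no such evidence for that pair.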
But these hints maybe put you in an area of the space that's already kind of close to the solution, maybe not quite there yet. And there's always this question of how much physics these models are learning versus just pure statistics. One of the things I believe, at least, is that once you're in that approximate area of the solution space, then the models have some understanding of how to get you to the low-energy state. And so maybe they have some light understanding of physics, but maybe not quite enough to know how to navigate the whole space well. So we need to give it these hints to
>> get it into the right valley, and then it finds the minimum or something. Yeah.
>> One interesting explanation of how AlphaFold works, which I think is quite insightful, though of course it doesn't cover the entirety of what it does, is one that I'm going to borrow from Sergey Ovchinnikov at MIT. The interesting thing about AlphaFold is that it's got this very peculiar architecture that we have since used, and this architecture operates on this pairwise context between amino acids. And so the idea is that probably the MSA gives you this first hint about which amino acids are potentially close to each other.
>> MSA is
>> multiple sequence alignment. Exactly, this evolutionary information.
>> And from this evolutionary information about potential contacts, it's almost as if the model is running some kind of Dijkstra algorithm, where it's decoding: okay, these have to be close; okay, then if these are close and this is connected to this, then this has to be somewhat close. And so you decode this, and that becomes basically a rough pairwise distance matrix.
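The "if these are close, and this is connected to this, then this has to be somewhat close" reasoning is the triangle inequality applied over a graph of contacts. Below is a minimal sketch using Floyd-Warshall, an all-pairs relative of the shortest-path idea in the analogy. The 8.0 contact distance and the tiny contact list are hypothetical, and this is an illustration of the analogy, not a description of what the network actually computes.

```python
INF = float("inf")


def propagate_upper_bounds(n_residues, contacts, contact_dist=8.0):
    """Turn a list of predicted contacts into upper bounds on all pairwise distances.

    Uses Floyd-Warshall: dist(i, j) can be no larger than dist(i, k) + dist(k, j).
    contact_dist is an assumed distance for a contacting pair.
    """
    d = [[0.0 if i == j else INF for j in range(n_residues)] for i in range(n_residues)]
    for i, j in contacts:
        d[i][j] = d[j][i] = contact_dist
    for k in range(n_residues):
        for i in range(n_residues):
            for j in range(n_residues):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Residues 0-1 and 1-2 are in contact; residue 3 has no contact evidence.
bounds = propagate_upper_bounds(4, [(0, 1), (1, 2)])
print("bound on distance 0-2:", bounds[0][2])
print("bound on distance 0-3:", bounds[0][3])
```

Residues 0 and 2 were never observed in contact, yet they inherit a finite bound through residue 1, while residue 3 stays unconstrained.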
From that rough pairwise distance matrix, you decode the actual potential structure.
>> Interesting. So there are kind of two different things going on, the coarse-grained and then the fine-grained optimizations. Interesting. Yeah.
>> Very cool.
>> Yeah. You mentioned AlphaFold 3, so maybe it's a good time to move on to that. AlphaFold 2 came out, and it was I think fairly groundbreaking for this field. Everyone got very excited. A few years later, AlphaFold 3 came out. Maybe for some more history: what were the advancements in AlphaFold 3? And then after that we'll talk a bit about how it connects to Boltz.
Yeah, so after AlphaFold 2 came out, Jeremy and I got into the field, and with many others, the clear problem that was obvious after that was: okay, now we can do individual chains; can we do interactions? Interactions between different proteins, proteins with small molecules, proteins with other molecules. And so
>> So, quick: why are interactions important?
>> Interactions are important because, to some extent, that's the way that these machines, these proteins, have a function. The function comes from the way that they interact with other proteins and other molecules. Actually, in the first place, the individual machines are often, as Jeremy was mentioning, not made of a single chain; they're made of multiple chains, and then these multiple chains interact with other molecules to give them their function. And on the other hand, when we try to intervene on these interactions, think about a disease, think about a biosensor, or many other cases, we are trying to design molecules or proteins that interact in a particular way with what we would call a target protein or target.
And so this problem, after AlphaFold 2, became clearly one of the biggest problems in the field to solve. Many groups, including ours and others, started making contributions to this problem of trying to model these interactions, and AlphaFold 3 was a significant advancement on the problem of modeling interactions. One of the interesting things they were able to do: while some of the rest of the field really tried to model different interactions separately, how proteins interact with small molecules, how proteins interact with other proteins, how RNA or DNA take their structure, they put everything together and trained a very large model with a lot of advances, including changing some of the key architectural choices, and managed to get a single model that was able to set a new state-of-the-art performance across all of these different modalities, whether that was protein and small molecule, which is critical to developing new drugs, protein and protein, understanding interactions of proteins with RNA and DNA, and so on.
>> So, just to satisfy the AI engineers in the audience, what were some of the key architectural and data changes that made that possible?
>> Yeah. So one critical one, which was not necessarily unique to AlphaFold 3, since there were actually a few other teams in the field, including ours, that proposed this, was moving from modeling structure prediction as a regression problem, where there is a single answer and you're trying to shoot for that answer, to a generative modeling problem, where you have a posterior distribution of possible structures and you're trying to sample this distribution. And this achieves two things.
One is that it starts to allow us to try to model more dynamic systems. As we said, some of these proteins can actually take multiple structures, and so you can now model that through modeling the entire distribution. But second, from more of a core modeling perspective, when you move from a regression problem to a generative modeling problem, you are really tackling the way that you think about uncertainty in the model in a different way. So if you think about it, if I'm undecided between different answers, what's going to happen in a regression model is that I'm going to try to make an average of those different answers that I had in mind. When you have a generative model, what you're going to do is sample all these different answers and then maybe use a separate model to analyze those different answers and pick out the best. So that was one of the critical improvements. The other improvement is that they significantly simplified, to some extent, the architecture, especially of the final module that takes those pairwise representations and turns them into an actual structure. That now looks a lot more like a traditional transformer than the very specialized equivariant architecture that it was in AlphaFold 2.
So this is the bitter lesson, a little bit.
>> There is some aspect of the bitter lesson, but the interesting thing is that it's very far from being a simple transformer.
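The point about regression averaging over competing answers can be seen in a toy setting. The sketch below assumes a hypothetical one-dimensional "coordinate" with two equally valid states near -1 and +1, and uses resampling of the data as a stand-in for a trained generative model; none of these names come from the conversation.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical 1D "coordinate" with two equally valid states near -1 and +1.
observed = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]

# The single answer that minimises mean squared error is the mean of the targets.
regression_answer = mean(observed)


def sample_generative(n):
    """Stand-in for a trained sampler: draw answers from the observed distribution."""
    return [random.choice(observed) for _ in range(n)]


def distance_to_nearest_state(x):
    """How far an answer is from the closer of the two valid states."""
    return min(abs(x - 1.0), abs(x + 1.0))


samples = sample_generative(5)
print("regression answer:", round(regression_answer, 3))
print("generative samples:", [round(s, 3) for s in samples])
```

The regression answer sits between the two states and matches neither, while each sample lands close to one valid state. A separate scoring step, as described above, could then pick among the samples.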
I think this field is one of the, I would argue, very few fields in applied machine learning where we still have architectures that are very specialized. There are many people that have tried to replace these architectures with simple transformers, and there's a lot of debate in the field, but I think most of the consensus is that the performance we get from the specialized architecture is vastly superior to what we get through a simple transformer.
>> Yeah.
>> Can you talk a bit about that specialized architecture? I assume you're referring to triangle layers probably as the core idea, or
>> There's something probably quite fundamental about the fact that we model this in a second-order way. Instead of just the sequence, we model every single pair, and then to update every pair we need to have these triangular-type operations. And I think what's interesting about it is a couple of things. One, I think it relates a little bit to what the input is. We talked about these multiple sequence alignments before, and this notion that we need to look at pairs of residues to try to understand maybe this initial distance matrix like Gabri was talking about, and that's something that is very natural to model in 2D. And I think there's also something about the output as well, where supervising over these pairs is also quite powerful. It's this idea of telling the model: hey, these two things are close to one another, these two things are not. And doing that in 3D is maybe a bit more challenging.
When I say 3D, sorry, I mean 1D, where we model the coordinates in three dimensions; doing that in one dimension I think is probably more challenging for the model. And I think it's really survived the test of time. I mean, this thing came out in 2021 and it's largely the same. There's been this change to the structure module, which has been largely simplified, but where a lot of the magic happens, I think it's still in the same place, with this large pairwise interaction modeling.
>> That's maybe the most differentiated portion. The other part that's in AlphaFold 3 is this moving away from modeling just at the amino acid level to actually having the model alternate between atomic-resolution modeling and what's called token-level modeling, which is at the amino acid level. That's also something that was introduced that I think was particularly helpful in modeling these other modalities like small molecules, etc. And this idea of coarse-grained and finer-grained is actually quite popular in other areas as well, so that's maybe not too surprising. But the fact that, for some reason, the models have so much more inductive bias when you go into this 2D representation, I think is very interesting.
>> So you mentioned coarse and fine grain, and that brings to mind the ribbony diagrams of proteins that everyone has probably seen.
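The "triangular-type operations" on the pair representation discussed above can be sketched in a heavily simplified form. The version below is loosely modelled on the outgoing-edge multiplicative update from the AlphaFold 2 paper, reduced to a single scalar channel and with the learned projections, gating, and normalisation all omitted; treat it as an assumption-laden illustration rather than the real layer.

```python
def triangle_update_outgoing(a, b):
    """Simplified triangle update: out[i][j] = sum over k of a[i][k] * b[j][k].

    Every pair (i, j) is updated by looking at a third position k, so the
    information for edge i-j comes from the two other edges of each triangle.
    """
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# With b set to the identity matrix, the only surviving term is k == j,
# so the update returns a unchanged. This is a quick sanity check.
a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(triangle_update_outgoing(a, identity))
```

The three nested indices i, j, and k are where the cubic cost in sequence length comes from.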
Can you actually pull up a molecule and talk about what
>> the different components of that protein are that we're looking at, with the spirals and the arrows and all those? What level of granularity are we looking at? How do we think about that? How does a model think about that?
>> Yeah. So there's actually a little image from our own Boltz platform. I have a protein here, and you actually see both the coarse grain and the finer grain here. We have the ribbon-like structure here that is representing these different amino acids in the protein. But then when we zoom in on this interaction with the small molecule, you see at the atomic level how these things interact with one another. There are even the actual bond interactions shown here. And we go from this very abstract representation of these things, the sequence, the graph of the molecule, and the goal is that every single atom should have a coordinate, and it ends up looking something like this. It's actually pretty elegant. I think something really nice that this field has done is that it's made really beautiful visualizations of stuff, which is really nice to look at. And yeah, this is one example.
>> And so there's, okay, there are ribbons, there are the coily ribbons, there are arrows, there are some sort of not-coily ribbons. What do those mean? How does someone think about those?
>> Yeah, so we can zoom into a few different areas of the protein. This one's actually a good example because there are a few different secondary structures here. So here you have what we call an alpha helix. There are essentially three categories.
There are the alpha helices. This is where it takes a little bit of this ribbon shape. Here there's what we call a beta sheet, which, as the name says, is the ribbon going like this, forming a bit of a sheet. And then you have these more loopy regions, which look more unstructured, and those are the parts of the protein that are most flexible. They are super important. Maybe one of the most canonical drug modalities is antibodies, and antibodies have six of these loops that are largely flexible but come into this fixed structure when interacting with their target. So they're harder to model and really critical to interactions. And yeah, those are largely the three big families.
>> Okay. And as a structural biologist, or just a biologist, when you look at that, you're basically looking at, okay, here's the sheet part, and then you're saying, okay, so that'll be bendy, and then I have these coils. What do those mean to you when you look at them?
>> Yeah, I mean, I should say I am not a structural biologist in any way, shape, or form. But there are certain types of interactions that are more canonically associated with these different types of structures. Yeah.
>> I think a more well-versed structural biologist could give you a more thorough answer than that.
I don't know if you know anything more than I do. But yeah, maybe related to that point, we've seen some of the early successes of protein design in being able to design a binder to any target. A lot of the early success was with these very alpha-helix-centric peptides, which I think are almost like bricks, and the models had a pretty good understanding of those kinds of interactions, so there was good success with that. Then it took a little bit of time to go from that to more exotic binders and things like that. So yeah, there are certainly a lot of important interaction behaviors associated with these structures. >> Another interesting thing, staying on the modeling and machine learning side, which I think is somewhat counterintuitive given some of the other fields and applications, is that scaling hasn't really worked the same way in this field. Models like AlphaFold 2 and AlphaFold 3 are still very large models, but at the same time, in terms of parameters, they're actually not very big. They are definitely below a billion parameters. These days in LLM space, if you hear about a model with less than a billion parameters, you would think it can't do anything.
But on the other hand, when you look at the computational cost of running these models, they are actually a lot more expensive to run than language models because, as Jeremy was saying, instead of having quadratic operations we now have a cubic operation. And so it's interesting how right now in the field, and this is maybe related to having less data or needing more inductive biases, we have this ratio of amount of computation to parameters that is much, much higher than in other places. >> Yeah, if I recall, AlphaFold 2 was like what, 70 million parameters, something like that? >> Yeah, it's something like that. It's quite small, around 100 million or so. Yeah. >> So these decisions of triangle layers and, for AlphaFold 2, this interesting equivariant architecture really were priors that baked in a lot of the physics of the system. And also the co-evolution data: I think people have argued that it is almost like a database lookup of some sort, so that provides in some sense more parameters as well. Yeah, I mean, it's definitely more that the amount of pure compute, the FLOPs, is very high, and it's almost more reasoning-based than just information extraction. I think part of the reason LLMs are so large isn't just their reasoning capability, but also the sheer quantity of information that they store. And I think here there's a little bit less of that; it's more about decoding this input rather than memorizing as much of it. >> So is there like a loop in the architecture that allows it to compute more per parameter? How does that work?
Part of it is exclusively this fact that instead of having operations that operate on the single chain, they operate on the pairwise representation, and so instead of having a quadratic number of interactions you have a cubic number of interactions. That on its own leads you to have smaller representation sizes but more representations, which leads to more FLOPs but fewer parameters. On the other hand, there is actually also this idea, somewhat similar to reasoning, where you recycle this operation. AlphaFold 2, but also AlphaFold 3, have this interesting framework where, as we were discussing, the input to the model is this initial understanding of the interactions, either from the evolution in the multiple sequence alignment or potentially from what we call templates, which are basically database lookups of similar structures. How the model works is that it decodes these and tries to understand a good potential rough structure of the pairwise interactions, and then what you can do is this recycling, where you feed this understanding back to the input of the model and then try to decode it again. People do this three or four times, and in some cases I've even tried to do it tens of times. So you can see it as a very, very early version of reasoning. >> Yeah. So AlphaFold 2, really cool. AlphaFold 3, really cool. But AlphaFold 3 came with a catch, and I think this catch was important for the development of Boltz and so on. >> Yeah, the catch was that it was an amazing Nature paper, but unfortunately they decided not to release the model.
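The two ideas from this exchange, cubic-cost updates on a pairwise representation and recycling the output back into the input, can be illustrated with a toy sketch. This is not AlphaFold or Boltz code: `triangle_update`, `predict_with_recycling`, and `random_features` are hypothetical names, the update rule is a made-up stand-in for the learned triangle layers, and the random matrix only plays the role of the MSA/template-derived pair features.

```python
import random


def triangle_update(pair):
    """Toy 'triangle' step: refresh pair[i][j] using every third residue k.

    Returns the new N x N matrix and the number of multiply-adds performed.
    """
    n = len(pair)
    updated = [[0.0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):  # third loop over residues -> cubic cost
                total += pair[i][k] * pair[k][j]
                ops += 1
            mean = total / n
            updated[i][j] = mean / (1.0 + mean)  # squash to keep values bounded
    return updated, ops


def predict_with_recycling(features, num_recycles=3):
    """Run the same update several times, feeding each output back into the input."""
    n = len(features)
    recycled = [[0.0] * n for _ in range(n)]  # nothing to recycle on the first pass
    total_ops = 0
    for _ in range(num_recycles + 1):  # one initial pass plus the recycles
        # Combine the fixed input features with the previous pass's estimate.
        mixed = [
            [0.5 * (features[i][j] + recycled[i][j]) for j in range(n)]
            for i in range(n)
        ]
        recycled, ops = triangle_update(mixed)
        total_ops += ops
    return recycled, total_ops


def random_features(n, seed=0):
    """Hypothetical stand-in for MSA/template-derived pair features."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    for n in (8, 16):
        _, ops = predict_with_recycling(random_features(n), num_recycles=3)
        print(f"N={n}: {ops} multiply-adds, 0 learned parameters")
```

With three recycles, going from 8 to 16 residues raises the count from 2,048 to 16,384 multiply-adds, an eightfold jump for a doubling in length, while the number of learned parameters in this toy stays at zero. That is the "more FLOPs but fewer parameters" trade-off in miniature.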
AlphaFold 2 was open source and has since been used by, I think the reported number is, more than a million scientists. With AlphaFold 3, for commercial reasons, DeepMind, which has since spun off Isomorphic Labs, which is now trying to become a new kind of pharmaceutical company, decided to keep the model internal and only use it internally. We were in the field, building on top of models like AlphaFold, and so now we no longer had the base starting point to build on top of. But even more importantly, everyone in both academic research and industry no longer had access to these incredible models, which were really useful for trying to understand biology but also for trying to develop new therapeutics. And so we decided to take the matter into our own hands and try to obtain a model of similar accuracy. Largely using a lot of the information that was in the AlphaFold 3 manuscript, we went ahead and built Boltz-1, which was the first fully open-source model to approach the level of accuracy of AlphaFold 3. Along the way, and we can talk about it more, we realized that it was probably too ambitious to see this as an academic project, and there were a lot of things that were missing. So we decided to also start a public benefit company to push this mission of democratizing access to these models that we started with Boltz-1. >> Quick interjection: I remember this, it was actually shocking how fast you got Boltz-1 out. It was just like two or three months, right? >> I think we started in late May and it came out in November, if I remember correctly. So slightly longer, but yeah, it was relatively quick.
I mean, for what it's worth, >> we were working on some similar ideas at the time. For example, this idea of having a diffusion model on top of this pairwise trunk was something that we were exploring independently. Now, when the paper came out, it was really clear, especially for example on the data pipelines, that there was so much we were not really doing, and so there was a lot to catch up on. But we were already in a place where we had some experience working with the data and with these types of models, and I think that put us in a good place to produce it quickly. I would even say I think we could have done it quicker. The problem was that for a while we didn't really have the compute, so we couldn't really train the model, and actually we only trained the big model once. That's how much compute we had. We could only train it once. And so while the model was training, we were finding bugs left and right, a lot of them that I wrote. I remember us doing surgery in the middle: stopping the run, making the fix, relaunching. And yeah, we never actually went back to the start. We just kept training it with the bug fixes along the way, which was >> impossible to reproduce now. >> Yeah, yeah. No, that model has gone through such a curriculum that it's learned some weird stuff. But yeah, somehow by a miracle it worked out. >> The other funny thing is that the way we were training most of that model was through a cluster from the Department of Energy, but that's a shared cluster that many groups use. And so we were basically training the model for two days, and then it would go back into the queue and stay a week in the queue.
And so it was pretty painful. Towards the end, I caught up with Deon, the CEO of Genesis, and I was telling him a bit about the project and about this frustration with the compute. Luckily, he offered to help, and so we got help from Genesis to finish up the model. Otherwise it probably would have taken a couple of extra weeks. >> Yeah. >> Yeah. Boltz-1: how did that compare to AlphaFold 3? And then there's some progression from there. >> Yeah. So I would say Boltz-1, but also this other set of models that came around the same time, were a big leap from the previous open-source models and were really approaching the level of AlphaFold 3. But I would still say that even to this day there are some specific instances where AlphaFold 3 works better. I think one common example is antibody-antigen prediction, where AlphaFold 3 still seems to have an edge in many situations. Obviously these are somewhat different models. You run them, you obtain different results. So it's not always the case that one model is better than the other, but in aggregate, especially at the time, AlphaFold 3 still has a bit of an edge. >> We should talk about this more when we talk about Boltz, but how do you know one model is better than the other? I make a prediction, you make a prediction, how do you know? >> Yeah, so the great thing about structure prediction... Once we go into the design space, designing new small molecules or new proteins, this becomes a lot more complex.
But a great thing about structure prediction is that, a bit like CASP was doing, the way you can evaluate models is that you train the model on structures that were released across the field up until a certain time. One of the things we didn't talk about that was really critical in all this development is the PDB, the Protein Data Bank, which is this common resource, basically a common database where every biologist publishes their structures. So we can train on all the structures that were put in the PDB until a certain date, and then we basically look for recent structures and ask: okay, which structures look pretty different from anything that was published before? Because we really want to understand generalization. And on these new structures we evaluate all these different models. >> So you just know when AlphaFold 3 was trained, and you intentionally train to the same date or something like that. >> Exactly. >> Right. Yeah. >> And so this is the way that you can somewhat easily compare these models. Obviously that assumes that the training set >> You've always been very passionate about validation. I remember DiffDock, and then there was DiffDock-L and DockGen. You've thought very carefully about this in the past. Actually, I think DockGen is a really funny story. I don't know if you want to talk about that. >> Yeah, I think one of the amazing things about putting things open source is that we get a ton of feedback from the field. Sometimes we get great feedback from people really liking the model. But honestly, most of the time, and to be honest that's also maybe the most useful feedback, it's people sharing where it doesn't work.
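The time-split evaluation described above (train on everything released up to a cutoff date, then test only on later structures that look different from the training set) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `Entry`, `similarity`, and `temporal_split` are hypothetical names, the 0.4 threshold is arbitrary, and a crude string ratio from Python's `difflib` stands in for the sequence or structure clustering a real benchmark would use.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entry:
    entry_id: str    # hypothetical identifier, not a real PDB code
    released: date   # release date of the structure
    sequence: str    # one-letter amino acid sequence


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for real clustering tools."""
    return SequenceMatcher(None, a, b).ratio()


def temporal_split(entries, cutoff, max_similarity=0.4):
    """Train on entries released up to the cutoff; test on later, dissimilar ones."""
    train = [e for e in entries if e.released <= cutoff]
    recent = [e for e in entries if e.released > cutoff]
    test = [
        e for e in recent
        # Keep a recent entry only if it is unlike everything in the training set.
        if all(similarity(e.sequence, t.sequence) <= max_similarity for t in train)
    ]
    return train, test


if __name__ == "__main__":
    entries = [
        Entry("a1", date(2020, 3, 1), "MKTAYIAKQRQISFVKSHFSRQ"),
        Entry("a2", date(2021, 6, 1), "GSHMLEDPVDAFKLLNE"),
        Entry("b1", date(2023, 2, 1), "MKTAYIAKQRQISFVKSHFSRQL"),  # near-copy of a1
        Entry("b2", date(2023, 5, 1), "WWWWCCCCPPPPHHHH"),         # unlike a1 and a2
    ]
    train, test = temporal_split(entries, cutoff=date(2021, 12, 31))
    print([e.entry_id for e in train], [e.entry_id for e in test])
```

Using the same cutoff for every model being compared is what makes the comparison fair; the similarity filter is what turns it into a test of generalization rather than recall.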
And so at the end of the day it's critical, and this is also true across other fields of machine learning: to make progress in machine learning you need to set clear benchmarks, and as you start making progress on certain benchmarks, you need to improve the benchmarks and make them harder and harder. This is the progression of how the field operates. The example of DockGen was that we published this initial model called DiffDock in my first year of PhD, which was one of the early models to try to predict interactions between proteins and small molecules, and which we released about a year after AlphaFold 2 was published. Now, on the one hand, on the benchmarks that we were using at the time, DiffDock was doing really well, outperforming some of the traditional physics-based methods. But on the other hand, when we started giving these tools to many biologists, and one example that we collaborated with was the group of Nick Polizzi at Harvard, we started noticing that there was this clear pattern where, for proteins that were very different from the ones it was trained on, the model was struggling. And so it seemed clear that this is probably where we should put our focus.
And so we first developed, with Nick and his group, a new benchmark, and then went after it and said: okay, what can we change about the current architecture to improve this pattern and generalization? And this is the same thing that we're still doing today: where does the model not work? And then once we have that benchmark, let's try to throw any ideas that we have at the problem. >> And there's a lot of healthy skepticism in the field, which I think is great. It's very clear that there are a ton of things the models don't really work well on, but I think one thing that's probably undeniable is the pace of progress and how much better we're getting every year. And so if you assume any constant rate of progress moving forward, I think things are going to look pretty cool at some point in the future. [inaudible] was only 3 years ago. >> Yeah, I mean, it's wild. >> What? >> Yeah, yeah. It's one of those things where, even being in the field, you don't see it coming. And hopefully we'll continue to have as much progress as we've had the past few years. >> So, this is maybe an aside, but I'm really curious. You get this great feedback from the community, right, by being open source.
My question is partly: okay, if you open source, then everyone can copy what you did. But it's also maybe about balancing priorities, right? Where you're like: all my customers are saying "I want this," and there are all these problems with the model, yeah, yeah, but my customers don't care, right? So how do you think about that? >> Yeah, so I would say a couple of things. One is that part of our goal with Boltz, and this is also established as the mission of the public benefit company that we started, is to democratize access to these tools. But one of the reasons why we realized that Boltz needed to be a company, and couldn't just be an academic project, is that putting a model on GitHub is definitely not enough to get chemists and biologists across academia, biotech, and pharma to use your model in their therapeutic programs. And so a lot of what we think about at Boltz, beyond just the models, is all the layers that come on top of the models to get from those models to something that can really enable scientists in the industry. And so that goes into building the right workflows that take in, for example, the data and try to answer kind of direc

Original Description

This episode traces the remarkable journey from AlphaFold2’s landmark achievement in protein structure prediction to the broader landscape of molecular interaction modeling and protein design. The problem AlphaFold2 addressed—predicting the structure of single-chain proteins—was long considered intractable due to its perceived NP-hard nature. The breakthrough came not only from advances in machine learning but also from leveraging evolutionary data to infer co-evolution of amino acids, providing powerful hints about spatial proximity in protein structures. Yet, as the guests explain, the field quickly moved beyond this milestone toward more complex questions, like how proteins interact, how they fold dynamically, and how to model these interactions with small molecules, RNA, and DNA. AlphaFold3 marks a critical shift in this evolution, moving from static structure prediction to modeling heterogeneous molecular interactions. Rather than treating these interactions as isolated problems, AlphaFold3 unifies them within a single model trained across modalities. This progress also reflects a broader trend in machine learning: the shift from regression-style prediction to generative models capable of expressing uncertainty and capturing system dynamics. By sampling from a distribution of plausible structures and interactions, these models allow researchers to better understand the flexibility and variability of biological systems. However, such models also introduce new challenges, particularly around validation and ranking of generated outputs. Enter Boltz and its suite of tools, which aim to democratize access to these cutting-edge capabilities. Boltz builds on open-source principles and a strong community foundation to deliver models that are both state-of-the-art and accessible, with a focus on usability, extensibility, and real-world validation. Boltz2 and BoltzGen combine structure prediction, affinity estimation, and generative design in one pipeline, enabling use

This video teaches the advancements in protein structure prediction and molecular interaction modeling, from AlphaFold2 to the broader landscape of protein design and drug development. It covers the tools and techniques used in this field, including AlphaFold, AlphaFold 2, AlphaFold 3, and Multiple Sequence Alignment (MSA). The video also discusses the challenges and limitations of current models and the future directions of research in this field. By watching this video, viewers can gain a deep

Key Takeaways

Build protein structure prediction models using AlphaFold and other tools
Design proteins with specific functions using machine learning models
Develop machine learning models for molecular interaction modeling
Apply machine learning models to molecular interaction modeling
Use Multiple Sequence Alignment (MSA) to analyze protein sequences
Use diffusion models and retrieval augmented generation to improve model performance

💡 The use of machine learning models, such as AlphaFold and AlphaFold 2, has revolutionized the field of protein structure prediction and molecular interaction modeling, enabling the design of proteins with specific functions and the development of new drugs.
