Recursive Reasoning with Tiny Networks
Key Takeaways
The video discusses the paper 'Less is More: Recursive Reasoning with Tiny Networks', which explores applying a very small network recursively to solve reasoning problems. It compares the earlier Hierarchical Reasoning Model (HRM) with the paper's Tiny Recursive Model (TRM).
Full Transcript
Hi, I'm Alfonso Ggera, founder of Apocalypse Software. We do productivity tools. This is the West Coast Machine Learning group, and we're discussing the paper "Less is More: Recursive Reasoning with Tiny Networks" by Alexia Jolicoeur-Martineau. Sorry for butchering your name. It focuses on a new technique for reasoning in machine learning. It is based on another one called the Hierarchical Reasoning Model, which, as it says in the abstract (I'm just reading part of it), uses two small networks recursing at different frequencies. Instead of building a large, memory-intensive network that takes a lot of time to generate, it creates a very small model. Instead of using a large network to provide reasoning that has been precomputed, it generates the answers it needs on the fly by repeatedly recursing into the network and refining its reasoning as it goes along.
What this paper presents is a derivative of that called the Tiny Recursive Model, which goes even further along that line, because the authors believe there are optimizations that can be made to the hierarchical model, which we'll call HRM. In the paper they call theirs the Tiny Recursive Model, or TRM. They claim that a simpler recursive reasoning approach does even better than HRM with a tiny network of two layers and 7 million parameters, and they claim better accuracy on several of the benchmarks typically used for large language models and other machine learning models. They first talk about certain drawbacks that LLMs have when answering certain types of questions.
So they incorporate other techniques, such as chain of thought and test-time compute. They describe how these processes work and the other things that must be done on top of the LLM architecture to get better results on the benchmarks. Then they bring up the model proposed in another paper, HRM, which yielded better results with a smaller model because it depended on two primary techniques. One was recursive hierarchical reasoning, which iterates through these networks repeatedly and uses certain hyperparameters to let the operator specify how it should operate. In the paper they present two notations, f_L for the high-frequency network and f_H for the low-frequency one, to determine how to iterate through these networks. The networks generate two outputs, z_H from the high-level (low-frequency) network and z_L from the low-level (high-frequency) one, and those are then used as inputs to the networks again. The authors of the HRM paper present their model as based on how biological systems process information, so they are basically trying to build a technology on top of a biological model. The other technique the HRM paper presents is deep supervision, and this I'm a little fuzzy on: the outputs generated as it recursively iterates through the model are used to review its work and make corrections, or guidance calculations, to yield better results. These two features on top of their tiny models got better results.
In this paper they talk about how an analysis from a separate group examined HRM and determined that deep supervision was the main way HRM was able to improve its results vis-à-vis the LLMs. They mention that with deep supervision the accuracy on one benchmark went from 19% to 39%, while the other technique presented in that paper, recursive hierarchical reasoning, was only slightly better, because they tested the two independently. But in this paper they determined that recursive reasoning could be improved substantially with some other techniques, so they present the Tiny Recursive Model, with an even smaller network than the HRM paper uses.
Let me scroll forward here. This was the pseudocode for the HRM model. It begins here in the training, and this is the part that does the deep supervision. Here is where it calls the recursive reasoning. The supervision is where it determines the loss and the processing for that, so that the results on the next iteration are improved. Let me stop highlighting this one.
At this point in the paper they're talking about where they can yield better performance: HRM is already significantly better than typical LLM work, and improvements here would be even better. They claim, where is it? Yeah, they claim they were able to improve accuracy on Sudoku-Extreme from 55% to 87% over HRM, which was already better than LLMs. The same goes for Maze-Hard and the other benchmarks used here.
>> Yeah.
>> So yes.
>> Yeah.
A quick question: did you dive into deep supervision and what exactly it is saying? I'm not sure I have a correct understanding of it.
>> Yeah, that's the part I wasn't too clear on. Here's the pseudocode for it. They're getting the data from the recursive pass through the network to extract an answer. Then they determine, I guess, the loss, how far off the answer is from what's optimal, and then they generate (is that complete?) a step to improve it for the next iteration, to optimize the direction the model is going, and that's used for the next step through the network. Let me see. Well, this is the deep supervision from the original HRM model, and
>> Right.
>> in this paper they're not specifically addressing how that works. They're focusing on the recursive
>> Okay, okay.
>> reasoning. I didn't take a look at the original paper, but it seems that this is navigating the results generated through each iteration through the
>> through the network, to kind of guide it toward where the correct answer would be.
>> Okay. So
>> If you guys want, I can try to shed a little bit of light on this.
>> Yes,
>> feel free, please.
>> So I think what we want to do is disentangle two different aspects of this. One, this is a recursive transformer formulation. This is not the first recursive model. A really simple recursive transformer would be: you say, hey, I have this GPT-2 model and it has 24 layers. I'm going to replace it with a model that has only one layer, but I'm going to repeat it 24 times.
And if you do that, you can get something like 90% of the performance of GPT-2 at 1/24th the storage for weights. You can't get the same performance, but I was still surprised. I don't really have a mental model for how you can even get 90%. So the hierarchical recursive stuff in that top part basically has two for loops, and it just has to do with how it repeats its layers and in what order. If we set that aside for a second, then deep supervision just means we're going to run our model 16 times or whatever. During training we are actually updating the weights each of those 16 times. So it's a little bit like running 16 epochs, except for one huge difference. When you train any old model, like the old CNNs, AlexNet or whatever, each time you start an epoch you're starting from scratch. In this case there's hidden state in there, and what they do is run the 16 in a row on the same data and keep the state between each of those runs. That's why they don't just call it supervised learning, and why it has this name, deep supervision: if they were resetting the state every time, it would be identical to regular old supervised learning. But in this case it is different, so it does kind of require a different name.
>> Okay.
>> So am I understanding correctly, then, that if the number of iterations is 16, it's kind of backpropagating through those 16 layers?
>> They don't do that, as a way of speeding up and simplifying the process. They do one of the 16 and do backprop on just that one run. Then they do the second of the 16 and do backprop on just that one, not the first two, just the one all by itself. Then they do the third one and do backprop.
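The training loop described here can be sketched on a toy problem. This is a minimal illustration, not the HRM or TRM code: the scalar "model" `z + w * (x - z)`, the function name `deep_supervision`, the learning rate, and the squared-error loss are all assumptions made for the example. What it tries to show is the mechanic from the discussion: the state `z` is carried from one run to the next, but each gradient is computed for a single run with the incoming state treated as a constant, and the weight is updated after every run.

```python
def deep_supervision(x, target, w=0.1, lr=0.1, n_sup=16):
    """Toy scalar model trained with one weight update per supervision run."""
    z = 0.0            # carried state (the current answer); never reset between runs
    losses = []
    for _ in range(n_sup):
        z_prev = z                       # treated as a constant, like a detached tensor
        z = z_prev + w * (x - z_prev)    # one forward run refines the previous answer
        loss = (z - target) ** 2
        # Gradient with respect to w for this run only;
        # nothing flows back into the earlier runs.
        grad_w = 2.0 * (z - target) * (x - z_prev)
        w -= lr * grad_w                 # the weight is updated after every run
        losses.append(loss)
    return w, z, losses


w, z, losses = deep_supervision(x=1.0, target=1.0)
print(len(losses), losses[0] > losses[-1])
```

Resetting `z` to `0.0` at the top of every run would turn this back into ordinary supervised learning on the same example, which is the distinction drawn in the discussion.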
So it's really 16 separate runs, as it were, in the sense that the gradients depend only on that one of the 16. However, the starting point is not identical for all 16 of them. The starting point uses data from the end point of the previous one.
>> Gotcha.
>> Is it fair to say that, except for the first one, each of the successive runs is correcting for the error? Or are they still running from the original data?
>> Yeah. They are not correcting the error in the literal sense, the way gradient boosting does, where you actually subtract your prediction from the correct answer and then say, I'm going to predict that error. Rather, the mental model I think most people agree on is that they're refining the answer from the previous step. If you're solving Sudoku and the previous answer had a bunch of ones, twos, threes, nines, then you say: now you're allowed to change some of the numbers. Do you want to change some of them to try to make the answer even better? Does that answer your question?
>> Yeah. Yeah. So I'm trying to think of a model. If you guys remember StyleGAN (there are other models, but StyleGAN is one that had a certain input that was just a constant): all the images it generated started with this square set of numbers, and as it went through the layers the input information would condition that. Ultimately, with the knobs you set on StyleGAN, you would get the face, male, female, older, younger, color, hair, all that stuff, but the input was a constant. When you're training StyleGAN, that input was learned, but basically you're starting from scratch every time you do a training instance.
Imagine if, instead of starting from scratch, you were somehow able to start with the tentative answer from the previous run, and you inserted that at the beginning of the pipeline instead of starting with a noise vector. In this way it's iteratively refining the answer, and it's a very different way of calculating than a traditional CNN, a traditional transformer, pretty much a traditional anything.
>> Ted, do you have any intuition why? The mental model I'm seeing is that it's kind of like we're training one layer at a time, going up by preconditioning on the output of the previous layer. Do you have any intuition why that is better, or why they selected that rather than training the whole stack and backpropagating through the multiple layers at once? Does that make sense?
>> Yeah. I think Alfonso will get to this: in the paper there are some changes they made from HRM to TRM, and they actually include more backpropagation. But if you have an 18-layer model and you repeat it 16 times, I don't know offhand what 16 times 18 is, but it's almost 300. So if you have 280-something effective layers, then backpropagating through all 300 of them would be kind of expensive from a memory perspective.
>> And it would be a little bit slow. But to the question you're asking: yes, I do believe it would work if, instead of doing deep supervision one at a time, they did all 16 at once. The model might actually train slightly faster, but at something like a 16-fold increase in the GPU memory needed to train the sucker.
>> Gotcha.
>> And so they were able to do these experiments, I think, on just a consumer GPU.
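The memory argument is simple arithmetic, and a small sketch makes the numbers concrete: 16 runs of an 18-layer model give 16 * 18 = 288 effective layers. The function name `layers_in_backprop_graph` is hypothetical, and the sketch assumes, as a rough rule of thumb only, that the activation memory needed for a backward pass grows in proportion to the number of layers whose activations must be kept.

```python
def layers_in_backprop_graph(layers_per_run, runs, all_at_once):
    """Layers whose activations must be retained for a single backward pass."""
    if all_at_once:
        # Backpropagating through every run keeps every run's activations alive.
        return layers_per_run * runs
    # Deep supervision one run at a time only needs the current run.
    return layers_per_run


one_at_a_time = layers_in_backprop_graph(18, 16, all_at_once=False)
all_together = layers_in_backprop_graph(18, 16, all_at_once=True)
print(one_at_a_time, all_together, all_together // one_at_a_time)
```

Under that assumption the ratio is the 16-fold increase mentioned in the discussion.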
So,
>> and would you say that, with this architecture, these models are better for processing on the edge, because they don't have the memory requirement? It's basically computing the network as it goes along, which is only temporary. I don't know if it would be faster, but it's certainly using a smaller network than regular LLMs would need. So this is basically for mobile devices and other devices at the edge.
>> Yeah, I think most of the research in recursive transformer architectures is thinking about smaller devices. But again we need to be careful. This is more complicated, but a super simple version would be that you have one layer and you repeat it 24 times. Okay. Yeah,
>> it's a lot smaller memory-wise, but it is still just as much compute as a model that has 24 layers. It just needs less memory because you're repeating the same layer 24 times, and it's going to be just as slow compute-wise. So it's not quite the silver bullet when it comes to smaller devices. Suppose you said, "Oh, wow, I figured out a way to take GPT-5," which, you know, we don't know, but it's ginormous, right? If you said, "I can just repeat three layers a thousand times and get the same performance as GPT-5," that would still be an amazing result, but you still couldn't run it on your phone. It would fit on your phone, but it would output something like one token every 5 seconds, because you have to run those three layers a thousand times to get one token, and then run them a thousand times again to get the next token. So it solves part of the problem, but it's not the silver bullet. Yeah. So in some respects it's more like compiled versus interpreted code, or Python versus C++,
in that with these recursive models, they have to figure it out every time along the way on edge devices, on mobile devices, versus LLMs, which have it precomputed to a large extent and just build the answer from that.
>> Yeah. It's like if I told you, hey Alfonso, I wrote a program that can compute all the factorials up to 10 factorial, but I just hand-wrote a bunch of multiplies. And you say, dude, I can write a for loop and it'll be only three lines of code and do the same thing yours does. We would say your program is definitely easier to understand, more efficient, whatever. But at the same time, it won't necessarily run any faster than my ugly thing that has all the multiplies handwritten.
>> No, but is that going to be at the time you're creating the inferences from it? Is the difference in computation, or in generating the tokens, going to show up when they're actually querying the models?
>> Yeah, so that's my point. Again, this is a little more complicated, but if you have one layer you repeat 24 times, it takes basically the same amount of time to train as a 24-layer model, and it takes the same amount of time to spit out a token as a 24-layer model. It's just 24 identical layers instead of 24 unique layers. The effective number of matrix multiplies you need to do is just the same.
>> Okay.
>> And again, I'm not trying to be overly dismissive. I love that there's research on something that's different.
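The memory-versus-compute point can be checked with a small pure-Python sketch. Everything here is illustrative: a "layer" is just a square weight matrix, and the helper names `make_layer`, `run_stack`, and `stored_weights` are made up for the example. It compares a stack of 24 distinct layers with one layer reused 24 times, counting the weights that must be stored and the scalar multiplications one forward pass performs.

```python
import random


def make_layer(dim, seed):
    """Return a dim x dim weight matrix with small random entries."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def run_stack(layers, vector):
    """Apply each layer in turn and count the scalar multiplications performed."""
    multiplies = 0
    for layer in layers:
        vector = [sum(w * v for w, v in zip(row, vector)) for row in layer]
        multiplies += len(layer) * len(layer[0])
    return vector, multiplies


def stored_weights(layers):
    """Count the weights that must be kept in memory; a reused layer counts once."""
    unique = {id(layer): layer for layer in layers}
    return sum(len(layer) * len(layer[0]) for layer in unique.values())


DIM, DEPTH = 8, 24
untied = [make_layer(DIM, seed) for seed in range(DEPTH)]  # 24 unique layers
tied = [make_layer(DIM, 0)] * DEPTH                         # one layer repeated 24 times

x = [1.0] * DIM
_, untied_mults = run_stack(untied, x)
_, tied_mults = run_stack(tied, x)
print(stored_weights(untied), stored_weights(tied))  # storage differs by a factor of 24
print(untied_mults, tied_mults)                      # the multiplication counts match
```

The reused layer cuts storage by the depth of the stack while leaving the amount of arithmetic unchanged, which is why weight sharing helps with footprint but not with latency.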
But I think it's worth understanding the one really big trick that people use with recursive models, which they talk about in the HRM paper and in this paper: if you repeat this one thing 24 times, there's the possibility you realize, hey, I actually have a pretty good answer and I don't need to repeat it anymore. So you can stop early. You can say, I'll stop after 4, I'll stop after 10, I won't have to do all 24. That's one of the big potential wins. Other than that, though, in my opinion it's not really going to solve your edge-device phone problem, because it's still going to be just as slow.
>> Well, it solves one part of it, the footprint part of it.
>> Yeah.
>> Yeah. So in that respect, you can have several different application-specific models on a phone rather than one general-purpose model that tries to solve everything.
>> Mhm. Yeah, small is good. There's definitely a benefit to small. Okay.
>> Well, this is their flowchart for the architecture and how they do it. As Ted was saying, it keeps local variables of the state as it's processing, and on each iteration through the networks it's, I guess, fine-tuning its answer to come up with the correct solution. But here again is the paper for HRM, the ideas behind it, and the way they recurse through the two frequencies, which had its origin in their biological model.
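The early-stopping idea can be sketched with a generic refinement loop. This is an assumption-laden stand-in: in the papers the decision to halt is learned, whereas this sketch simply stops when the answer changes by less than a tolerance. The function name `refine_until_stable`, the tolerance, and the square-root refinement step used as the example are all illustrative choices.

```python
def refine_until_stable(step, state, max_steps=24, tol=1e-9):
    """Repeat one refinement step, stopping early once the answer stops changing."""
    for used in range(1, max_steps + 1):
        new_state = step(state)
        if abs(new_state - state) < tol:
            return new_state, used       # good enough: skip the remaining repeats
        state = new_state
    return state, max_steps


# Example refinement step: the Babylonian update for the square root of 2.
root, used = refine_until_stable(lambda guess: 0.5 * (guess + 2.0 / guess), 1.0)
print(root, used)
```

The step budget is 24, but the loop returns as soon as further repeats would not change the answer, which is where the potential compute saving comes from.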
As he said, they start from the initial state: they initialize the embeddings, and then through each recursive step they take the results from the deep supervision step and (where is it?) apply them to the recursive reasoning model, whose code I already went over. They talk about the results from the original paper and some of the ways they saw they could optimize it, or the potential avenues they could go down instead. I guess this is where it's doing the examination between the steps to yield the loss and decide how to redirect the parameters for the next iteration. Let me see. Okay. So HRM uses gradient approximations instead of backpropagation through time. They basically gave an overview of how HRM did its job and how it worked, then they looked at that and said what they were going to address in their Tiny Recursive Model.
Okay. So they talk about some of the decisions the HRM paper took. They only backpropagate through the last two recursions in their model. And as Ted said, when they find they are at a decent point in their answer, they can stop processing and generate the tokens. And this paper's... Is anybody familiar with the implicit function theorem, what that means and how it works?
>> I don't know the theorem well, but the idea is this. In a recursive neural network, what people think is happening is that the model is slowly converging on an answer.
It's a little bit like, what do you call those in differential equations, attractors or whatever, where you have a fixed point, and once you go there you don't leave, you just stay there. The authors wanted to make these really fast, so they actually didn't repeat that many times. You see two, three, six, relatively small numbers. In other cases where people have tested recursive models, maybe they'll repeat them 10 times, 20 times. At that point the difference between 20 times and 100 times is still in the general genre of a lot. If you had some computation that wasn't converging to a fixed point, if you were just adding two every time, then the difference between 20 and 10 and the difference between 20 and 30 would be huge. So they say these things can only work if they're slowly converging to a fixed point. I don't know the exact theorem, but that's the idea here. I can't think of a great example, but suppose you wrote an old program that approximates the value of pi and you give it an initial guess. If your initial guess was one, it outputs 3.1. If your initial guess was two, then maybe it outputs 3.13 or something like that. If you let it run longer, it might have gotten a little closer. But basically, if you were doing backprop, you wouldn't necessarily have to backprop through all the steps the model took. You might simply say, all I really care about is that the last answer was 3.1, and I'm going to compare that to my ground-truth label and backprop based on that last little part. I think that's a super high-level intuition for this.
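A toy recurrence can make this intuition concrete. The sketch assumes a scalar update `z_next = a * z + w * x` with `0 < a < 1`, which converges to the fixed point `z* = w * x / (1 - a)`; this recurrence and the three function names are illustrative and are not taken from either paper. It compares the gradient of the final state with respect to `w` when backpropagating through every step, when looking only at the last step, and at the exact fixed point.

```python
def grad_through_all_steps(x, a, steps):
    """d z_K / d w for z_{k+1} = a * z_k + w * x, starting from a constant z_0."""
    dz_dw = 0.0
    for _ in range(steps):
        dz_dw = a * dz_dw + x    # chain rule applied through every iteration
    return dz_dw


def grad_last_step_only(x):
    """Same derivative when the previous state is treated as a constant."""
    return x


def grad_at_fixed_point(x, a):
    """Derivative of the fixed point z* = w * x / (1 - a) with respect to w."""
    return x / (1.0 - a)


x, a = 1.0, 0.5
for steps in (2, 6, 60):
    print(steps, grad_through_all_steps(x, a, steps))
print(grad_last_step_only(x), grad_at_fixed_point(x, a))
```

With many steps the full gradient approaches the fixed-point value, so reasoning about the converged state is justified there. After only two steps it is still noticeably short of that value, which is the situation in which assuming convergence becomes questionable.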
The whole point of this was to save memory and to save compute. And this is where she's saying in this paper: since you ran it for so few iterations, like two, I don't think you can assume that it has approximately converged to that fixed point. So we're going to do what you asked about, Roger. We're just going to do backprop through time on all of it, instead of doing only the last step and assuming that's a reasonable approximation because we're already pretty close to our final destination.
>> Thanks, Ted. I think I follow that now. She's saying that here: while the application of the IFT and the one-step gradient approximation to HRM has some basis, since the residuals do generally reduce over time, a fixed point is unlikely to be reached when the theorem is actually applied. So they bypass the IFT and the one-step gradient, and don't make that question part of their model. They just go ahead and do all the processing anyway. She then discusses how they use adaptive computation time, or ACT, to optimize the compute spent on each data sample, and that's shown in this code here. Oh my god, I need more screens. These two functions handle the ACT part of it. Wait a second, I passed it. Then they talk about a drawback of using ACT. Although the cost is not directly shown in the HRM paper, it is in the code: the Q-learning objective relies on a halting loss and a continue loss, and the continue loss requires an extra forward pass through HRM. So while ACT optimizes time per sample, it requires two forward passes per optimization step. She shows that their model obviates the need for the extra forward pass of ACT.
Then she discusses other issues with HRM being based on biological architecture, which is not like, I guess, current research on artificial networks. In this paper they say that's not really the direction they want to go in, so they stick with other techniques that are currently used, I guess, in RNNs and others, and TRM does not try to add the extra steps used to model what we understand to be the biological system. Here's the pseudocode she presents for TRM. As you can see, the code for the deep supervision is slightly smaller, although it does some computing through other processing that the HRM version doesn't do, only these functions. The part that does the recursion is two-step, where the first iteration is done in here. Well, let's actually go to what they presented. In this paper she presents that TRM does not require any more sophisticated understanding of how the model works, incorporating biological and other ideas into the work. It just generalizes the idea from HRM and tries to reduce and simplify it to be smaller, possibly less compute-intensive, and less theoretical. It substitutes a single pass for the ACT instead of the two, and it's presented here in this algorithm, the code here. Because it is simplified, the model can be smaller and yet yield better results. She talks about how the HRM paper makes assumptions about how certain aspects of the biological model are going to work, and because of those assumptions they code their processes in a specific way to model that, requiring certain computations.
But here, because they're simplifying the ideas they use, they just depend on, I wouldn't say classical, but regular artificial neural network ideas, with full backpropagation through the network. With the deep supervision presented here, they take the values from the recursive reasoning and, I guess, compute the loss from them to determine what corrective action should be taken, and then use that as input to the next loop. So instead of gradient approximation they're just recursively descending through the model to yield the results. Is that
>> Can I ask a question on that? This figure three, the code you're looking through. In latent recursion, that first function, there's a loop, for i in range n. Is that the number of recursions you have in this small model? And then it's wrapped in this other one, deep recursion, which also has a loop, for j in range T minus one. So is it saying: okay, without the gradient we're going to run multiple times, trying to improve, improve, improve, and then once we have the improved model that has presumably converged because of those T minus one loops, we do backpropagation through however many layers you did in the latent recursion?
>> Yeah, I wish I had a way to show both of these. Hold on, let me try this. Well, no, I'd have to really share the screen. So here is where they're generating the recursive part of the reasoning, and all of that is done inside here.
It looks like in HRM the computation is split between deep reasoning here, where some of the extra output steps are done, and the recursion, which is solely within this part. Or actually, no, here they're just doing the full processing. It's not recursing any further. I guess all of that is done in here. If I could cut and paste and put these two next to each other... let's take a screenshot.
>> Yeah. So while you're doing that: I actually found the HRM paper really confusing, and I understood it a lot better when I read the explanation of HRM at the beginning of this paper. When we get into TRM itself, I think she does a good job of naming things in a way that's far more intuitive. Instead of this high, low, whatever business, it's our preliminary answer y, or y-hat or whatever you want to call it. So we have our input problem x and our preliminary answer y. The one thing I would improve on is that she talks about this latent z. I would refer to that in more everyday language: z is your goal. The model does a loop and says, let me refine what my goal is. What problem am I even trying to solve here? Once it has made its best guess, after several iterations, of what the goal should be, it says: well, if that's the goal, how should I change my preliminary answer to this problem? Then it can modify y a little bit. And then it repeats both of those several times. It says, okay, let me rethink some more. Do I have the goal right? Based on my most current answer of what the goal is, let me update my answer. You can do that whole business however many times in your inner repetition. And it's confusing because these are recursive. Okay?
So let's say, I don't remember what the numbers are, you refine your goal three times, so you run that layer three times in a row, and you update your guess six times. You can replace latent recursion and deep recursion with just an 18-layer model if it's three and six. So you can just say we did deep supervision on an 18-layer model. In this case, of course, as we talked about at the beginning, you save a bunch of memory by actually having only this one model that you use for all 18 steps. But it is nevertheless somewhat similar to just using an 18-layer model.
>> So Ted, am I following correctly, then, that the part here in deep recursion with torch.no_grad is where it's trying to figure out what the goal is? And then when we get down to the line underneath, it's actually doing the latent recursion with gradients, so the model is now learning how to get to that goal, essentially in one pass. In this case, you could replace latent recursion with just a model, right?
>> Yeah. In this case, I think, with a three-layer model that updates your goal.
>> Right. Right.
>> Yeah, because I think she ended up choosing three. The code is actually not that long, but it is a little harder to read just because of the choice of which iterations you do backprop on and which you don't. What you'll see in deep recursion (this is the answer refinement loop) is that it does it capital T minus one times without backprop, and then it does it once more with backprop. So you'll see there are actually two different calls to latent recursion, two lines apart.
>> Yeah. Yeah, I see that.
>> So it's actually calling latent recursion T times.
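The control flow being described can be sketched without any deep-learning library. This is a structural illustration only, not the paper's code: `net` is a stand-in that averages its inputs, the `track_grad` flag merely records which calls would sit inside the gradient-enabled region (playing the role of code outside `torch.no_grad`), and `CallCounter` is a hypothetical helper. The values `n=3` and `T=6` follow the figures mentioned in the discussion.

```python
class CallCounter:
    """Records how many network calls happen, and how many would carry gradients."""

    def __init__(self):
        self.total = 0
        self.tracked = 0


def net(inputs, counter, track_grad):
    """Stand-in for the tiny network: averages its inputs and records the call."""
    counter.total += 1
    if track_grad:
        counter.tracked += 1
    return sum(inputs) / len(inputs)


def latent_recursion(x, y, z, n, counter, track_grad):
    for _ in range(n):
        z = net([x, y, z], counter, track_grad)   # refine the latent ("the goal")
    y = net([y, z], counter, track_grad)          # then update the answer once
    return y, z


def deep_recursion(x, y, z, n, T, counter):
    for _ in range(T - 1):                         # T - 1 passes without gradients
        y, z = latent_recursion(x, y, z, n, counter, track_grad=False)
    # One final pass, the only one that backpropagation would see.
    y, z = latent_recursion(x, y, z, n, counter, track_grad=True)
    return y, z


counter = CallCounter()
y, z = deep_recursion(x=1.0, y=0.0, z=0.0, n=3, T=6, counter=counter)
print(counter.total, counter.tracked)
```

In this sketch each latent recursion makes `n` latent updates plus one answer update, so the network is called `T * (n + 1)` times in total, and only the `n + 1` calls of the final pass fall inside the gradient-enabled region.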
Only the last one is going to be used in backprop, but it is calling it big T times all told, and then latent recursion itself has little n as its loop counter. So I believe they settled on something like big T is six and little n is three for the harder problems like ARC. And so in essence the amount of compute you're applying to this is roughly equivalent to an 18-layer transformer model. Sorry, the latent recursor is a two-layer model. So actually it's 18 × 2. So it's actually a 36-layer model. One of the things that's confusing is if you look at figure one, or whatever the diagram is, it actually says it's a four-layer model. And that's a typo. They had experiments where they did four, but they settled on two. So this is where I was saying, in terms of compute you do 6 × 3 × 2 layers, right, 36 layers, and then if you do this deep supervision 16 times then it's 16 × 36 layers all told being computed in order to get an answer. Okay. >> Yeah. I don't know what my screen is showing right now, so I don't know if you're seeing both code samples. >> Yes. >> Yeah, it's great. We see the screenshot to the right of the actual TRM code on the left. >> Okay. So, yeah, as Ted was saying, I can't highlight it now because then it'll make this code disappear. But the latent recursion is called here through the loop and then one more time before exiting, to, I guess, better close in on where you want to get to on y and z. So in the original HRM paper they had deep supervision because it was going over each recursion through the reasoning step. And here they're calling it deep recursion because it's being invoked through two steps through the networks to get their results. So in this one I guess they have both deep supervision and deep recursion.
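The loop structure described above can be sketched in a few lines. This is a minimal toy, not the paper's code: `net` is a hypothetical scalar stand-in for the tiny shared network, the `calls` counter is only there for illustration, and the values T = 6, n = 3, 2 layers and 16 supervision steps are the numbers recalled in the discussion, not values checked against the paper. In the real pseudocode the first T - 1 passes sit under torch.no_grad(); here that is only marked by a label.

```python
def net(x, y, z):
    """Hypothetical stand-in for the tiny shared network (a toy scalar update)."""
    return 0.5 * z + 0.25 * (x + y)


def latent_recursion(x, y, z, n, calls, mode):
    """Refine the latent z ("the goal") n times, then update the answer y once."""
    for _ in range(n):
        z = net(x, y, z)
        calls[mode] += 1
    y = net(0.0, y, z)  # the answer update looks at y and z only
    calls[mode] += 1
    return y, z


def deep_recursion(x, y, z, n, T, calls):
    """Run latent_recursion T times; only the last pass would carry gradients."""
    for _ in range(T - 1):
        # In the paper's pseudocode this part runs without gradients.
        y, z = latent_recursion(x, y, z, n, calls, "no_grad")
    # Final pass: the only one that would be used for backprop.
    return latent_recursion(x, y, z, n, calls, "with_grad")


def speaker_layer_count(T, n, layers, supervision_steps):
    """Reproduce the back-of-envelope arithmetic from the discussion."""
    per_step = T * n * layers
    return per_step, supervision_steps * per_step


if __name__ == "__main__":
    calls = {"no_grad": 0, "with_grad": 0}
    deep_recursion(x=1.0, y=0.0, z=0.0, n=3, T=6, calls=calls)
    print(calls)                               # {'no_grad': 20, 'with_grad': 4}
    print(speaker_layer_count(6, 3, 2, 16))    # (36, 576)
```

In this sketch each latent_recursion makes n + 1 network calls, because the answer update is counted too; the 6 × 3 figure in the discussion counts only the goal refinements.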
And then again the illustration for the model. It's the architecture. And here is what Ted was talking about, where it says four times, but it's actually two for each step done through the network. So this is the deep recursion part and this is the deep supervision, and z and y are what they're looking to improve. >> Right? So if it helps, the recursion parts have nothing to do with how the model learns. That just has to do with the architecture. That's a choice to reuse layers instead of having individual separate layers. I'm slightly oversimplifying. Okay, but if you're struggling trying to follow this, it's best to just think about recursion as how you build up your calculations. Just like I said, you could calculate factorials with a for loop or you could calculate factorials with a bunch of handwritten multiplies. In the end they do the same job. It's just how you chose to architect it. Okay. So the deep supervision has to do with the training, and to a certain extent you can say the deep recursion is just, if you had 18 layers, how did you decide to sandwich them and build them up into this 18-tall, 18-decker burger, you know? >> Let's see. Okay. So, what I wanted to see is... Okay. >> And hey, one question I had, maybe a silly one. Is it deep repetition or is it deep recursion, really? >> Go ahead and use the word repetition. Yeah. Because, I mean, there is one sense where recursion maybe helps you think about what's happening here. But repetition is a very good word so that you don't get super confused. >> Yeah. What really throws me off is what is Q head over there? I understand z as some kind of latent representation, but what is a Q doing there, and it's also playing a role in early stopping. I just don't get that. >> The Q is only there for early stopping. It's completely not necessary for this model. So I'm looking at the time.
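The factorial analogy can be written out directly. A minimal sketch, with hypothetical function names: one version reuses a single multiply step in a loop, the other writes every multiply by hand, and both compute the same thing.

```python
def factorial_loop(n):
    """Reuse one multiply step n - 1 times (the 'shared layer' style)."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def factorial_5_unrolled():
    """The same computation with every step written out (the 'separate layers' style)."""
    return 1 * 2 * 3 * 4 * 5


if __name__ == "__main__":
    print(factorial_loop(5), factorial_5_unrolled())  # 120 120
```

The analogy is loose in one respect that the speaker also flags as an oversimplification: reusing a layer means the same weights are applied at every step, whereas separate layers would each have their own.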
We're getting a little close to the end. So I'm going to add a few more editorial comments. In my opinion, maze solving and Sudoku are actually relatively easy problems. Okay, they're not trivially easy, but don't forget that this model, when it was trained for maze solving, was trained only to do maze solving. It could do maze solving and nothing else. When you're using GPT-5 or Claude or whatever, you're taking this general-purpose model and just saying, here are the rules for Sudoku. See if you can solve this particular problem. Okay, so it's really very apples and oranges. If you built a custom network just to solve mazes, I think you and I could build something. It might not be as fast, but we could build something that could just basically do a breadth-first search and solve the maze. And so it's not clear to me that HRM or TRM do anything smarter than breadth-first search. They probably do, but it's not 100% clear that they do anything better than that. So, what happened was when they were training these things, imagine you or I just training this on like a 5090, okay? They're like, "Oh, this thing's going to take five days to train." But actually, if instead of 16 loops it has the right answer after only four loops, it's got the solution, I could cut my time to a quarter by just stopping when I already have the solution. And so that's the reason for the Q: it dramatically sped up their training by like 4x, because on some of these easy problems it got the answer long before it did the 16 loops. So it's a plus in terms of it trained faster, but for me it's a little bit of a red flag. It goes to show how easy the problems they were solving are. >> Yeah. >> So just to finish off, if you completely got rid of the Q part, it would always run 16 loops and the model would work the same. It would just be slower because it always runs 16 loops. >> Yeah. No, it's fine.
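The early-stopping idea can be sketched as a loop with an optional halt check. This is an illustration only: `refine` and `is_confident` are hypothetical callables standing in for one supervision step and for a halting signal such as the Q head, and the toy problem below is made up to match the "right after four loops" example from the discussion.

```python
def run_supervision_steps(refine, is_confident, state, max_steps=16, early_stop=True):
    """Apply `refine` up to max_steps times, optionally halting once confident."""
    steps = 0
    for _ in range(max_steps):
        state = refine(state)
        steps += 1
        if early_stop and is_confident(state):
            break  # skip the remaining loops when the answer is already good
    return state, steps


if __name__ == "__main__":
    # Toy problem: the "answer" is correct once the counter reaches 4.
    refine = lambda s: s + 1
    is_confident = lambda s: s >= 4

    _, with_halt = run_supervision_steps(refine, is_confident, 0)
    _, without_halt = run_supervision_steps(refine, is_confident, 0, early_stop=False)
    print(with_halt, without_halt)  # 4 16
```

With the halt check disabled the loop always runs all 16 steps, which matches the point made above that removing the Q part changes the cost and not the model.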
I think I understand how Q is used, but I'm just not able to understand how Q is derived. Even if I go through the code and the architecture of the model itself, I don't see a place where we are getting Q out. >> I think once we get >> Just a linear layer. It was just a very simple prediction, I think. >> Got it. That's okay, we can move on. Thank you. >> Oh, oh, oh. Gotcha. Gotcha. Okay. Yes. Figure three is pseudocode. They don't actually show you all the code. There's a little bit more code that they haven't shown you. That's why you can't tell where Q head comes from. I think it's a linear layer. >> Got it. Yeah. Okay. >> You can go to the repo and you can see. But it's very, very simple and fast. >> Understood. Got it. Got it. Yeah. Thank you. And there's one other comment that they make in the HRM section, 2.5, saying that the 16 times is run not during train time but during test time. I did not understand that part, because I thought all of this is manipulating loss to adjust the model's behavior, and by that measure it must all be during train time. So if you look at the very last part, they call this ACT: it's only used during training, while the N_sup equal to 16 supervision steps are done at test time to maximize downstream performance. That again threw me off a little bit, like >> I don't know what you mean by >> So again, this was really just because they were training on some easy problems. They would sometimes halt the training early, but when you're going to get scored on the benchmark you don't want to be 98% sure you have the right answer. You're just going to run all 16 loops, and that's why at test time they did not use their early stop. They could have used it, but then they might have gotten a score that was like, >> you know, 95% of their final score, and they would have gotten marks for doing it in a lot less compute. >> But generally speaking, nobody cares about compute.
They're just scoring you on accuracy. So in order to get the max accuracy, they ran it on all 16. And my understanding is some people have posted on Twitter that they modified the code to run it even more than 16. They said, "Hey, you know what? You get an extra percent of accuracy if you run it on 32." And somebody else said, "Hey, guess what? I got another 2% by running it 100 times." So it's really just a matter of that, you know, compute tradeoff thing. >> Understood. For some >> that's why they always >> Yeah, I was confused because I was thinking train and test during the model training process, but the test time for benchmarks is an overloading of the word test, I guess. >> Yeah. Yeah. Yeah. Test time is, right, it's a funny concept. Yeah. So then she also discuss >> Yeah. Go ahead. >> No, I was just going to point out the time and then figure out what we want to do with the remaining time, where you're at. I don't know how far we are. How much longer do you expect to go? >> Oh my goodness. Yeah, we've just about started scratching the surface of what the TRM theories were and what they were presenting. So Evan, if you want to continue it next time or another time. >> Yeah, why don't we keep going, we've got another 10 minutes or so here. And then if we want to roll into next week, we can decide to do that. >> All right. >> So, again, one of the drawbacks with the HRM is that they have the two networks that they're using: the low frequency one and the high frequency one. Oh, they got a typo here. Yeah, so that should be f_L. Which generates the high-level results. And so in this paper she said that they replace those two networks with one.
I guess that's why they do the recursion, or they have a separate loop for running through those results. And she claims that by doing that they were able to get even better results with less compute, because they didn't need twice the parameters. And as Ted alluded, they had tried the four-layer version but found that generalization was worse because of overfitting. So they just dropped it to two. While increasing the number of recursions, they were able to get better results in generalization. >> If I could jump in on that point. This is some early research. These are sort of proofs of concept, and I think the research is very interesting and very innovative. So take this criticism very, very lightly. In the long run we want models that we can make bigger that will then get better. So we do need to figure out how to prevent overfitting with these models. Again, this is light criticism, because this is really good research and it's early, but I consider this to be an almost fatal flaw of TRM as is. It should not get worse when it gets bigger. Again, they're running this on toy problems like Maze and Sudoku. And I know they got some ARC results. It's not clear to me the ARC results are really generalizable. So in my opinion, it's like you solved some very toy problems. Ultimately, if this is going to be useful, we need to find a way where we could build bigger TRMs and have them not just horribly overfit, or it will not be useful.
And even then it's still debatable whether this formulation can translate to a general-purpose model, since these were trained on just, you know, single tasks, right? So it seems that these models, like I said, because they're so small, they're very special purpose. I mean, they're aiming for generalization here, as she says, but because they don't, I guess, handle a larger number of parameters, they're not better for general sorts of problems. They're just tailored to solve this particular one. Kind of like tuning a model for a benchmark as opposed to using the benchmark to test its applicability for certain problems. >> Yeah, I mean the analogy I would give is, many of us may have done, as one of our first neural network problems, solving MNIST with a model with just a single hidden layer. Okay. And you can actually get, I forget, you know, around 80% accuracy or whatever on those 10 different digits. But fundamentally just a one-hidden-layer network is not a great architecture for solving hard problems. And if you build bigger and bigger ones on MNIST, it largely just overfits and it doesn't ever get better. And that's a sign that this is not really a great learning architecture, in my mind. And so ultimately, you know, we had CNNs that we built, and with the CNNs you could build them bigger and they didn't just massively overfit when they got bigger. I'm maybe being a little bit loose with my analogy here, but that's where I feel like if the problem is easy enough, like MNIST, and maybe Sudoku is not that easy, but I think of maze solving as being very easy, you might actually just have a fairly simple, not great, not learning, not extensible architecture that can solve it. And it's not necessarily super surprising. That's my concern with this particular finding.
It doesn't mean for sure that that's the case, but it is a red flag that it could be the case that this is just solving a simple problem with something that's not a general-purpose algorithm for building better neural networks. >> But is that a statement, Ted, about the architecture or is it a statement about the training methodology? >> It could be some combination of those, and to be honest, it's probably mostly just saying here's one lone researcher and she did this on one GPU. We can't expect her to have solved everything. So I'm saying this is definitely not a criticism of this particular paper, but I'm saying on this direction somebody needs to solve this or this thing is going to, in my opinion, be a dead end. >> Yeah. Fair enough. Yeah. >> Yeah. So she said in the section before that by dropping one of the networks they reduce the number of parameters by half, and then by reducing the layers, because of the overfitting problem, they reduced it to two, or to half, and again dropped the number of parameters by half. So yeah, that seems to fall in line with the problem Ted just mentioned, where if it has the problem with overfitting, because of these it can't use the parameters to be scaled up to solve other problems. Then it's very limited in its use and not the right way to go for general-purpose problem solving. But just in case, she references another paper that described optimal performance for two layers in the context of deep equilibrium diffusion models, but it had problems trying to make larger networks. So that may be the drawback to this particular implementation that they have. That remains to be seen, but when there's not enough data, well, still talking about the overfitting. Here she says it might be that there's too little data within the network.
So then you're increasing the memory size of the network in order to carry more of the weights and the parameters that you need. Yeah, so this here: using tiny networks with deep recursion and deep supervision appears to allow us to bypass a lot of the overfitting. But that may be a problem with how she's interpreting the results of the experiments. Rather than the algorithm being able to utilize the sparse amount of data that it has to generalize correctly, it may be that the problem is that the networks are too small to generalize correctly. So which direction it is, I guess more research will have to be done on that. And then she deals with self-attention in the networks and says that it's proven for long context lengths. But for other tasks, a linear layer is cheap, requiring only a matrix of L × L parameters. So they replace the self-attention layer with a multi-layer perceptron on the sequence length and got better results on Sudoku-Extreme. However, we found this architecture to be suboptimal for tasks with larger context length. We show results with and without self-attention for all experiments. >> Yeah. And once again, if you don't mind me jumping in, Alfonso, I think this is a huge red flag. If you say we can just solve this with, you know, a fully connected layer instead of attention, that's a sign that the problem you're solving is very easy in my book. >> Yeah. Yeah. >> But at least they included the results with the network. But that just may show that in other parts of the architecture that they then built there are other failings, or a lack of completeness, to handle the problems that, I guess, most people are trying to apply these models for. It tends to be too application specific this way.
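To make the "matrix of L × L parameters" remark concrete, here is a minimal sketch of a single linear layer that mixes information across sequence positions. It is not the paper's MLP, only the one mixing matrix; the function names are hypothetical, and the figure of 81 positions assumes a 9 × 9 Sudoku grid flattened to one token per cell, which is an assumption made here for illustration.

```python
def mix_along_sequence(tokens, weights):
    """Linear mixing across positions.

    tokens:  L rows, each a list of D features.
    weights: L x L matrix; output position i is a weighted sum of all input positions.
    """
    L = len(tokens)
    D = len(tokens[0])
    return [
        [sum(weights[i][j] * tokens[j][d] for j in range(L)) for d in range(D)]
        for i in range(L)
    ]


def mixing_parameter_count(L):
    """Parameters in one L x L mixing matrix (no bias)."""
    return L * L


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    tokens = [[1, 2], [3, 4], [5, 6]]
    print(mix_along_sequence(tokens, identity))  # [[1, 2], [3, 4], [5, 6]]
    print(mixing_parameter_count(81))            # 6561
```

The count grows with the square of the sequence length and the matrix is tied to one fixed length, which is consistent with the remark above that this approach was found suboptimal for larger context lengths.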
>> Yeah, we're running against our time limit. If I could make one quick comment. I've heard people talk about this Sudoku thing, and one of the things you hear when people are talking about LLMs, now that they're trying to create these thinking models and do reasoning, one of the things that specifically seems to help is when a model is capable of saying, well, I started down on this track, but let me backtrack and let me try a different approach. Okay. That's a characteristic that people generally look for in some of these thinking traces. And I've heard people talk about how that might be something happening in Sudoku. I don't know where I am on the grand scheme of skilled Sudoku people, but at one point I kind of got into it and started doing some really gnarly, crazy hard Sudokus. And for the record, I do no backtracking when I solve a Sudoku. It's completely deterministic. So if you do the bookkeeping, then you say, oh, I see this pattern, therefore this has to be a one. Oh, I see this pattern, therefore in my bookkeeping I can say this cell cannot be a three. And then it's totally forward. There's no backtracking. And so maybe that's where the fact that the MLP was able to do the Sudoku pretty well comes from. That's consistent with my experience that it does not require complex reasoning. It just requires that you have all the right rules for knowing that if there's already a two in that row, then you cannot have a two here. >> So then the backpropagation may solve the Sudoku problem as opposed to the deep recursion that they're >> Yeah. Yeah. Yeah.
Some people are saying that Sudoku requires intelligence because you have to say, well, what if I put a one here, then I can't put a two there, then I this, then I that. That is not the way that I know advanced Sudoku players solve things. They don't have to do any what-if. They just simply say bookkeeping, bookkeeping, bookkeeping, oh look, this cell cannot be, you know, a two, three, four, five, six, seven, eight or nine. It must be a one. I'm going to fill in the one. And then they're going to see, I see a pattern of these four cells in a rectangle that have these possibilities. Therefore, this one has to be a four. There's no what-if involved in the way that I know how people solve Sudoku. So, I just wanted to share that. >> Yeah. In other words, you could deterministically compute the result of a Sudoku in non-prohibitively large times. You don't need really >> There's no search. Exactly. We don't need prohibitive times because there's no combinatorial search. It's just bookkeeping, bookkeeping, bookkeeping, bookkeeping. And at some point, you're like, I now know that this cell >> can be nothing but a one, so I'll fill it in. And then you keep going. >> So >> it's not quite as bad as the traveling salesman problem. >> Exactly. That's what I'm trying to get at. So for people who don't know advanced Sudoku, you might think that it involves that prohibitive search. But my limited experience with hard Sudoku is that it does not. >> Yeah.
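The "bookkeeping, no what-if" style of solving can be sketched as pure constraint elimination. This minimal example implements only the simplest rule mentioned above (a cell whose row, column and box rule out every digit but one gets filled in); the richer pattern rules the speaker describes are not included, so harder puzzles may be left with empty cells. The function names and the demo grid, built from a standard shifting pattern, are illustrative choices made here.

```python
def pattern_solution():
    """A valid filled 9x9 grid built from a standard shifting pattern (demo data only)."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def candidates(grid, r, c):
    """Digits not yet used in the cell's row, column, or 3x3 box (0 means empty)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def fill_forced_cells(grid):
    """Repeatedly fill every empty cell that has exactly one candidate. No guessing."""
    grid = [row[:] for row in grid]
    progress = True
    while progress:
        progress = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    options = candidates(grid, r, c)
                    if len(options) == 1:
                        grid[r][c] = options.pop()
                        progress = True
    return grid


if __name__ == "__main__":
    solution = pattern_solution()
    puzzle = [row[:] for row in solution]
    for r, c in [(0, 0), (0, 1), (2, 6), (4, 4), (8, 8)]:
        puzzle[r][c] = 0  # blank a few cells
    print(fill_forced_cells(puzzle) == solution)  # True
```

Every step here only moves forward: a digit is written only when nothing else fits, so there is never anything to undo.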
Original Description
This week we reviewed "Less is More: Recursive Reasoning with Tiny Networks" https://arxiv.org/abs/2510.04871 This paper explores using a model recursively to solve reasoning problems on small experiments. This is not a state of the art model, but it is an interesting set of experiments using a very small model in a non-traditional way.