Do LLMs Know When They're Wrong?

Martin Andrews · Beginner · 🧠 Large Language Models · 9mo ago

Key Takeaways

The video discusses two recent research papers on how Large Language Models (LLMs) can gauge their own uncertainty to improve reasoning: ARPO, which uses spikes in token entropy after tool calls to guide exploration during agentic rollouts, and Deep Think with Confidence, which uses a confidence score built from top-k logprobs to filter reasoning traces and weight the final vote.

Full Transcript

So in today's video I'm going to talk about LLM confidence. In particular, I'll talk about two recent papers: one called ARPO and the other Deep Think with Confidence. As a prelude, I'm going to go back and talk about some of the things which were circulating around the time o1 was released by OpenAI. Hopefully understanding these ideas will give you an intuition about how this research came about, and also the scope for experimenting with these things yourself.

So back in September 2024, OpenAI released o1 with this blog post, also showing off the model. This was an extremely exciting time, but of course everyone was interested in how they were doing this. So shortly after o1 was released, I talked at the Machine Learning Singapore meetup about how this might be happening, with a whole presentation here talking about the different things which might be going on. My presentations at the meetup weren't videoed, but I'll give you a very quick catch-up now so we can lead through to the research which has just been published.

So one of the theories that people had about how o1 worked traced back to work done in 2023 at OpenAI: "Let's Verify Step by Step". In this work they traced through mathematical proofs in a stepwise fashion, trying to get the model to figure out whether or not it had improved its odds of finishing the proof at each step. And so this paper showed the benefits of having a process reward model, which looks at the solution step by step, as opposed to an outcome reward model, where you only give a reward for the final answer. This is a figure from that paper in which you can see that either majority voting or the outcome rewards essentially flatten out in terms of performance as you apply more scaling at test time, whereas the process reward model keeps going up and to the right. And so this was a very interesting result, and the paper further fueled the idea that there might be something interesting going on with this sentence:
"We consider this a compelling direction for future research", which people took as an indication that they'd already found something compelling.

As an aside, one slide I had in my previous presentation pointed out a slight problem with o1. So this was OpenAI's much-touted test-time scaling observation, showing how you could improve performance by adding compute, and that in itself is an interesting observation. However, touting this as being a log-linear scaling is actually the opposite of what you want, really. You want an exponential-linear scaling or even a linear-linear scaling. Showing this log-linear scaling is really claiming that to get plus x% of accuracy we need to do twice as much compute, which is essentially saying you can get as much accuracy as you want by burning exponential amounts of money. But even so, people were very excited by this graph.

Another thing which I found interesting was this Physics of Language Models tutorial at ICML in July of 2024. Now in this, one of the interesting things, which I had also already been exploring, was the fact that LLMs kind of know whether they have made a mistake once they've emitted the tokens for that mistake. So they may not know ahead of time that they're going to make a mistake, but after they've output some bad tokens they know they're in a tough spot. So this is kind of an interesting thing which could possibly guide a reasoning trace.

So my super simple idea from early 2024 was "do-over training", where essentially we'd start from the observation that these things know when they're doing something wrong. Maybe we can train an LLM to say "whoops" after it makes mistakes. And we could then make a dataset of mistakes followed by "whoops" followed by the correct answer, which essentially would allow the LLM to backtrack, or learn to backtrack. So by September, other people had published similar ideas. So at least I knew I was in good company.
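To make the log-linear scaling point above concrete, here is a short worked form, assuming accuracy really does follow a straight line against the logarithm of compute. The symbols $A$ (accuracy), $C$ (test-time compute), and the fit constants $a$ and $b$ are introduced here purely for illustration; they do not come from the OpenAI post.

```latex
A(C) = a + b \log_2 C
\quad\Longrightarrow\quad
A(2C) - A(C) = b,
\qquad
C(A) = 2^{(A - a)/b}
```

Under that assumption, every additional gain of $b$ in accuracy costs another doubling of compute, so the compute required grows exponentially in the target accuracy.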
On the other hand, that didn't really answer "what is o1 doing?", which was top of mind for everyone in the field.

Now, another line of thinking, which kind of went wild on Twitter, was Entropix. Some of the key people here, I think, were Doomslide, aka Frog, and there's also Shrek. There's a bunch of people very involved in having a look at the entropy of the distribution of token outputs to try and figure out what the LLM was feeling at the time, and whether they could guide it in different directions. And Twitter being Twitter, this turned into a meme fest. On the other hand, there was a repo of code. There was clearly serious work being done, and people were just wondering, okay, what's going to come of this?

To go back to the outline, I've talked about the o1 launch, the ICML tutorial, and Entropix. The news here is we've got signs of life for Entropix-like methods: ARPO and Deep Think with Confidence. So I'm now going to go through those two papers.

So the ARPO paper came out in July of 2025. It stands for Agentic Reinforced Policy Optimization. The idea here is based around the fact that LLMs show spikes of token entropy right after using tools. So we could use that as a kind of insight into what's going on, to guide exploration and also to try and estimate how well we're doing as we're rolling out. They published a code repo, and I'll link all of this in the description, and the paper is on arXiv so everyone can read it.

So taking a snapshot from the first page of their paper, you can see here a graph of the entropy of the tokens following various calls to tools. So this is a whole agentic thing. They're trying to figure out: does the model like the tools which have been called? How surprised is it after it receives tool feedback? Are the tools giving it back out-of-distribution kinds of tokens?
And then they use that to guide whether we should be using this as a branching point in a rollout strategy for learning how to do agentic stuff. So you can see here from the top that the entropy is highly variable. So what does it mean? For the entropy here, we're looking at the distribution across all tokens being output at every step. A high entropy would essentially be a very even distribution amongst all possible tokens, whereas a low entropy would be a very certain output, so that essentially the LLM knows what it's going to do next. So you can say that low entropy is very settled, and high entropy has a large degree of variability, which indicates uncertainty.

So the results shown in the paper are very attractive, though I think we have to be careful about how many rollouts we're allowing ourselves, whether it's pass@1 or pass@5. But clearly this ARPO method is showing some advantage by measuring the uncertainty in the LLM's outputs to try and gauge whether it should be tool calling and how it should be branching. And from the graph at the bottom we can see that ARPO learns to be more efficient with the tool calls it makes.

Switching over now from the agent setting to one which is more about LLMs for mathematics: this is Deep Think with Confidence, which is a Meta paper from August of this year. This got way more attention on Twitter, and in particular drew the comparison with Entropix from the year before. So in this method they take the confidence to be essentially a fixed-window moving average over the negative mean top-k logprobs. So basically they're looking at the degree of uncertainty over a moving window as we go along. So rather than looking at the entropy of each token, they're looking at a confidence measure, which is the sliding window over the top k, which are the most probable tokens. But here, what they're doing is trying to find out how unlikely the most probable tokens are, i.e.
how unconfident are we that the correct answer is in the top k. Now, this is also quite easy to calculate and easy to deploy, and they've made a pull request into vLLM of about 50 lines of code, where you can then have this as a flag. We can switch it on in vLLM to actually pull out this measure.

So, Meta has a nice project page for this paper which links to the paper, the vLLM code, and some examples as well as the results, and they have kind of a nice animation. Playing through this animation, it shows a bunch of trajectories for some mathematical reasoning. What we can see here is that each of these trajectories, if it remains confident the whole way, gets to the tick, the place where it decides to finish, so now we have some end points for these trajectories. On the other hand, there are other trajectories here which end in a low-confidence kind of trace and just get eliminated. And so what they're doing is, for the ones which remain confident until they get an answer, they then take an average over those to determine the actual answer being presented back. So here the reasoning outcome, rather than being just a plain majority vote, is essentially weighted by confidence, and they show that this is a very effective way of getting to the right answers.

They also link some code which implements their method, and in particular they show how this can be embedded within vLLM. And if you don't want to just use their pull request directly, they actually show you how to copy-paste it into your own local copy of vLLM in a step-by-step way.

So, as I mentioned before, there are some very encouraging results on AIME 2025, a hard maths exam. And here they're showing extremely high performance, at 99.9% for this DeepConf@512. Now the models they're using here are the OpenAI open-source models.
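As a rough illustration of the two signals discussed above (per-token entropy, and a moving-window confidence built from top-k logprobs that then weights a vote), here is a minimal Python sketch. It is not the paper's code or the vLLM pull request: the function names, the default window size, the `keep_fraction` parameter, and the choice to score each trace by its lowest window are all assumptions made for illustration. Note that with confidence defined as the negative mean of the top-k logprobs, a larger value corresponds to a more peaked distribution, since the runner-up tokens then have very low probability.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_confidence(topk_logprobs):
    """Negative mean of the top-k log-probabilities at one position."""
    return -sum(topk_logprobs) / len(topk_logprobs)


def window_confidences(per_token_topk, window):
    """Fixed-window moving average of token confidence along a trace.

    Assumes a non-empty trace. A trace shorter than the window yields a
    single value: the mean over the whole trace.
    """
    conf = [token_confidence(lp) for lp in per_token_topk]
    if len(conf) <= window:
        return [sum(conf) / len(conf)]
    running = sum(conf[:window])
    out = [running / window]
    for i in range(window, len(conf)):
        running += conf[i] - conf[i - window]  # slide the window by one token
        out.append(running / window)
    return out


def weighted_vote(traces, window=4, keep_fraction=0.5):
    """Pick an answer by confidence-weighted voting over filtered traces.

    traces: list of (answer, per_token_topk_logprobs) pairs.
    Assumption: a trace is scored by its least confident window, the
    lowest-scoring traces are dropped, and the survivors vote with
    weight equal to their score.
    """
    scored = [
        (min(window_confidences(per_token_topk, window)), answer)
        for answer, per_token_topk in traces
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, math.ceil(len(scored) * keep_fraction))
    votes = defaultdict(float)
    for score, answer in scored[:n_keep]:
        votes[answer] += score
    return max(votes, key=votes.get)
```

In a toy case with two sharply peaked traces answering one way and three nearly flat traces answering another, a plain majority vote would side with the three flat traces, while this filtered, weighted vote sides with the two confident ones.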
These are good models, but clearly enhancing them with this kind of confidence measure can increase the performance even more. So as kind of a summary here, I'd say yes, this thing definitely works. There is a performance boost. They say it's ultra-efficient, which reflects the number of tokens being generated. On the other hand, this is achieving the 99.9% accuracy with 512 rollouts. So I would kind of question whether doing this number of rollouts is real reasoning. From a results point of view, you might argue that anything which gets us to the right results is good reasoning, and therefore we've made a good reasoning model. On the other hand, I'm pretty sure that if I were doing AIME questions and someone saw me taking 500 separate attempts, they would reasonably assume I really didn't understand what I was doing. So, in that sense, maybe this isn't the real reasoning we're looking for.

So, to wrap up: these research ideas take some time to brew. There's an aside I have with this Doomslide tweet. Here Frog is saying that he's very happy that this result has finally come out of work which they'd been exploring maybe 10 months earlier, but also saying that this Entropix thing has been a bit of a millstone, where there have been so many other things to investigate and people keep drawing Frog and Shrek back into this "what is happening with Entropix?" question. The second point is yes, this thing actually works, and it's going to be in vLLM very soon. And lastly, I'd say that this direction has proven to be fairly accessible. While it may take large resources to be doing 512 rollouts, the basic principle doesn't need very many evaluations to see whether we're on the right track. So the question then is what other directions are interesting and what else could be explored? Thanks for watching.
If you'd like me to cover more of these reinforcement learning topics, or maybe agents or RAG, I'd be very glad to read in the comments what you're interested in. I see this YouTube channel as being an opportunity for reinforcement learning in real time, since trying to judge what people are interested in and then getting measured on view time and all these other statistics makes for a very interesting problem. If you can help me out by commenting to let me know what is good, that would enable me to direct things in a much more beneficial way. So, looking forward to that. See you next time. Cheers.

Original Description

We're moving past LLMs that just predict the next word. Discover a new frontier: models that can gauge their own uncertainty to improve reasoning. This video explores two brand new papers that turn the "Entropix" meme into practical, working code.

Current methods like Chain-of-Thought are powerful, but they are essentially a model "thinking out loud." What if a model could recognize when it's on a bad path and correct itself? This is the core idea behind using token entropy and logprobs as a "confidence" signal.

This video is for the AI builder, developer, and enthusiast who wants to look under the hood. We break down the history of this idea (from OpenAI's o-1 hints to Twitter theories) and then dive into the mechanics of two pivotal papers:

1. **ARPO**: Agentic Reinforced Policy Optimization
2. **Deep Think with Confidence**: A practical vLLM implementation from Meta

By the end, you'll understand not just *what* LLM confidence is, but *how* it works, and *why* it's a compelling direction for building more capable and efficient agentic systems.

---

### Papers & Resources Mentioned

* [ARPO : Agentic Reinforced Policy Optimization (Dong et al., 2025)](https://arxiv.org/abs/2507.19849)
  + [ARPO GitHub Repo](https://github.com/dongguanting/ARPO)
* [Deep Think with Confidence (Fu et al., 2025)](https://arxiv.org/abs/2508.15260)
  + [DeepThink Project Page (Meta AI)](https://jiaweizzhao.github.io/deepconf/)
  + [DeepThink Pull Request for vLLM](https://github.com/vllm-project/vllm/pull/23201)
* [OpenAI o-1 Blog Post](https://openai.com/index/learning-to-reason-with-llms/)
  + [Let's Verify Step-by-Step (OpenAI, 2023)](https://arxiv.org/abs/2305.20050)
* [ICML 2024 Tutorial: Physics of Language Models](https://www.youtube.com/watch?v=yBL7J0kgldU)

---

### Chapters

00:00 - Introduction: The Idea of LLM Confidence
00:31 - Background: From OpenAI's o-1 to the "Entropix" Meme
05:26 - Paper 1: ARPO & Agentic Rollout Confidence
07:55 - Paper 2: Meta's "Deep Think wi

This video explores recent research papers on LLMs that can gauge their own uncertainty to improve reasoning, including the ARPO method and Deep Think with Confidence. By understanding these concepts, viewers can improve their skills in reading and reproducing research papers on LLMs, as well as applying research methods to improve LLM reasoning.

Key Takeaways
  1. Read recent research papers on LLMs
  2. Understand the concept of LLM confidence
  3. Apply research methods to improve LLM reasoning
  4. Implement LLM models using code
  5. Use ARPO and Deep Think with Confidence to improve LLM performance
💡 The concept of LLM confidence and uncertainty can be used to improve LLM reasoning and performance, and recent research papers have proposed methods such as ARPO and Deep Think with Confidence to achieve this.

Chapters (4)

0:00 Introduction: The Idea of LLM Confidence
0:31 Background: From OpenAI's o-1 to the "Entropix" Meme
5:26 Paper 1: ARPO & Agentic Rollout Confidence
7:55 Paper 2: Meta's "Deep Think wi