Taming LLMs: Using Executable Oracles to Prevent Bad Code

Using executable oracles can prevent LLMs from generating bad code by limiting their degrees of freedom

advanced Published 26 Mar 2026
Action Steps
  1. Identify areas where LLMs have degrees of freedom that can lead to poor results
  2. Develop executable oracles that can test and validate the output of LLMs
  3. Integrate executable oracles into the testing loop to provide feedback to LLMs
  4. Use opposing executable oracles to pinch LLM results and improve quality
Who Needs to Know This

Developers and researchers working with LLMs can benefit from using executable oracles to improve the quality and reliability of generated code, particularly in areas like compiler development and dataflow transfer function synthesis

Key Insight

💡 Executable oracles can help prevent LLMs from generating bad code by providing a clear and constrained set of goals and validation criteria

Full Article

Published Time: Thu, 26 Mar 2026 19:27:17 GMT

# Zero-Degree-of-Freedom LLM Coding using Executable Oracles

**[John Regehr](http://www.cs.utah.edu/~regehr/), March 26 2026.**

* * *

## You Can’t Trust The Damn Things

By this point, most of us who have experimented with Claude, Codex, and other LLM-based coding agents have noticed that the current generation of these can sometimes do good work, at superhuman speed, when given some kinds of highly constrained tasks. For example, coding agents can eat a large, tricky API—such as the one for manipulating LLVM IR—for lunch, and they’ve also given me a number of fixes to non-trivial bugs in real software that could be applied as-is. On the other hand, these same tools frequently fall over in baffling ways, emitting tasteless or nonsensical code.

When an LLM has the option of doing something poorly, we simply can’t trust it to make the right choices. The solution, then, is clear: we need to take away the freedom to do the job badly. The software tools that can help us accomplish this are _executable oracles_. The simplest executable oracle is a test case—but test cases, even when there are a lot of them, are weak. Consider Claude’s C Compiler, which [I wrote about earlier](https://john.regehr.org/writing/claude_c_compiler.html): even after passing GCC’s “torture test suite” and more, it still had 34 nasty miscompilation bugs that were within easy reach. But it wouldn’t have had those bugs if Csmith and YARPGen had been included in the testing loop that was used to bring up this compiler. These tools are better executable oracles because each of them implicitly encodes a vast collection of test cases.
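To make the difference between a fixed test case and a generator-based oracle concrete, here is a minimal sketch of differential testing in Python. It is not Csmith or YARPGen: the tiny expression language, the function names, and the planted bug are all hypothetical, chosen only to show how a random generator plus a trusted reference catches a bug that a hand-written test misses.

```python
import random

def random_expr(rng, depth=3):
    """Build a random expression tree: an int leaf or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(-8, 8)
    op = rng.choice(["+", "-", "*"])
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def reference_eval(expr):
    """Trusted evaluator, standing in for a mature reference compiler."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    x, y = reference_eval(left), reference_eval(right)
    if op == "+":
        return x + y
    if op == "-":
        return x - y
    return x * y

def candidate_eval(expr):
    """Hypothetical implementation under test, with a planted bug."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    x, y = candidate_eval(left), candidate_eval(right)
    if op == "+":
        return x + y
    if op == "-":
        return x - abs(y)  # planted bug: wrong whenever y is negative
    return x * y

def differential_oracle(reference, candidate, trials=1000, seed=0):
    """Return the first expression on which the two disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        expr = random_expr(rng)
        if reference(expr) != candidate(expr):
            return expr
    return None
```

A hand-written test such as `candidate_eval(("-", 5, 3))` passes, because the bug only shows up when the right operand is negative. `differential_oracle(reference_eval, candidate_eval)` searches a much larger space and hands back a concrete failing expression, which is the kind of feedback an agent can act on.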

This piece is about collapsing as many failure-producing degrees of freedom as possible. Zero degrees of freedom is aspirational, but a good aspiration.

## Some Example Scenarios

Besides the miscompilations, Claude’s C Compiler also fell over in terms of quality of generated code. The compiler contains a somewhat elaborate (and plausible-looking) set of optimization passes, but they appear to make very little difference in the quality of its output. But what if the human overseeing the creation of this compiler had included an executable oracle for code quality in its testing loop? Well, I’m 100% speculating here, but my educated guess is that Claude would have been able to incorporate this feedback, and would have done a significantly better job optimizing the generated code. What would this oracle actually look like? I probably would have kept it simple: perhaps a count of the number of instructions that get executed when you run the compiler’s output. I’d also have given it a baseline such as `gcc -O0`, so that the LLM would know where there was the most room for improvement.
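A code-quality oracle of this kind could be sketched roughly as follows. This is an illustration under assumptions, not the setup used for any real compiler: `QualityResult`, `quality_oracle`, and `feedback` are hypothetical names, and how the instruction counts are collected (a hardware counter, an emulator, or something else) is deliberately left outside the sketch.

```python
from dataclasses import dataclass

@dataclass
class QualityResult:
    benchmark: str
    candidate: int  # instructions executed by code from the compiler under test
    baseline: int   # instructions executed by code from the baseline compiler

    @property
    def ratio(self) -> float:
        return self.candidate / self.baseline

def quality_oracle(candidate_counts, baseline_counts):
    """Compare per-benchmark instruction counts; worst ratio first."""
    results = []
    for name, baseline in baseline_counts.items():
        if baseline <= 0:
            raise ValueError(f"baseline count for {name} must be positive")
        results.append(QualityResult(name, candidate_counts[name], baseline))
    return sorted(results, key=lambda r: r.ratio, reverse=True)

def feedback(results, limit=3):
    """Render the worst offenders as text that can be handed back to the agent."""
    return "\n".join(
        f"{r.benchmark}: {r.ratio:.2f}x baseline ({r.candidate} vs {r.baseline} instructions)"
        for r in results[:limit]
    )
```

Both functions take dictionaries mapping benchmark names to executed-instruction counts. Sorting worst-first is a design choice: it points the agent at the benchmarks with the most room for improvement relative to the baseline.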

**Summary:** The LLM was given a degree of freedom with respect to the quality of CCC’s output, and consequently it did a poor job there.

My group (in collaboration with other folks) is working towards automated synthesis of dataflow transfer functions, such as those used by LLVM’s “known bits” analysis. We wrote [a paper about this](https://users.cs.utah.edu/~regehr/papers/popl26.pdf), where we used randomized synthesis techniques, no LLMs. Recently I asked Codex to start writing transfer functions. By itself, it’s not bad at this, but not great. However, given access to our command-line tools for evaluating the precision and verifying the soundness of a transfer function, Codex produced results that are better than anything I’ve seen either in a real compiler like LLVM, or in our own randomized synthesis results. The remaining degree of freedom that I left Codex—code size—allowed it to write pretty large transfer functions that explore some pretty deep case splits on the input structure, but capping the size of generated code is the easiest thing.
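To show what opposing soundness and precision oracles might look like, here is a toy sketch for known-bits transfer functions. It is not our command-line tooling: it checks exhaustively at a 3-bit width, and the representation (a pair of disjoint masks `(zeros, ones)`) and all function names are assumptions made for illustration.

```python
from itertools import product

WIDTH = 3
MASK = (1 << WIDTH) - 1

def abstract_values():
    """Every known-bits value as (zeros, ones): disjoint masks of known bits."""
    return [(z, o) for z in range(MASK + 1) for o in range(MASK + 1) if z & o == 0]

def concretize(a):
    """All concrete integers consistent with the known bits in `a`."""
    zeros, ones = a
    return [x for x in range(MASK + 1) if (x & zeros) == 0 and (x & ones) == ones]

def best_abstraction(values):
    """Most precise known-bits value that covers every value in `values`."""
    zeros = ones = MASK
    for v in values:
        zeros &= ~v & MASK
        ones &= v
    return zeros, ones

def evaluate(transfer, concrete_op):
    """Return (counterexample or None, known bits claimed, known bits possible)."""
    claimed = best = 0
    for a, b in product(abstract_values(), repeat=2):
        results = [concrete_op(x, y) & MASK for x in concretize(a) for y in concretize(b)]
        zeros, ones = transfer(a, b)
        for r in results:
            if (r & zeros) != 0 or (r & ones) != ones:
                return (a, b, r), claimed, best  # soundness oracle fails
        bz, bo = best_abstraction(results)
        claimed += bin(zeros | ones).count("1")
        best += bin(bz | bo).count("1")
    return None, claimed, best

def and_precise(a, b):
    return (a[0] | b[0], a[1] & b[1])

def and_top(a, b):
    return (0, 0)  # sound, but claims nothing

def and_unsound(a, b):
    return (a[0] | b[0], a[1] | b[1])  # claims a 1 if either input has one
```

With `concrete_op = lambda x, y: x & y`, `and_top` passes the soundness check but claims zero known bits, `and_unsound` claims many bits and is rejected with a counterexample, and `and_precise` is sound with `claimed == best`. Neither oracle is enough alone: soundness rewards saying nothing, and precision rewards over-claiming, so a candidate has to satisfy both.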

**Summary:** By pinching the LLM’s results between opposing executable oracles for soundness and precision, synthesis of dataflow