RL without TD learning
📰 BAIR Blog
Reinforcement learning without temporal difference learning using a divide and conquer approach
Action Steps
- Understand the limitations of traditional temporal difference learning in reinforcement learning
- Learn about the divide and conquer approach and its potential to scale to long-horizon tasks
- Explore the application of this approach in off-policy reinforcement learning
- Investigate the use of this approach in domains where data collection is expensive, such as robotics and healthcare
Who Needs to Know This
Researchers and engineers working on reinforcement learning and robotics can benefit from this approach as it provides a new paradigm for value learning that can scale to complex, long-horizon tasks
Key Insight
💡 The divide and conquer approach provides a fundamentally different way to solve the error accumulation problem in value learning
Share This
💡 Reinforcement learning without TD learning: a new divide and conquer approach
Key Takeaways
Reinforcement learning without temporal difference learning using a divide and conquer approach
Full Article
# RL without TD learning – The Berkeley Artificial Intelligence Research Blog
[](http://bair.berkeley.edu/blog/)
[Subscribe](http://bair.berkeley.edu/blog/subscribe/)[About](http://bair.berkeley.edu/blog/about/)[Archive](http://bair.berkeley.edu/blog/archive/)[BAIR](http://bair.berkeley.edu/)
# RL without TD learning
[Seohong Park](https://seohong.me/) Nov 1, 2025
In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: **divide and conquer**. Unlike traditional methods, this algorithm is _not_ based on temporal difference (TD) learning (which has [scalability challenges](https://seohong.me/blog/q-learning-is-not-yet-scalable/)), and scales well to long-horizon tasks.

_We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning._
## Problem setting: off-policy RL
Our problem setting is **off-policy RL**. Let’s briefly review what this means.
There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can _only_ use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.
Off-policy RL means we don’t have this restriction: we can use _any_ kind of data, including old experience, human demonstrations, Internet data, and so on. So off-policy RL is more general and flexible than on-policy RL (and of course harder!). Q-learning is the most well-known off-policy RL algorithm. In domains where data collection is expensive (_e.g._, **robotics**, dialogue systems, healthcare, etc.), we often have no choice but to use off-policy RL. That’s why it’s such an important problem.
As of 2025, I think we have reasonably good recipes for scaling up on-policy RL (_e.g._, PPO, GRPO, and their variants). However, we still haven’t found a “scalable” _off-policy RL_ algorithm that scales well to complex, long-horizon tasks. Let me briefly explain why.
## Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)
In off-policy RL, we typically train a value function using temporal difference (TD) learning (_i.e._, Q-learning), with the following Bellman update rule:
Q(s,a)←r+γ max a′Q(s′,a′),
The problem is this: the error in the next value Q(s′,a′) propagates to the current value Q(s,a) through bootstrapping, and these errors _accumulate_ over the entire horizon. This is basically what makes TD learning struggle to scale to long-horizon tasks (see [this post](https://seohong.me/blog/q-learning-is-not-yet-scalable/) if you’re interested in more details).
To mitigate this problem, people have mixed TD learning with Monte Carlo (MC) returns. For example, we can do n-step TD learning (TD-n):
Q(s t,a t)←∑i=0 n−1 γ i r t+i+γ n max a′Q(s t+n,a′).
Here, we use the actual Monte Carlo return (from the dataset) for the first n steps, and then use the bootstrapped value for the rest of the horizon. This way, we can reduce the number of Bellman recursions by n times, so errors accumulate less. In the extreme case of n=∞, we recover pure Monte Carlo value learning.
While this is a reasonable solution (and often [works well](https://arxiv.org/abs/2506.04168)), it is highly unsatisfactory. First, it doesn’t _fundamentally_ solve the error accumulation problem; it only reduces the number of Bellman recursions by a constant factor (n). Second, as n grows, we suffer from high variance and suboptimality. So we can’t just set n to a large value, and need to carefully tune it for each task.
Is there a fundamentally different way to solve this problem?
## The “Third” Paradigm: Divide and Conquer
My claim is that a _third_ paradigm in value learning, **divide and conquer**, may provide an
[](http://bair.berkeley.edu/blog/)
[Subscribe](http://bair.berkeley.edu/blog/subscribe/)[About](http://bair.berkeley.edu/blog/about/)[Archive](http://bair.berkeley.edu/blog/archive/)[BAIR](http://bair.berkeley.edu/)
# RL without TD learning
[Seohong Park](https://seohong.me/) Nov 1, 2025
In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: **divide and conquer**. Unlike traditional methods, this algorithm is _not_ based on temporal difference (TD) learning (which has [scalability challenges](https://seohong.me/blog/q-learning-is-not-yet-scalable/)), and scales well to long-horizon tasks.

_We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning._
## Problem setting: off-policy RL
Our problem setting is **off-policy RL**. Let’s briefly review what this means.
There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can _only_ use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.
Off-policy RL means we don’t have this restriction: we can use _any_ kind of data, including old experience, human demonstrations, Internet data, and so on. So off-policy RL is more general and flexible than on-policy RL (and of course harder!). Q-learning is the most well-known off-policy RL algorithm. In domains where data collection is expensive (_e.g._, **robotics**, dialogue systems, healthcare, etc.), we often have no choice but to use off-policy RL. That’s why it’s such an important problem.
As of 2025, I think we have reasonably good recipes for scaling up on-policy RL (_e.g._, PPO, GRPO, and their variants). However, we still haven’t found a “scalable” _off-policy RL_ algorithm that scales well to complex, long-horizon tasks. Let me briefly explain why.
## Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)
In off-policy RL, we typically train a value function using temporal difference (TD) learning (_i.e._, Q-learning), with the following Bellman update rule:
Q(s,a)←r+γ max a′Q(s′,a′),
The problem is this: the error in the next value Q(s′,a′) propagates to the current value Q(s,a) through bootstrapping, and these errors _accumulate_ over the entire horizon. This is basically what makes TD learning struggle to scale to long-horizon tasks (see [this post](https://seohong.me/blog/q-learning-is-not-yet-scalable/) if you’re interested in more details).
To mitigate this problem, people have mixed TD learning with Monte Carlo (MC) returns. For example, we can do n-step TD learning (TD-n):
Q(s t,a t)←∑i=0 n−1 γ i r t+i+γ n max a′Q(s t+n,a′).
Here, we use the actual Monte Carlo return (from the dataset) for the first n steps, and then use the bootstrapped value for the rest of the horizon. This way, we can reduce the number of Bellman recursions by n times, so errors accumulate less. In the extreme case of n=∞, we recover pure Monte Carlo value learning.
While this is a reasonable solution (and often [works well](https://arxiv.org/abs/2506.04168)), it is highly unsatisfactory. First, it doesn’t _fundamentally_ solve the error accumulation problem; it only reduces the number of Bellman recursions by a constant factor (n). Second, as n grows, we suffer from high variance and suboptimality. So we can’t just set n to a large value, and need to carefully tune it for each task.
Is there a fundamentally different way to solve this problem?
## The “Third” Paradigm: Divide and Conquer
My claim is that a _third_ paradigm in value learning, **divide and conquer**, may provide an
DeepCamp AI