RL without TD learning

📰 BAIR Blog

Reinforcement learning without temporal difference learning using a divide and conquer approach

advanced Published 1 Nov 2025

Action Steps

Understand the limitations of traditional temporal difference learning in reinforcement learning
Learn about the divide and conquer approach and its potential to scale to long-horizon tasks
Explore the application of this approach in off-policy reinforcement learning
Investigate the use of this approach in domains where data collection is expensive, such as robotics and healthcare

Who Needs to Know This

Researchers and engineers working on reinforcement learning and robotics can benefit from this approach as it provides a new paradigm for value learning that can scale to complex, long-horizon tasks

Key Insight

💡 The divide and conquer approach provides a fundamentally different way to solve the error accumulation problem in value learning

Key Takeaways

Reinforcement learning without temporal difference learning using a divide and conquer approach

Full Article

# RL without TD learning – The Berkeley Artificial Intelligence Research Blog

[![Image 1](http://bair.berkeley.edu/blog/assets/BAIR_Logo.png)](http://bair.berkeley.edu/blog/)

[Subscribe](http://bair.berkeley.edu/blog/subscribe/)[About](http://bair.berkeley.edu/blog/about/)[Archive](http://bair.berkeley.edu/blog/archive/)[BAIR](http://bair.berkeley.edu/)

# RL without TD learning

[Seohong Park](https://seohong.me/) Nov 1, 2025

In this post, I’ll introduce a reinforcement learning (RL) algorithm based on an “alternative” paradigm: **divide and conquer**. Unlike traditional methods, this algorithm is _not_ based on temporal difference (TD) learning (which has [scalability challenges](https://seohong.me/blog/q-learning-is-not-yet-scalable/)), and scales well to long-horizon tasks.

![Image 2](https://bair.berkeley.edu/static/blog/rl-without-td-learning/teaser_short.png)

_We can do Reinforcement Learning (RL) based on divide and conquer, instead of temporal difference (TD) learning._

## Problem setting: off-policy RL

Our problem setting is **off-policy RL**. Let’s briefly review what this means.

There are two classes of algorithms in RL: on-policy RL and off-policy RL. On-policy RL means we can _only_ use fresh data collected by the current policy. In other words, we have to throw away old data each time we update the policy. Algorithms like PPO and GRPO (and policy gradient methods in general) belong to this category.

Off-policy RL means we don’t have this restriction: we can use _any_ kind of data, including old experience, human demonstrations, Internet data, and so on. So off-policy RL is more general and flexible than on-policy RL (and of course harder!). Q-learning is the most well-known off-policy RL algorithm. In domains where data collection is expensive (_e.g._, **robotics**, dialogue systems, healthcare, etc.), we often have no choice but to use off-policy RL. That’s why it’s such an important problem.

As of 2025, I think we have reasonably good recipes for scaling up on-policy RL (_e.g._, PPO, GRPO, and their variants). However, we still haven’t found a “scalable” _off-policy RL_ algorithm that scales well to complex, long-horizon tasks. Let me briefly explain why.

## Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)

In off-policy RL, we typically train a value function using temporal difference (TD) learning (_i.e._, Q-learning), with the following Bellman update rule:

Q(s,a)←r+γ max a′Q(s′,a′),

The problem is this: the error in the next value Q(s′,a′) propagates to the current value Q(s,a) through bootstrapping, and these errors _accumulate_ over the entire horizon. This is basically what makes TD learning struggle to scale to long-horizon tasks (see [this post](https://seohong.me/blog/q-learning-is-not-yet-scalable/) if you’re interested in more details).

To mitigate this problem, people have mixed TD learning with Monte Carlo (MC) returns. For example, we can do n-step TD learning (TD-n):

Q(s t,a t)←∑i=0 n−1 γ i r t+i+γ n max a′Q(s t+n,a′).

Here, we use the actual Monte Carlo return (from the dataset) for the first n steps, and then use the bootstrapped value for the rest of the horizon. This way, we can reduce the number of Bellman recursions by n times, so errors accumulate less. In the extreme case of n=∞, we recover pure Monte Carlo value learning.

While this is a reasonable solution (and often [works well](https://arxiv.org/abs/2506.04168)), it is highly unsatisfactory. First, it doesn’t _fundamentally_ solve the error accumulation problem; it only reduces the number of Bellman recursions by a constant factor (n). Second, as n grows, we suffer from high variance and suboptimality. So we can’t just set n to a large value, and need to carefully tune it for each task.

Is there a fundamentally different way to solve this problem?

## The “Third” Paradigm: Divide and Conquer

My claim is that a _third_ paradigm in value learning, **divide and conquer**, may provide an

Read full paper → ← Back to Reads