Multimodal Diffusion Forcing for Forceful Manipulation

📰 ArXiv cs.AI

arXiv:2511.04812v2. Abstract: Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing, a unified framewor

Published 14 Apr 2026
