dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
arXiv:2603.18806v2 Announce Type: replace
Abstract: Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unm…
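As background (not the paper's own formulation), the sketch below illustrates what the trajectory probability calculation targeted by this work typically looks like for a masked dLLM: the trajectory log-probability is accumulated over denoising steps, with one forward pass per step. It assumes a HuggingFace-style model whose forward pass returns `.logits`; the function name `trajectory_log_prob` and the inputs `states`, `unmask_positions`, and `target_ids` are illustrative placeholders, not identifiers from the paper.

```python
import torch

def trajectory_log_prob(model, states, unmask_positions, target_ids):
    """Generic sketch of a dLLM decoding-trajectory log-probability.

    states[t]           : (seq_len,) token ids, with mask ids at still-hidden slots
    unmask_positions[t] : indices revealed at denoising step t
    target_ids[t]       : tokens placed at those indices at step t

    Each step needs its own forward pass, so the cost grows linearly with
    the number of denoising steps -- the overhead that trajectory-reduction
    approaches such as dTRPO aim to cut down (details are in the paper).
    """
    total = torch.zeros(())
    for state, pos, tgt in zip(states, unmask_positions, target_ids):
        logits = model(state.unsqueeze(0)).logits[0]        # (seq_len, vocab)
        log_probs = torch.log_softmax(logits[pos], dim=-1)   # (k_t, vocab)
        total = total + log_probs.gather(-1, tgt.unsqueeze(-1)).sum()
    return total
```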