Offline Policy Optimization with Posterior Sampling

arXiv cs.AI

arXiv:2605.07393v1 (Announce Type: new)

Abstract: A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices …
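To make "pessimistic regularization" concrete, below is a minimal sketch of one common form used in model-based offline RL: subtracting an uncertainty penalty from the model's reward, r̃(s, a) = r(s, a) − λ·u(s, a), where u is estimated from disagreement among an ensemble of learned dynamics models (in the spirit of MOPO). This is a swapped-in illustration of the baseline family the abstract criticizes, not the posterior-sampling method this paper proposes. The function names, the max-distance-to-mean disagreement measure, and the weight `lam` are all hypothetical choices made for the example.

```python
import math
from typing import Sequence


def ensemble_disagreement(next_state_preds: Sequence[Sequence[float]]) -> float:
    """Largest L2 distance between any ensemble member's prediction and the ensemble mean.

    Hypothetical uncertainty proxy: members that agree suggest an in-distribution
    input; large disagreement is treated as a sign of an OOD state-action pair.
    """
    n = len(next_state_preds)
    dim = len(next_state_preds[0])
    # Ensemble-mean prediction of the next state, dimension by dimension.
    mean = [sum(p[d] for p in next_state_preds) / n for d in range(dim)]
    # Worst-case deviation of any single member from that mean.
    return max(
        math.sqrt(sum((p[d] - mean[d]) ** 2 for d in range(dim)))
        for p in next_state_preds
    )


def penalized_reward(
    reward: float, next_state_preds: Sequence[Sequence[float]], lam: float
) -> float:
    """Pessimistic reward r - lam * u(s, a); `lam` is a hypothetical penalty weight."""
    return reward - lam * ensemble_disagreement(next_state_preds)


if __name__ == "__main__":
    agree = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # members agree: no penalty
    disagree = [[0.0, 0.0], [2.0, 0.0]]           # members disagree: u = 1.0
    print(penalized_reward(1.0, agree, lam=0.5))     # 1.0
    print(penalized_reward(1.0, disagree, lam=0.5))  # 0.5
```

Raising `lam` makes the agent more conservative, but the penalty fires on every OOD transition, including ones the model may predict correctly. That is the generalization cost of heavy pessimism that the abstract points to.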

Published 11 May 2026