PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

📰 ArXiv cs.AI

arXiv:2604.28123v1. Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning […]

Published 1 May 2026