PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
📰 ArXiv cs.AI
arXiv:2604.08986v1 Announce Type: cross Abstract: Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction-following performance by assigning them specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time