Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

📰 ArXiv cs.AI

arXiv:2604.11120v1

Abstract: Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system promp

Published 14 Apr 2026