Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
📰 ArXiv cs.AI
arXiv:2604.11120v1
Abstract: Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system promp