THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
📰 ArXiv cs.AI
arXiv:2601.23143v2 Announce Type: replace
Abstract: Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reas