THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

📰 ArXiv cs.AI

arXiv:2601.23143v2

Abstract: Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reas

Published 11 May 2026