Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

📰 ArXiv cs.AI

arXiv:2604.09544v1 (cross-listed)

Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe […]
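To make the abstract's core idea concrete, here is a minimal, hypothetical sketch of "targeted weight pruning as a causal intervention": zero out a selected subset of weights and compare a behavioral metric before and after. The toy model, the `behavior_score` proxy, and the magnitude-based selection heuristic are all assumptions for illustration, not the paper's actual method or models.

```python
# Hypothetical sketch: targeted weight pruning as a causal probe.
# Toy stand-ins throughout; not the paper's models, data, or criteria.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for an LLM component: a tiny two-layer MLP.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16)  # stand-in batch of inputs

def behavior_score(m: nn.Module) -> float:
    """Hypothetical proxy for a behavioral metric (e.g. harmfulness):
    here, just the mean logit of output class 0."""
    with torch.no_grad():
        return m(x)[:, 0].mean().item()

baseline = behavior_score(model)

# Targeted selection: rank first-layer weights by magnitude and zero the
# top-k, simulating an intervention on a putatively important subset.
layer = model[0]
k = 64
flat = layer.weight.detach().abs().flatten()
mask = torch.ones_like(flat)
mask[flat.topk(k).indices] = 0.0

with torch.no_grad():
    layer.weight.mul_(mask.view_as(layer.weight))

pruned = behavior_score(model)
print(f"behavior score: baseline={baseline:.4f}, after pruning={pruned:.4f}")
# In this style of experiment, a large behavioral shift from removing few,
# targeted weights is taken as evidence of a localized mechanism; an
# unchanged score suggests the behavior is not mediated by that subset.
```

The causal logic is the contrast itself: the pruning is the intervention, and the before/after difference in the behavioral metric is the measured effect.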

Published 13 Apr 2026