Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

📰 ArXiv cs.AI

arXiv:2604.10326v1 | Announce Type: cross

Abstract: Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering (HMNS), a circuit-level intervention that (i) identifies the attention heads most causally responsible for a model's default behavior, (ii) suppresses their write paths via targeted column masking, and (iii) injects a
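The abstract is truncated before step (iii) is fully described, but steps (i) and (ii) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the causal scores, the score threshold, and the choice to project the steering direction onto the orthogonal complement of the surviving heads' write subspace are all assumptions filled in from the title.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
d_head = d_model // n_heads

# Output projection W_O: maps concatenated head outputs (d_model) -> residual stream.
# Columns h*d_head:(h+1)*d_head are head h's "write path" into the residual stream.
W_O = rng.normal(size=(d_model, d_model))

# (i) Per-head causal scores -- hypothetical values standing in for whatever
# causal-attribution procedure the paper uses to rank heads.
scores = np.array([0.9, 0.1, 0.8, 0.2])
masked_heads = np.where(scores > 0.5)[0]  # threshold is an assumption

# (ii) Suppress the selected heads' write paths by zeroing their columns in W_O.
W_masked = W_O.copy()
for h in masked_heads:
    W_masked[:, h * d_head:(h + 1) * d_head] = 0.0

# (iii, assumed) Build a steering direction that lives outside the column space
# of the remaining (unmasked) write paths, so the injection cannot be undone by
# the surviving heads' writes. QR gives an orthonormal basis for that column space.
remaining_cols = W_masked[:, W_masked.any(axis=0)]
Q, _ = np.linalg.qr(remaining_cols)
v = rng.normal(size=d_model)
v_null = v - Q @ (Q.T @ v)  # component orthogonal to the remaining write subspace

# Forward pass for one token: masked head writes plus the injected steering vector.
h_concat = rng.normal(size=d_model)  # stand-in for concatenated head outputs
x_out = W_masked @ h_concat + v_null
```

After the projection, `v_null` is orthogonal to every remaining write-path column, so `Q.T @ v_null` is numerically zero; the masked heads contribute nothing to `x_out` because their columns in `W_masked` are exactly zero.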

Published 14 Apr 2026