RepIt: Steering Language Models with Concept-Specific Refusal Vectors

📰 arXiv cs.AI

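Learn how RepIt isolates concept-specific refusal vectors in language model activations, and why selective suppression of refusal on targeted concepts exposes vulnerabilities that benchmark-based safety evaluations may miss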

advanced Published 22 Apr 2026

Action Steps

Build a RepIt-style pipeline that isolates concept-specific representations in LM activations
Run experiments to measure how effectively the isolated vectors suppress refusal on the targeted concepts
Steer a model with those vectors to probe for localized vulnerabilities that benchmark-based assessments may miss
Test several concept-specific refusal vectors to assess how robust the effect is
Feed the findings back into safety evaluations so they cover localized, concept-level vulnerabilities
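
The first steps above can be illustrated with a toy sketch. The abstract does not spell out RepIt's algorithm, so the code below is not the paper's method: it swaps in two generic techniques from the activation-steering literature, a difference-of-means direction and orthogonal projection, to show what "isolating a concept-specific representation" could look like. The activations are synthetic, and every name (`synthetic`, `difference_of_means`, `project_out`, `ablate`) is a hypothetical helper introduced for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def difference_of_means(refused, complied):
    """Candidate direction: mean(refused activations) - mean(complied activations)."""
    return sub(mean_vector(refused), mean_vector(complied))

def project_out(v, basis):
    """Remove from v its components along the span of `basis` (Gram-Schmidt)."""
    ortho = []
    for b in basis:
        for q in ortho:
            b = sub(b, scale(q, dot(b, q)))
        norm = dot(b, b) ** 0.5
        if norm > 1e-12:
            ortho.append(scale(b, 1.0 / norm))
    for q in ortho:
        v = sub(v, scale(q, dot(v, q)))
    return v

def ablate(activation, direction):
    """Remove the component of one activation along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(activation)
    return sub(activation, scale(direction, dot(activation, direction) / norm_sq))

# --- Synthetic data (assumption: stands in for real hidden states) ---
rng = random.Random(0)
DIM = 8

def synthetic(offset, n=20):
    """Hypothetical activations: a fixed offset plus small Gaussian noise."""
    return [[o + rng.gauss(0, 0.05) for o in offset] for _ in range(n)]

# Axis 0 plays a shared "refusal" feature; axes 1 and 2 play two concepts.
target_refused = synthetic([1.0, 1.0] + [0.0] * (DIM - 2))
other_refused = synthetic([1.0, 0.0, 1.0] + [0.0] * (DIM - 3))
complied = synthetic([0.0] * DIM)

target_dir = difference_of_means(target_refused, complied)
other_dir = difference_of_means(other_refused, complied)

# Keep only the part of the target direction not explained by the other concept.
concept_dir = project_out(target_dir, [other_dir])

# Steering by ablation: the edited activation has no component along concept_dir.
steered = ablate(target_refused[0], concept_dir)
print("overlap with other concept:", round(dot(concept_dir, other_dir), 9))
print("residual along concept_dir:", round(dot(steered, concept_dir), 9))
```

Projecting out the non-target direction is one simple way to keep an intervention from spilling over into unrelated concepts; with a real model, the synthetic lists would be replaced by hidden states captured at a chosen layer, and the effect would be checked by comparing refusal rates on target and non-target prompts.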

Who Needs to Know This

NLP engineers and AI safety researchers who evaluate language models: RepIt shows how targeted, concept-level vulnerabilities can slip past benchmark-based assessments

Key Insight

💡 RepIt enables selective suppression of refusal on targeted concepts, a capability the authors call more concerning than the broad interventions of existing steering methods

Full Article

Title: RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Abstract:
arXiv:2509.13281v5. Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations. While existing steering methods already achieve high attack success rates through broad interventions, RepIt enables a more concerning capability: selective suppression of refusal on targeted concepts wh
