Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

📰 arXiv cs.AI

arXiv:2604.08846v1 (announce type: cross)

Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time […]
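The general technique the abstract alludes to, steering a frozen model's activations at inference time, can be sketched as below. This is a minimal toy illustration of the idea, not the paper's dictionary-aligned method: the two-layer network, the hand-picked "concept" direction, and the steering strength `alpha` are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of inference-time activation steering on a frozen model.
# The toy two-layer network and the hand-picked concept direction are
# illustrative assumptions, not the paper's learned concept dictionary.

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8))   # frozen weights: never updated
W2 = rng.standard_normal((8, 4))

concept = np.zeros(8)
concept[0] = 1.0                   # hypothetical "unsafe concept" direction
alpha = 5.0                        # steering strength

def forward(x, steer=False):
    h = np.maximum(x @ W1, 0.0)    # hidden activations
    if steer:
        # Shift activations away from the concept direction at inference
        # time; the weights themselves stay untouched.
        h = h - alpha * concept
    return h @ W2

x = rng.standard_normal((1, 8))
baseline = forward(x)
steered = forward(x, steer=True)
print(np.allclose(baseline, steered))  # False: steering changed the output
```

The key property is that only the intermediate activations are modified, so no query needs to be rerun and no gradient updates or finetuning of the frozen weights are required.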

Published 13 Apr 2026