Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

📰 ArXiv cs.AI

arXiv:2604.11666v1 Announce Type: cross Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., to form and use a theory of mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a double agent to steer the belief …

Published 14 Apr 2026