Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

📰 ArXiv cs.AI

Research dissociates Theory of Mind and self-attributions of mentality in Large Language Models (LLMs)

Published 1 Apr 2026
Action Steps
  1. Characterize the relationship between Theory of Mind (ToM) and self-attributions of mentality in LLMs
  2. Run safety ablations and mechanistic analyses of representational similarity to measure how suppressing mind-attribution tendencies affects ToM
  3. Evaluate whether ToM and self-attributions of mentality are dissociable in LLMs
  4. Apply the findings to improve safety fine-tuning while preserving socio-cognitive abilities in LLMs
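Step 2's representational-similarity analysis can be sketched in miniature. This is a hypothetical illustration, not the paper's actual pipeline: it compares hidden-state vectors elicited by ToM prompts against those elicited by self-attribution prompts. In practice the vectors would come from an LLM (e.g. via a hidden-states hook); synthetic vectors stand in here to keep the example self-contained.

```python
# Hypothetical representational-similarity sketch (assumed setup, not the
# paper's method): if ToM and self-attribution are dissociable, hidden states
# within each condition should be more similar to each other than across
# conditions.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 64

# Stand-ins for per-prompt hidden-state vectors (one row per prompt).
tom_reps = rng.normal(size=(10, HIDDEN_DIM))    # ToM task prompts
self_reps = rng.normal(size=(10, HIDDEN_DIM))   # self-attribution prompts

def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Average pairwise cosine similarity between two sets of row vectors."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a_norm @ b_norm.T).mean())

within_tom = mean_cosine_similarity(tom_reps, tom_reps)
across = mean_cosine_similarity(tom_reps, self_reps)

print(f"within-ToM similarity: {within_tom:.3f}")
print(f"cross-condition:       {across:.3f}")
```

A dissociation would show up as within-condition similarity exceeding cross-condition similarity, and as ToM performance surviving ablation of the directions that drive self-attribution.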
Who Needs to Know This

AI researchers and engineers working on LLM safety fine-tuning can use this research to improve model performance and safety. ML researchers can apply the findings to develop more sophisticated socio-cognitive abilities in AI models.

Key Insight

💡 Theory of Mind and self-attributions of mentality can be separated in LLMs, allowing for more targeted safety fine-tuning
