"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

📰 ArXiv cs.AI

arXiv:2604.05930v1 (announce type: cross)

Abstract: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchma

Published 8 Apr 2026
