Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

📰 ArXiv cs.AI

Graph-of-Mark is a graph-based visual prompting technique aimed at improving spatial reasoning in multimodal language models

Advanced | Published 27 Mar 2026

Action Steps

Partition the input image into object regions
Annotate the regions with graph-based marks
Feed the augmented image to the multimodal language model
Evaluate the model's spatial reasoning capabilities
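
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Region` class, the hard-coded boxes, the dominant-axis rule in `relation`, and the text serialization in `describe` are all assumptions made for the example. In practice the regions would come from a segmentation or detection model, and the marks would be drawn onto the image before it is passed to the multimodal language model.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    """Hypothetical object region: a numeric mark, a label, and a pixel box."""
    mark: int
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max), origin at the top-left corner

    @property
    def center(self):
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2, (y0 + y1) / 2)


def relation(a, b):
    """Coarse relation of a relative to b, picked by the larger axis offset.

    Assumed rule for illustration only; the source does not specify one.
    """
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def build_graph(regions):
    """Return one directed edge (source mark, relation, target mark) per pair."""
    return [(a.mark, relation(a, b), b.mark) for a, b in combinations(regions, 2)]


def describe(regions, edges):
    """Serialize the graph as plain text, one line per edge."""
    names = {r.mark: f"[{r.mark}] {r.label}" for r in regions}
    return "\n".join(f"{names[s]} is {rel} {names[t]}" for s, rel, t in edges)


# Made-up regions standing in for the output of a segmentation step.
regions = [
    Region(1, "cup", (10, 60, 50, 100)),
    Region(2, "laptop", (80, 50, 200, 120)),
    Region(3, "lamp", (100, 0, 140, 40)),
]
edges = build_graph(regions)
print(describe(regions, edges))
```

The nodes here are the numbered regions and the edges are pairwise spatial relations. A real pipeline would likely prune edges, for example by keeping only nearby pairs, so the overlay stays readable.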

Who Needs to Know This

AI engineers and ML researchers who work with multimodal language models, particularly on tasks that require spatial reasoning and visual understanding.

Key Insight

💡 Graph-based visual prompting can improve the spatial reasoning capabilities of multimodal language models