Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

📰 ArXiv cs.AI

arXiv:2606.13156v1 Announce Type: cross Abstract: Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a drop of roughly 31 percentage points), revealing a fundamental gap between grounding capability and self-correction...
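
For readers unfamiliar with the metric, the sketch below shows how Acc@0.5 is commonly computed for referring expression comprehension: a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches the threshold. This is an illustrative assumption, not the paper's evaluation code. The names `iou` and `acc_at_iou`, the `(x1, y1, x2, y2)` box format, and the `>=` comparison (some benchmarks use a strict `>`) are all hypothetical choices made for this example.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy boxes for illustration only, not data from the paper.
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(acc_at_iou(predictions, ground_truth))
```

In the toy example, the first prediction matches its target exactly (IoU of 1) and the second overlaps only partially (IoU of 1/7), so one of two predictions clears the 0.5 threshold. The reported collapse from 79.6% to 48.7% corresponds to a difference of 30.9 percentage points in this kind of accuracy.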

Published 12 Jun 2026