VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
📰 ArXiv cs.AI
arXiv:2604.09531v1 Announce Type: cross Abstract: Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce