CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

📰 ArXiv cs.AI

arXiv:2511.19820v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM, an external low-cost method for boosting performance that enables VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details.
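The "zoom in" idea from the abstract can be illustrated with a minimal, dependency-free sketch: given a predicted region of interest, crop it out of the image and upsample it before passing it to the VLM. This is only an illustration of the general crop-and-zoom technique, not the paper's actual method; the `zoom_crop` function, the bounding-box format, and the nearest-neighbor upsampling are all assumptions made here for clarity.

```python
def zoom_crop(pixels, box, out_w, out_h):
    """Crop a region from a 2D pixel grid and upsample it (nearest neighbor).

    pixels: list of rows (H x W), each cell a pixel value.
    box:    (left, top, right, bottom) in pixel coordinates, right/bottom exclusive.
    Returns an out_h x out_w grid covering only the cropped region,
    so fine details in that region occupy more of the model's input resolution.
    """
    left, top, right, bottom = box
    # Extract the region of interest.
    crop = [row[left:right] for row in pixels[top:bottom]]
    ch, cw = len(crop), len(crop[0])
    # Nearest-neighbor upsample to the target resolution.
    return [
        [crop[i * ch // out_h][j * cw // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy example: a 4x4 "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
zoomed = zoom_crop(image, box=(1, 1, 3, 3), out_w=4, out_h=4)
```

In a real pipeline, a region-proposal step (in CropVLM's case, a learned cropping model) would supply `box`, and the zoomed crop would be re-encoded by the VLM's vision tower at full input resolution.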

Published 15 Apr 2026