Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

📰 arXiv cs.AI

arXiv:2604.09687v1 (cross-listed)

Abstract: Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2
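The task setup described in the abstract — a color grid plus a color-to-number mapping, with the expected output being the mapped matrix — can be sketched as follows. This is a minimal illustration assuming a rendered grid of named colors; the function name `make_g2m_instance` and all details are hypothetical, not taken from the paper.

```python
import random

def make_g2m_instance(rows, cols, colors, seed=0):
    """Build a hypothetical G2M-style instance: a color grid, a
    color-to-number mapping, and the ground-truth matrix obtained
    by applying the mapping cell by cell."""
    rng = random.Random(seed)
    mapping = {c: i for i, c in enumerate(colors)}  # color -> number
    grid = [[rng.choice(colors) for _ in range(cols)] for _ in range(rows)]
    target = [[mapping[c] for c in row] for row in grid]
    return grid, mapping, target

grid, mapping, target = make_g2m_instance(3, 4, ["red", "green", "blue"])
# A model shown `grid` (as an image) and `mapping` should output `target`
# exactly; any mismatched cell indicates an incomplete readout of the image.
```

Varying `rows`, `cols`, and the length of `colors` corresponds to the difficulty axes the abstract mentions (grid size and number of colors).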

Published 14 Apr 2026