Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

📰 ArXiv cs.AI

The responses of large vision language models may not reflect what their internal representations encode about visual documents

Advanced | Published 7 Apr 2026

Action Steps

Evaluate large vision language models on visual document understanding benchmarks
Compare the generated responses with the models' internal representations to locate where the two diverge
Probe the internal representations to see how the models process and encode visual documents
Develop evaluation metrics that go beyond generated responses and assess what the models encode internally
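
The steps above can be sketched as a comparison between a probe trained on hidden states and the accuracy of the generated responses. This is a minimal illustration, not the paper's method. It assumes that answers can be mapped to integer class ids, that one hidden-state vector per example has already been extracted from the model, and that a nearest-centroid probe is an acceptable stand-in for whatever probing technique is actually used. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_centroid_probe(features: Sequence[Vector],
                       labels: Sequence[int]) -> Dict[int, List[float]]:
    """Compute one mean hidden-state vector (centroid) per answer class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def probe_predict(centroids: Dict[int, List[float]], vec: Vector) -> int:
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def accuracy(preds: Sequence[int], golds: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def representation_response_gap(
    train_feats: Sequence[Vector],
    train_labels: Sequence[int],
    test_feats: Sequence[Vector],
    test_labels: Sequence[int],
    response_preds: Sequence[int],
) -> Tuple[float, float, float]:
    """Return (probe accuracy, response accuracy, probe minus response).

    response_preds are the model's generated answers on the test items,
    already mapped to class ids (an assumption of this sketch).
    """
    centroids = fit_centroid_probe(train_feats, train_labels)
    probe_preds = [probe_predict(centroids, v) for v in test_feats]
    probe_acc = accuracy(probe_preds, test_labels)
    response_acc = accuracy(response_preds, test_labels)
    return probe_acc, response_acc, probe_acc - response_acc
```

A positive third value would suggest that the hidden states carry answer information that the generated responses fail to express. The probe is fitted on a split separate from the test items so that its accuracy reflects what the representations encode rather than memorized examples.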

Who Needs to Know This

AI researchers and engineers working on visual document understanding. Knowing that a model's responses can diverge from its internal representations can inform how they evaluate models and develop more accurate ones.

Key Insight

💡 Evaluating large vision language models only by their generated responses may misjudge their visual document understanding, since responses can fall short of what the internal representations encode