Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

📰 ArXiv cs.AI

arXiv:2604.13054v1 Announce Type: cross

Abstract: Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as

Published 16 Apr 2026
