MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

📰 ArXiv cs.AI

arXiv:2308.12067v3 Announce Type: replace-cross

Abstract: Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning on supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset comprising only 200 example…
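
The abstract describes a two-stage recipe: pre-train on image-text pairs, then run supervised fine-tuning on a small, curated instruction set. The sketch below is a minimal, self-contained illustration of that second stage, not the paper's code: the toy model, random stand-in data, and hyperparameters are assumptions chosen only to show the shape of LIMA-style fine-tuning on a tiny dataset.

```python
# Minimal sketch of supervised fine-tuning on a tiny instruction set
# (a handful of random stand-ins here, ~200 curated examples in the paper).
# ToyVLM and all data below are illustrative placeholders, not MM-LIMA.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

VOCAB, IMG_DIM, HID = 100, 16, 32

class ToyVLM(nn.Module):
    """Stand-in for a pretrained vision-language model."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_DIM, HID)   # project image features to hidden size
        self.embed = nn.Embedding(VOCAB, HID)     # token embeddings
        self.head = nn.Linear(HID, VOCAB)         # next-token prediction head

    def forward(self, image_feats, input_ids, labels):
        img = self.img_proj(image_feats).unsqueeze(1)        # (B, 1, HID)
        txt = self.embed(input_ids)                          # (B, T, HID)
        hidden = torch.cat([img, txt], dim=1).mean(dim=1)    # crude pooled fusion
        logits = self.head(hidden)                           # (B, VOCAB)
        return nn.functional.cross_entropy(logits, labels)

# Tiny curated instruction set: (image features, prompt tokens, target token).
images = torch.randn(8, IMG_DIM)
prompts = torch.randint(0, VOCAB, (8, 5))
targets = torch.randint(0, VOCAB, (8,))
loader = DataLoader(TensorDataset(images, prompts, targets), batch_size=4, shuffle=True)

model = ToyVLM()
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                 # a few passes suffice for so little data
    for image_feats, input_ids, labels in loader:
        optim.zero_grad()
        loss = model(image_feats, input_ids, labels)
        loss.backward()
        optim.step()
```

The point of the sketch is the data regime, not the architecture: with only a few hundred examples, the fine-tuning loop is ordinary supervised training, and the leverage comes entirely from the quality of the curated instruction-response pairs.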

Published 14 Apr 2026