CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

📰 ArXiv cs.AI

CARV is a diagnostic benchmark that tests compositional analogical reasoning in multimodal large language models (MLLMs).

Published 31 Mar 2026
Action Steps
  1. Identify the limitations of existing evaluations for analogical reasoning in MLLMs
  2. Develop a novel task and dataset that tests compositional analogical reasoning
  3. Evaluate MLLMs using the CARV benchmark to assess their ability to compose rules from multiple sources
  4. Analyze the results to identify weaknesses in compositional reasoning and guide model improvements
Who Needs to Know This

AI researchers and engineers working on multimodal LLMs can use CARV to evaluate and improve their models' compositional analogical reasoning capabilities.

Key Insight

💡 CARV addresses a gap in existing evaluations by testing whether models can compose rules drawn from multiple sources.
