M*: A Modular, Extensible, Serving System for Multimodal Models

📰 ArXiv cs.AI

arXiv:2606.12688v1 Announce Type: cross

Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving fram

Published 12 Jun 2026
