PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

📰 ArXiv cs.AI

arXiv:2606.19534v1 Announce Type: cross

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational base

Published 19 Jun 2026