Improved Large Language Diffusion Models

📰 ArXiv cs.AI

arXiv:2606.25331v1 Announce Type: cross Abstract: Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use v

Published 25 Jun 2026
