FlashSampling: Fast and Memory-Efficient Exact Sampling

📰 ArXiv cs.AI

arXiv:2603.15854v2 (announce type: replace-cross)

Abstract: Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per v
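The idea in the abstract rests on the Gumbel-max trick: adding independent Gumbel(0, 1) noise to each logit and taking the argmax yields an exact sample from the softmax distribution. The sketch below is a minimal pure-Python illustration of that idea, not the paper's fused kernel. The function names (`gumbel`, `sample_tiled`, `sample_reference`), the tile size, and the way per-tile candidates are combined with a final max are all assumptions made for illustration. It only shows that walking the vocabulary in tiles and keeping one candidate per tile picks the same token as building the full logits list first, given the same noise.

```python
import math
import random


def gumbel(rng):
    # Standard Gumbel(0, 1) noise via the inverse CDF; clamp u away from 0.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def logit(row, h):
    # One entry of the LM-head product W @ h.
    return sum(w * x for w, x in zip(row, h))


def sample_tiled(h, W, tile, rng):
    """Gumbel-max sample that never stores the full logits list.

    Hypothetical sketch: each vocabulary tile contributes a single
    (score, token) candidate, and the candidates are reduced with a max.
    """
    candidates = []
    for start in range(0, len(W), tile):
        best_score, best_token = -math.inf, -1
        for token in range(start, min(start + tile, len(W))):
            score = logit(W[token], h) + gumbel(rng)
            if score > best_score:
                best_score, best_token = score, token
        candidates.append((best_score, best_token))  # one maximizer per tile
    # Final reduction across tiles; ties resolve to the earliest tile.
    best_score, best_token = -math.inf, -1
    for score, token in candidates:
        if score > best_score:
            best_score, best_token = score, token
    return best_token


def sample_reference(h, W, rng):
    """Baseline: materialize all logits, then add noise and take the argmax."""
    logits = [logit(row, h) for row in W]
    scores = [value + gumbel(rng) for value in logits]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    setup = random.Random(0)
    h = [setup.gauss(0, 1) for _ in range(4)]                        # hidden state
    W = [[setup.gauss(0, 1) for _ in range(4)] for _ in range(10)]   # toy LM head
    # Same seed means the same noise sequence, so both paths pick the same token.
    print(sample_tiled(h, W, 3, random.Random(1)),
          sample_reference(h, W, random.Random(1)))
```

Because both functions draw noise in the same token order, seeding them identically makes the comparison deterministic. A real kernel would generate noise on chip per tile and run the reduction in parallel, which this sequential sketch does not attempt to model.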

Published 14 May 2026