FlashSampling: Fast and Memory-Efficient Exact Sampling
📰 ArXiv cs.AI
arXiv:2603.15854v2. Abstract: Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per v
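The tile-by-tile idea in the abstract can be illustrated with a small CPU sketch of the Gumbel-max trick. This is not the paper's fused GPU kernel: it is a NumPy approximation of the arithmetic only, and the names `gumbel_max_tiled`, `hidden`, `weight`, and `tile_size` are hypothetical. Because the abstract is cut off, the way per-tile maximizers are combined (a running maximum across vocabulary tiles) is an assumption here.

```python
import numpy as np


def gumbel_max_tiled(hidden, weight, tile_size, rng):
    """Draw one token id per row from softmax(hidden @ weight.T).

    Only one (batch, tile_size) slice of logits exists at a time; the full
    (batch, vocab) logits matrix is never built. Hypothetical helper.
    """
    batch = hidden.shape[0]
    vocab = weight.shape[0]
    best_score = np.full(batch, -np.inf)         # best perturbed logit so far
    best_index = np.zeros(batch, dtype=np.int64)  # its vocabulary index
    rows = np.arange(batch)

    for start in range(0, vocab, tile_size):
        stop = min(start + tile_size, vocab)
        # Logits for this vocabulary tile only.
        tile_logits = hidden @ weight[start:stop].T
        # Gumbel-max trick: argmax(logits + Gumbel(0, 1)) is an exact
        # sample from the softmax distribution over the logits.
        perturbed = tile_logits + rng.gumbel(size=tile_logits.shape)
        local_arg = perturbed.argmax(axis=1)
        local_max = perturbed[rows, local_arg]
        # Keep a single running maximizer per row across tiles.
        improved = local_max > best_score
        best_score = np.where(improved, local_max, best_score)
        best_index = np.where(improved, start + local_arg, best_index)

    return best_index


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(4, 8))    # 4 rows, hidden size 8 (assumed)
    weight = rng.normal(size=(50, 8))   # vocabulary of 50 (assumed)
    print(gumbel_max_tiled(hidden, weight, tile_size=16, rng=rng))
```

The maximum over the whole vocabulary equals the maximum of the per-tile maxima, which is why a running maximizer per row is enough and the sample stays exact rather than approximate.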