Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

📰 ArXiv cs.AI

arXiv:2604.27747v1 Announce Type: cross Abstract: Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is re…
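The acceptance rule the abstract describes can be sketched compactly. The snippet below is a minimal, illustrative sketch of the greedy-verification variant of speculative decoding: the target model checks the draft's proposed tokens in order and accepts the longest matching prefix, correcting the first mismatch. All function and variable names here are hypothetical, not from the paper.

```python
def accept_longest_prefix(draft_tokens, target_choices):
    """Return the accepted prefix of the draft plus the target's
    correction token at the first mismatch (if any).

    draft_tokens:   tokens proposed by the small draft model
    target_choices: the target LLM's own next-token picks at the
                    same positions (greedy verification)
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_choices):
        if drafted == verified:
            accepted.append(drafted)   # verified: one decode step skipped
        else:
            accepted.append(verified)  # target overrides; stop verifying
            break
    return accepted

# Draft proposes 4 tokens; target agrees on the first two, so the
# round yields those two plus the target's correction at position 3.
print(accept_longest_prefix([5, 9, 3, 7], [5, 9, 4, 7]))  # → [5, 9, 4]
```

One round thus emits up to `len(draft_tokens) + 1` tokens while the target runs only a single verification pass, which is the source of the speedup; sampling-based variants generalize the same idea while preserving the target distribution exactly.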

Published 1 May 2026