MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

arXiv cs.AI

arXiv:2604.22881v1 (announce type: cross)

Abstract: Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes G
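To make the general idea of a hierarchical cache concrete, here is a minimal sketch of a two-tier LRU cache for per-user state. It is not MTServe's design, which the truncated abstract does not describe. The class name `TwoTierCache`, its methods, the capacities, and the LRU policy are all assumptions chosen for illustration: a small "fast" tier stands in for GPU memory and a larger "slow" tier stands in for host memory.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier LRU cache (hypothetical; not the paper's system)."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # stands in for GPU memory
        self.slow = OrderedDict()  # stands in for host memory

    def put(self, user_id, state):
        # Insert or refresh in the fast tier and drop any stale slow-tier copy.
        self.slow.pop(user_id, None)
        self.fast[user_id] = state
        self.fast.move_to_end(user_id)
        self._spill()

    def get(self, user_id):
        # Fast-tier hit: mark as most recently used.
        if user_id in self.fast:
            self.fast.move_to_end(user_id)
            return self.fast[user_id]
        # Slow-tier hit: promote back into the fast tier.
        if user_id in self.slow:
            state = self.slow.pop(user_id)
            self.put(user_id, state)
            return state
        # Miss: the caller would have to re-encode the user history.
        return None

    def _spill(self):
        # Demote least recently used entries from fast to slow,
        # then evict least recently used entries from slow.
        while len(self.fast) > self.fast_capacity:
            victim, state = self.fast.popitem(last=False)
            self.slow[victim] = state
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)


cache = TwoTierCache(fast_capacity=2, slow_capacity=2)
for user in ("a", "b", "c"):
    cache.put(user, f"kv-state-{user}")
print(list(cache.fast), list(cache.slow))  # ['b', 'c'] ['a']
```

In this sketch a miss returns `None`, which corresponds to the expensive path the abstract mentions: re-encoding the full user history. A real serving system would also have to account for transfer cost between tiers, which this toy ignores.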

Published 28 Apr 2026
