ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
📰 ArXiv cs.AI
arXiv:2507.21433v3. Abstract: Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant quality-of-service (QoS) challenge: the substantial memory overhead of their long, auto-regressive inference processes severely limits throughput and increases latency, degrading QoS for concurrent users. We observe