Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
📰 ArXiv cs.AI
arXiv:2601.07667v2 Announce Type: replace-cross Abstract: Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received considerable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected…
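To make the pre-defined-layer scheme concrete, here is a minimal sketch (not the paper's method): at a fixed set of pruning layers, the cached keys and values are cut down to the tokens that have received the most attention mass, and all other tokens are dropped from the cache. The function name `prune_kv_at_layer`, the `keep_ratio` parameter, the `prune_layers` set, and the attention-mass scoring rule are all illustrative assumptions.

```python
# Hypothetical sketch of layer-wise KV-cache token pruning (not the paper's algorithm).
import torch


def prune_kv_at_layer(keys, values, attn_weights, keep_ratio=0.5):
    """Keep only the most-attended tokens in the KV cache at one layer.

    keys, values : (batch, heads, seq, head_dim)
    attn_weights : (batch, heads, q_len, seq) attention probabilities
    keep_ratio   : fraction of cached tokens to retain (assumed hyperparameter)
    """
    batch, heads, seq, head_dim = keys.shape
    # Score each cached token by the attention mass it receives,
    # summed over heads and query positions.
    scores = attn_weights.sum(dim=(1, 2))                       # (batch, seq)
    k = max(1, int(seq * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original order
    idx = keep_idx[:, None, :, None].expand(batch, heads, k, head_dim)
    # Gather the retained tokens; everything else is pruned from the cache.
    return keys.gather(2, idx), values.gather(2, idx)


if __name__ == "__main__":
    torch.manual_seed(0)
    B, H, S, D = 1, 4, 32, 64
    keys = torch.randn(B, H, S, D)
    values = torch.randn(B, H, S, D)
    attn = torch.softmax(torch.randn(B, H, S, S), dim=-1)

    # Assumed set of pre-defined pruning layers; pruning is applied only there.
    prune_layers = {8, 16, 24}
    layer_idx = 8
    if layer_idx in prune_layers:
        keys, values = prune_kv_at_layer(keys, values, attn, keep_ratio=0.25)
    print(keys.shape)  # torch.Size([1, 4, 8, 64])
```

In this sketch the layer set is fixed in advance, which is exactly the design choice the abstract points to; an adaptive variant would instead decide per input or per layer whether pruning should occur.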