RAP: Runtime Adaptive Pruning for LLM Inference

📰 ArXiv cs.AI

arXiv:2505.17138v5

Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose

Published 19 May 2026
