Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
📰 ArXiv cs.AI
arXiv:2604.09595v1 Announce Type: cross Abstract: Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance -- a phenomenon we call \emph{dimensional misalignment}. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that inference becomes slower because the resulting dimensions are unfriendly to the GPU execution stack. For example, compressing
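The abstract's core claim -- that irregular post-compression dimensions are GPU-unfriendly -- can be illustrated with a common mitigation: zero-padding tensor dimensions up to a hardware-friendly multiple (e.g. a multiple of 64, which satisfies typical tensor-core and memory-coalescing granularities). This is a minimal NumPy sketch with hypothetical sizes; the function name, the alignment value, and the example shapes are illustrative assumptions, not from the paper.

```python
import numpy as np

def pad_to_multiple(w, multiple=64):
    """Zero-pad a 2-D weight matrix so both dims are multiples of `multiple`.

    Hypothetical helper: pruning might leave an irregular hidden size
    (e.g. 11001); padding restores an alignment GPU kernels handle well.
    """
    rows, cols = w.shape
    pad_rows = (-rows) % multiple  # rows needed to reach next multiple
    pad_cols = (-cols) % multiple  # cols needed to reach next multiple
    return np.pad(w, ((0, pad_rows), (0, pad_cols)))

# Example: a pruned projection with an irregular column count.
w = np.ones((4000, 11001), dtype=np.float16)
padded = pad_to_multiple(w)
# padded.shape -> (4032, 11008): both dims now multiples of 64
```

The padding trades a small amount of extra memory and FLOPs for dimensions the GPU execution stack (kernel tiling, tensor cores) can process efficiently.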