Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

📰 ArXiv cs.AI

arXiv:2604.09595v1 Announce Type: cross Abstract: Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance -- a phenomenon we call *dimensional misalignment*. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly to the GPU execution stack. For example, compressing …
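The mechanism the abstract hints at can be sketched with a small example. GPU GEMM kernels are typically tiled in fixed multiples (e.g., 64 or 128 elements), so a compressed weight dimension that falls just short of a tile boundary still pays for a nearly empty tile. The helper below, a hypothetical illustration and not the paper's method, rounds a dimension up to the next tile-friendly multiple and estimates the padding overhead (the tile size of 64 is an assumption for illustration):

```python
def pad_to_multiple(dim: int, tile: int = 64) -> int:
    """Round `dim` up to the next multiple of `tile` (ceiling division)."""
    return -(-dim // tile) * tile

def wasted_fraction(dim: int, tile: int = 64) -> float:
    """Fraction of the padded computation spent on padding, not real data."""
    padded = pad_to_multiple(dim, tile)
    return (padded - dim) / padded

# A pruned hidden size like 4095 sits one element short of a tile boundary,
# so the kernel effectively computes at 4096 anyway:
print(pad_to_multiple(4095))            # -> 4096
print(round(wasted_fraction(4033), 4))  # ~1.5% of the tiled compute is padding
```

This also suggests why the slowdown is non-monotonic in parameter count: a slightly *larger* dimension that lands exactly on a tile boundary can run faster than a smaller misaligned one.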

Published 14 Apr 2026