Your GPU Is Probably Idle
📰 Hackernoon
A GPU holding memory isn't the same as a GPU doing work (an H100 can sit at 0% utilization with 20 GiB allocated), and most idle time comes from everything around the card (data loading, kernel launches, scheduling), not the card itself. So keep it fed from the input pipeline, hand it big tensor-friendly shapes, fuse small kernels with torch.compile, use BF16 or FP8, treat LLM serving as a scheduling problem, scale to more GPUs only after one is healthy, and judge the result by real throughput rather than by the utilization counter.
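To make "judge it by real throughput" concrete, here is a minimal sketch of a timing harness that splits wall-clock time into waiting for the next batch and running the step. It is an illustration, not the article's own code: `profile_loop`, `slow_loader`, and `fast_step` are hypothetical names, and the sleeps stand in for a real data loader and a real GPU step. It runs on CPU with the standard library only.

```python
import time
from typing import Any, Callable, Dict, Iterable, Iterator


def profile_loop(batches: Iterable[Any],
                 step: Callable[[Any], None],
                 batch_size: int) -> Dict[str, float]:
    """Time a loop, separating 'waiting for data' from 'running the step'."""
    wait_s = 0.0   # time blocked on the input pipeline
    step_s = 0.0   # time spent inside step()
    n_batches = 0
    it = iter(batches)
    start = time.perf_counter()
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        # Assumption: step() blocks until the work is finished. CUDA kernels
        # launch asynchronously, so a real PyTorch step would need to call
        # torch.cuda.synchronize() before returning for this timing to hold.
        step(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        n_batches += 1
    total_s = time.perf_counter() - start
    samples = n_batches * batch_size
    return {
        "batches": n_batches,
        "samples": samples,
        "samples_per_s": samples / total_s if total_s > 0 else 0.0,
        "wait_fraction": wait_s / total_s if total_s > 0 else 0.0,
        "step_fraction": step_s / total_s if total_s > 0 else 0.0,
    }


def slow_loader(n: int, delay_s: float) -> Iterator[int]:
    """Hypothetical stand-in for a starved input pipeline."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


def fast_step(batch: int, delay_s: float) -> None:
    """Hypothetical stand-in for a quick forward/backward pass."""
    time.sleep(delay_s)


if __name__ == "__main__":
    stats = profile_loop(slow_loader(5, 0.05),
                         lambda b: fast_step(b, 0.005),
                         batch_size=32)
    print(f"{stats['samples_per_s']:.0f} samples/s, "
          f"{stats['wait_fraction']:.0%} of time waiting on data")
```

A high wait fraction here suggests the bottleneck sits in the input pipeline rather than on the card, which is the case the summary says is most common. Samples per second is the number to track across changes.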