Your GPU Is Probably Idle

📰 Hackernoon

A GPU holding memory isn't the same as a GPU doing work: an H100 can sit at 0% utilization with 20 GiB allocated. Most idle time comes from everything around the card, not the card itself. The fixes follow from that: keep the input pipeline fast enough to feed the GPU, hand it big tensor-friendly shapes, fuse small kernels with torch.compile, use BF16 or FP8, treat LLM serving as a scheduling problem, scale to more GPUs only after one is healthy, and judge it all by real throughput rather than the utilization counter.
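One way to act on "judge it by real throughput" is to time the loop itself and split each iteration into time spent waiting for the next batch and time spent in the step. The sketch below is a minimal, standard-library-only illustration, not something taken from the article: `LoopStats` and `profile_loop` are hypothetical names, and it assumes `step` blocks until its work is finished. On a real GPU, kernel launches are asynchronous, so the step would need to synchronize (for example with `torch.cuda.synchronize()`) before the timer is read.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Sized


@dataclass
class LoopStats:
    """Hypothetical container for the two costs of a training/inference loop."""
    samples: int = 0
    wait_s: float = 0.0   # time blocked fetching the next batch (input pipeline)
    step_s: float = 0.0   # time inside the step function (the actual work)

    @property
    def samples_per_s(self) -> float:
        total = self.wait_s + self.step_s
        return self.samples / total if total > 0 else 0.0

    @property
    def wait_fraction(self) -> float:
        total = self.wait_s + self.step_s
        return self.wait_s / total if total > 0 else 0.0


def profile_loop(
    batches: Iterable[Sized],
    step: Callable[[Sized], object],
    clock: Callable[[], float] = time.perf_counter,
) -> LoopStats:
    """Run step() over every batch and record where the wall-clock time went.

    Assumes step() returns only after its work is done. For GPU code that
    means synchronizing inside step(), otherwise step_s is under-counted.
    """
    stats = LoopStats()
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)          # anything slow here starves the device
        except StopIteration:
            break
        t1 = clock()
        step(batch)
        t2 = clock()
        stats.samples += len(batch)
        stats.wait_s += t1 - t0
        stats.step_s += t2 - t1
    return stats
```

A high `wait_fraction` suggests the input pipeline, not the card, is the bottleneck, and `samples_per_s` is the number to compare before and after any change. The `clock` parameter exists only so the measurement can be checked with a fake timer.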

Published 9 Jun 2026
