Show HN: Speeding up LLM inference 2x (possibly)
Here's a project I've been working on for the last few months. It's a new (I think) algorithm that lets you adjust smoothly, in real time, how many calculations you do during LLM inference. It seems possible to do just 20-25% of the weight multiplications instead of all of them and still get good inference results. I implemented it to run on Apple M1/M2/M3 GPUs. The matmul approximation itself can be pushed to run 2x faster before the output quality collapses. The overall inference speed is only a bit faster than Llama.cpp's, because the rest of my implementation could be better, but with further development I think it can become a new method to speed up inference, in addition to quantization. You could call it ad-hoc model distillation :) You can change a model's speed / accuracy trade-off at will, in real time. Oh, and as a side effect, the data format also lets you choose how much of the model you want to load into memory.
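To make the idea concrete, here's a tiny NumPy sketch of one possible way a "partial matmul" like this could look. It is purely illustrative and not the project's actual algorithm or data format: it keeps only the top-k largest-magnitude input activations, so only a chosen fraction of the multiply-adds run, and that fraction can be changed on every call.

    import numpy as np

    def approx_matvec(W, x, fraction=0.25):
        # Toy approximation of W @ x: use only the largest-magnitude entries
        # of x, so roughly `fraction` of the multiply-adds are performed.
        # `fraction` can be adjusted per call to trade accuracy for speed.
        k = max(1, int(fraction * x.shape[0]))
        idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the top-k activations
        return W[:, idx] @ x[idx]                  # partial matvec over those columns

    # Quick look at the accuracy/speed trade-off on random data
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4096, 4096)).astype(np.float32)
    x = rng.standard_normal(4096).astype(np.float32)

    exact = W @ x
    for f in (1.0, 0.5, 0.25):
        approx = approx_matvec(W, x, f)
        err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
        print(f"fraction={f:.2f}  relative error={err:.3f}")

On random Gaussian data the error grows quickly as the fraction drops; in a real LLM the activations are far from random, which is presumably what makes this kind of pruning viable.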
DeepCamp AI