Speculative decoding question, 665% speed increase

📰 Reddit r/LocalLLaMA

I'm using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48

Say the prompt asks for "minor changes in code". What is the real reason the speedup differs so much between models?

Gemma 4 31b: token generation speed (t/s) doubles, so a 100% increase
Qwen 3.6: only 40% more speed
Devstral Small: 665% increase in speed (what?)

EDIT: I added --repeat-penalty 1.0 and switched to --spec-type ngram-mod for Qwen 3.6, now sp
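For anyone unsure what n-gram drafting does in general, here is a minimal sketch. It is a generic lookup drafter written for illustration, not llama.cpp's ngram-map-k or ngram-mod implementation, and the function names `ngram_draft` and `accepted_prefix` are hypothetical. The idea: take the last n tokens, find where they last appeared earlier in the context, and propose whatever followed as the draft. The model then verifies the draft, and only the matching prefix is kept.

```python
def ngram_draft(tokens, n, max_draft):
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last n tokens (hypothetical helper)."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


def accepted_prefix(draft, target):
    """Count leading draft tokens that match what the model actually emits."""
    count = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        count += 1
    return count


# Toy example with words standing in for tokens.
context = "a b c d e a b".split()
draft = ngram_draft(context, n=2, max_draft=3)
kept = accepted_prefix(draft, ["c", "d", "x"])
```

Under this sketch, the speedup depends on how many drafted tokens survive verification. A model that reproduces the existing code almost verbatim when asked for minor changes would accept long drafts, while one that rewrites or reformats more would accept fewer. That may be part of the difference between models, though it is an assumption and not something measured here.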

Published 19 Apr 2026