LLM Optimization: KV Cache, Flash Attention, MQA, GQA | Hugging Face Explained
In this video, we explore advanced optimization techniques used in modern Transformer and LLM models to improve speed, reduce memory usage, and make large-scale models practical for real-world applications.
Here is the GitHub repo link:
https://github.com/switch2ai
You can download all the code, scripts, and documents from the above GitHub repository.
We start with one of the most important concepts used during inference.
KV Cache (Key-Value Cache)
During text generation, models produce output token by token. At every step, attention requires the key and value vectors of all previous tokens, so without caching they would be recomputed from scratch each step. …
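The idea above can be sketched in a few lines: instead of recomputing keys and values for the whole sequence at every decoding step, we append one new key/value row to a cache and attend over the cache. This is a minimal NumPy illustration, not how any particular framework implements it; the `attend` helper and the random vectors standing in for projected hidden states are assumptions for the sketch.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector (illustrative helper).
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8  # head dimension (toy size for the sketch)

# The KV cache: grows by one row of K and one row of V per generated token,
# so each step does O(seq_len) attention instead of recomputing all projections.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):
    # In a real model, k_new, v_new, q come from projecting the new token's
    # hidden state; random vectors stand in for them here.
    k_new, v_new, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    out = attend(q, K_cache, V_cache)  # attends over all cached tokens

print(K_cache.shape)  # one cached K row per generated token
```

The memory cost of this cache (layers × heads × sequence length × head dimension, for both K and V) is exactly what techniques like MQA and GQA, covered later in the video, are designed to shrink.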
DeepCamp AI