Inside the Matrix: How does matrix multiplication work inside GPUs?

DeepLearning Hero · Beginner · 🧠 Large Language Models · 3y ago

Key Takeaways

This video teaches how matrix multiplication works inside GPUs, the core computation powering deep neural networks and large language models.

Original Description

In this video, we dive into the mechanics of a GPU and learn how it performs matrix multiplication, the core computation powering deep neural networks and large language models. By the end of the video, you'll have learned an efficient formulation of matrix multiplication and how to compute matrix multiplication with tiling and kernel fusion.

GEMM basics: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
CUDA linear algebra: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
A100 specifications: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

00:00 - Introduction
02:40 - GEMM basics
03:24 - Naive implementation of matmul
04:19 - GPU memory hierarchy
05:34 - Memory thrashing of GPUs
06:00 - Memory efficient implementation of matmul
06:33 - Matmul with tiling
08:17 - GPU execution hierarchy
09:25 - Magic of power of 2
10:15 - Tile quantization
11:14 - Kernel fusion
12:24 - Conclusion
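
The naive and tiled formulations named in the description can be illustrated with a small CPU-side sketch. This is not the video's code and not a GPU kernel: the function names (`matmul_naive`, `matmul_tiled`, `tiles_needed`) and the tile sizes are hypothetical, chosen only for illustration. On a real GPU, each output tile would typically be assigned to a thread block and the input tiles staged in fast on-chip memory; here the loops only mimic that access pattern.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, computed tile by tile (hypothetical tile size).

    The three outer loops walk over tiles of the output and of the shared
    dimension; the inner loops touch only one small tile of A and B at a
    time, which is the data a GPU would keep in fast memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # min(...) handles edge tiles when a dimension is not
                # a multiple of the tile size.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def tiles_needed(m, n, tile_m, tile_n):
    """Number of output tiles covering an m x n result (ceiling division)."""
    return ((m + tile_m - 1) // tile_m) * ((n + tile_n - 1) // tile_n)


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[1, 0], [0, 1], [2, 2]]
    print(matmul_tiled(a, b) == matmul_naive(a, b))
    # Tile quantization: one extra row forces a whole extra row of tiles.
    print(tiles_needed(256, 256, 128, 128), tiles_needed(257, 256, 128, 128))
```

The `tiles_needed` helper hints at why tile quantization matters: growing the output from 256 to 257 rows with an assumed 128 x 128 tile raises the tile count from 4 to 6, even though almost all of the extra tiles are padding.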
