Efficient Memory Management for LLM Serving

West Coast Machine Learning · Advanced · ML Fundamentals · 8mo ago

Skills: LLM Engineering (80%)

Key Takeaways

A paper discussion on memory management for LLM serving, covering memory usage, cache management, paged attention, sampling, beam search, and kernel optimization.

Original Description

In this meetup, Neha led our discussion of the paper Efficient Memory Management for LLM Serving.

Our Meetup: https://www.meetup.com/East-Bay-Tri-Valley-Machine-Learning-Meetup/

Content

00:00 Intro
09:48 Memory usage
21:50 Cache mgmt
32:11 Challenges
36:00 Paged attention
47:46 Sampling
49:24 Beam search
53:00 Memory mgmt.
58:00 Kernel opt

About Us

West Coast Machine Learning is a channel dedicated to exploring the exciting world of machine learning and AI. Our group of techies is passionate about AI, deep learning, neural networks, computer vision, tiny ML, and other machine learning topics. We love to dive deep into the technical details and stay up to date with the latest research developments.

Our Meetup group and YouTube channel are the perfect place to connect with like-minded people who share your love of machine learning. We offer a mix of research paper discussions, coding reviews, and other data science topics. If you want to keep up with the latest developments in machine learning, connect with other techies, and learn something new, subscribe to our channel and join our Meetup community.

Meetup: https://www.meetup.com/east-bay-tri-valley-machine-learning-meetup/

#llms #llm-memory-mgmt #llm-memory-usage #llm-serving

Chapters (9)

0:00 Intro

9:48 Memory usage

21:50 Cache mgmt

32:11 Challenges

36:00 Paged attention

47:46 Sampling

49:24 Beam search

53:00 Memory mgmt.

58:00 Kernel opt