Inference Characteristics of Streaming Speech Recognition

Efficient NLP · Beginner · 🧠 Large Language Models · 7mo ago
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

Deploying streaming speech recognition (ASR) is quite different from serving an LLM. In this video, I break down the unique challenges, architecture, and surprising behaviors of Kyutai's Moshi streaming ASR model and its Rust inference server: GPU/CPU memory patterns and the economics of streaming vs. batch ASR.

Kyutai STT offline in browser: https://github.com/lucky-bai/wasm-speech-streaming

Chapters (7)

0:00 Introduction
1:50 Kyutai Moshi architecture
4:03 Short quiz
4:45 CPU and GPU memory experiments
6:08 Streaming inference server
8:56 Cost calculation
10:10 Differences vs LLM inference
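The cost-calculation chapter compares the economics of streaming vs. batch ASR. The key asymmetry is that a streaming server is pinned to real time, so its throughput is measured in concurrent streams, while a batch job can run many times faster than real time. A back-of-envelope sketch of that comparison, with all numbers (GPU price, stream count, real-time factor) being hypothetical placeholders rather than figures from the video:

```python
# Back-of-envelope cost model for serving ASR on a GPU.
# All numeric inputs below are illustrative assumptions, not measured values.

def streaming_cost_per_audio_hour(gpu_dollars_per_hour: float,
                                  concurrent_streams: int) -> float:
    """Streaming ASR runs at real time, so one GPU-hour transcribes
    `concurrent_streams` hours of audio."""
    return gpu_dollars_per_hour / concurrent_streams

def batch_cost_per_audio_hour(gpu_dollars_per_hour: float,
                              realtime_factor: float) -> float:
    """Batch ASR can run faster than real time; a 20x real-time factor
    means 20 hours of audio transcribed per GPU-hour."""
    return gpu_dollars_per_hour / realtime_factor

# Hypothetical example: a $2/hr GPU serving 64 simultaneous live streams,
# vs. the same GPU running batch jobs at 20x real time.
streaming = streaming_cost_per_audio_hour(2.0, 64)   # $0.03125 per audio hour
batch = batch_cost_per_audio_hour(2.0, 20.0)         # $0.10 per audio hour

print(f"streaming: ${streaming:.4f}/audio-hr  batch: ${batch:.4f}/audio-hr")
```

Under these made-up numbers streaming looks cheaper per audio hour, but only while all 64 stream slots stay occupied; idle slots still burn GPU time, which is one of the economic differences from batch serving the video discusses.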