Speech LLMs: Models that listen and talk back

Efficient NLP · Beginner · 🧠 Large Language Models · 1y ago
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

Speech LLMs (or speech foundation models) combine the reasoning and knowledge capabilities of large language models (LLMs) with the ability to process speech and audio input and output natively. Unlike traditional cascade models, which convert speech to text and back, these end-to-end models handle speech directly. Learn about the components of these systems, including the speech encoder, LLM, and vocoder, and the most popular models for each stage. We'll also explore how these components wor…

Chapters (10)

0:00 Intro
0:39 Limitations of Cascading Models
1:57 Components of a Speech LLM
3:08 Speech Encoder
4:41 Large Language Model (LLM)
6:21 Length Adaptation
7:59 Vocoder Model
9:09 LLaMA-Omni Case Study
10:14 Training LLaMA-Omni
11:06 Google Gemini Models