Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

📰 ArXiv cs.AI

Sommelier is a scalable, open, multi-turn audio pre-processing system for preparing training data for full-duplex speech language models.

Advanced · Published 30 Mar 2026

Action Steps

Developing full-duplex speech language models requires high-quality multi-speaker conversational data.
Existing large-scale resources are predominantly single-speaker or limited in volume.
Sommelier targets this gap with scalable, open, multi-turn audio pre-processing built to handle the complex dynamics of natural dialogue.
Practitioners can apply Sommelier to prepare multi-speaker conversational data for training their own speech language models.

Who Needs to Know This

AI engineers and researchers building full-duplex speech language models, which aim at real-time, natural human-computer interaction, can use Sommelier to pre-process their audio. Data scientists can use it to produce high-quality multi-speaker conversational data.

Key Insight

💡 Sommelier addresses the scarcity of high-quality multi-speaker conversational data, the main data bottleneck for full-duplex speech language models.