MOSS-TTS Technical Report

📰 ArXiv cs.AI

MOSS-TTS is a speech generation foundation model built on discrete audio tokens, autoregressive modeling, and large-scale pretraining.

Advanced | Published 23 Mar 2026

Action Steps

Tokenize audio with MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps
Generate speech by autoregressive modeling over the resulting discrete audio tokens
Use large-scale pretraining to improve performance
Explore applications of MOSS-TTS in speech synthesis and audio processing
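
As a rough illustration of the autoregressive step above, the sketch below samples discrete tokens one at a time, each conditioned on the tokens generated so far. The names `toy_next_token_probs` and `generate`, the vocabulary size, and the probability rule are hypothetical stand-ins; this summary does not describe the actual MOSS-TTS model or its interface.

```python
import random


def toy_next_token_probs(context, vocab_size):
    """Hypothetical stand-in for a trained model.

    Returns a probability distribution over the next token that favors
    (last_token + 1) % vocab_size. A real system would run a neural
    network over the context here.
    """
    probs = [1.0] * vocab_size
    if context:
        probs[(context[-1] + 1) % vocab_size] += vocab_size
    total = sum(probs)
    return [p / total for p in probs]


def generate(prompt_tokens, num_new_tokens, vocab_size, seed=0):
    """Autoregressive sampling: append one sampled token per step."""
    rng = random.Random(seed)  # seeded so repeated runs give the same tokens
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        probs = toy_next_token_probs(tokens, vocab_size)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(generate([3, 4], num_new_tokens=5, vocab_size=16))
```

In a speech setting, the sampled tokens would be audio token indices that a tokenizer's decoder turns back into a waveform; that decoding stage is outside this sketch.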

Who Needs to Know This

AI engineers and researchers get a scalable recipe for speech generation, while product managers can evaluate where it fits in speech-enabled products.

Key Insight

💡 The MOSS-TTS recipe scales with three ingredients: discrete audio tokens, autoregressive modeling, and large-scale pretraining.

Key Takeaways

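MOSS-Audio-Tokenizer is a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps using variable-bitrate RVQ and unified semantic-acoustic representations
The report releases two complementary generators built on this tokenizer, one of which is MOSS-TTS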

Full Article

Title: MOSS-TTS Technical Report

arXiv:2603.18090v2

Abstract:
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simp
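
The tokenizer figures in the abstract imply some simple arithmetic, and residual vector quantization (RVQ) can be shown with a toy example. The sketch below uses only the 24 kHz sample rate and 12.5 fps frame rate from the abstract. The codebook count and size (8 and 1024), the tiny 2-D codebooks, and all function names are assumptions for illustration, not the report's configuration.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the abstract
FRAME_RATE_FPS = 12.5    # from the abstract


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_fps=FRAME_RATE_FPS):
    """Audio samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def num_frames(duration_s, frame_rate_fps=FRAME_RATE_FPS):
    """Token frames needed to cover a clip of the given duration."""
    return math.ceil(duration_s * frame_rate_fps)


def bitrate_bps(num_codebooks, codebook_size, frame_rate_fps=FRAME_RATE_FPS):
    """Bits per second when each frame carries one index per codebook."""
    return frame_rate_fps * num_codebooks * math.log2(codebook_size)


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize vector stage by stage; each stage encodes the leftover residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]

if __name__ == "__main__":
    print(samples_per_frame())       # 1920.0 samples per frame
    print(num_frames(10))            # 125 frames for a 10 s clip
    print(bitrate_bps(8, 1024))      # 1000.0 bps under the assumed setup
    idx = rvq_encode([1.2, 0.8], TOY_CODEBOOKS)
    print(idx, rvq_decode(idx, TOY_CODEBOOKS), rvq_decode(idx[:1], TOY_CODEBOOKS))
```

Dropping later stages at decode time, as in `rvq_decode(idx[:1], ...)`, is one common way an RVQ codec trades quality for bitrate; whether MOSS-Audio-Tokenizer implements variable bitrate this way is not stated in this excerpt.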
