MOSS-Audio Technical Report

📰 ArXiv cs.AI

arXiv:2606.01802v1 (cross-listed). Abstract: MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates au
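
To make the 12.5 Hz figure concrete, the following is a minimal sketch of the arithmetic it implies: one encoder frame covers 1 / 12.5 = 0.08 s of audio. Only the frame rate comes from the abstract. The function names, the choice to round a trailing partial frame up, and the mapping from time to frame index are illustrative assumptions, not details of the MOSS-Audio implementation.

```python
import math

# Frame rate stated in the abstract; everything else here is an assumption.
FRAME_RATE_HZ = 12.5


def num_frames(duration_s: float) -> int:
    """Number of encoder frames for a clip of the given duration.

    Assumption: a trailing partial frame is counted (ceil); the real
    encoder may pad or truncate differently.
    """
    return math.ceil(duration_s * FRAME_RATE_HZ)


def frame_start_time(index: int) -> float:
    """Start time in seconds of the frame at a zero-based index."""
    return index / FRAME_RATE_HZ


def frame_index_at(time_s: float) -> int:
    """Zero-based index of the frame that contains the given time."""
    return int(time_s * FRAME_RATE_HZ)


if __name__ == "__main__":
    print(num_frames(30.0))        # frames for a 30 s clip
    print(frame_start_time(1))     # start time of the second frame
    print(frame_index_at(1.0))     # frame containing t = 1.0 s
```

Under these assumptions a 30-second clip maps to 375 frames, so the sequence handed to the adapter is far shorter than the raw waveform, and each frame index corresponds to a fixed 80 ms window that time-aware outputs could be aligned against.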

Published 2 Jun 2026