Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

📰 ArXiv cs.AI

arXiv:2510.12834v3 Announce Type: replace-cross

Abstract: Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports
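The abstract's core idea, merging two discrete token streams into one sequence so a single autoregressive model predicts both modalities, can be sketched as follows. This is a minimal illustration of interleaving, not the paper's actual scheme; the codebook offsets and the speech-to-gesture frame ratio are assumptions for the example.

```python
# Hypothetical sketch of interleaved token sequences: speech and gesture
# tokens from separate (assumed) codebooks are merged into one stream,
# with gesture ids shifted past the speech id range so they stay distinct.

SPEECH_OFFSET = 0        # assumed: speech codebook occupies ids [0, 1024)
GESTURE_OFFSET = 1024    # assumed: gesture codebook shifted past speech ids

def interleave(speech_tokens, gesture_tokens, ratio=2):
    """Merge token streams: `ratio` speech tokens per gesture token.

    The ratio stands in for differing frame rates between modalities
    (an assumed value, not taken from the paper).
    """
    out = []
    gestures = iter(gesture_tokens)
    for i, s in enumerate(speech_tokens):
        out.append(SPEECH_OFFSET + s)
        # After every `ratio` speech tokens, slot in one gesture token.
        if (i + 1) % ratio == 0:
            nxt = next(gestures, None)
            if nxt is not None:
                out.append(GESTURE_OFFSET + nxt)
    return out
```

A single backbone trained on such sequences sees both modalities in one context window, which is what lets it model speech-gesture synchrony directly; modality-specific decoders would then map each id range back to audio or motion.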

Published 30 Mar 2026