The non-autoregressive decoder won CPU neural TTS: benchmarks across Piper, MeloTTS, Kokoro, Parler-TTS, XTTSv2
📰 Reddit r/deeplearning
Ran a comparison of five contemporary neural TTS models on CPU only (8 cores, no GPU), using identical test phrases and measuring real-time factor (RTF = synthesis_time / audio_duration). What the numbers look like:

- Piper Low (5.8 MB, VITS/ONNX) — RTF ~0.0007 (1409x real-time)
- Piper Medium (62 MB, VITS/ONNX) — RTF ~0.0004 (2483x)
- Piper High (110 MB, VITS/ONNX) — RTF ~0.00013 (7603x)
- MeloTTS (162 MB, VITS + BERT embeddin
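The RTF metric above is simple to compute yourself. A minimal sketch (the timing values are hypothetical, not from the post): divide wall-clock synthesis time by the duration of the audio produced; the reciprocal gives the "Nx real-time" multiple quoted in the results.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis_time / audio_duration.

    RTF < 1 means faster than real time; 1 / RTF is the
    "Nx real-time" multiple quoted in TTS benchmarks.
    """
    return synthesis_seconds / audio_seconds

# Hypothetical example: 10 s of audio synthesized in 7 ms.
rtf = real_time_factor(0.007, 10.0)
print(f"RTF = {rtf:.4f}, {1 / rtf:.0f}x real-time")
```

Note that the multiple is just the inverse of RTF, so small rounding in the reported RTF explains why e.g. ~0.0007 appears alongside 1409x rather than exactly 1/0.0007.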