WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer

T. Okamoto, H. Yamashita, Y. Ohtani, T. Toda and H. Kawai, "WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer," in Proc. ASRU, Dec. 2023. [IEEE Xplore] [Preprint (PDF)]

Abstract

A recently proposed neural vocoder, Vocos, can perform inference ten times faster than HiFi-GAN because of its use of ConvNeXt layers that can predict high-resolution short-time Fourier transform (STFT) spectra and an inverse STFT layer. To improve synthesis quality while preserving inference speed, this paper proposes an alternative ConvNeXt-based fast neural vocoder, WaveNeXt, in which the inverse STFT layer in Vocos is replaced with a trainable linear layer that can directly predict speech waveform samples without STFT spectra. Additionally, by integrating the JETS-based end-to-end text-to-speech (E2E TTS) framework, E2E TTS models can also be constructed with Vocos and WaveNeXt. Furthermore, full-band models with a sampling frequency of 48 kHz were investigated. The results of experiments for both the analysis-synthesis and E2E TTS conditions demonstrate that the proposed WaveNeXt can achieve higher quality synthesis than Vocos while preserving its inference speed.
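
The central idea in the abstract, replacing the inverse STFT layer with a trainable linear layer that predicts waveform samples directly, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single bias-free projection that maps each frame's feature vector to `hop_size` consecutive samples, and the names `linear_waveform_head`, `frames`, `weight`, and all sizes are hypothetical. Plain Python lists are used to keep the example dependency-free; in a PyTorch model this role would typically be played by a linear layer followed by a reshape.

```python
import random


def linear_waveform_head(frames, weight):
    """Project per-frame features straight to waveform samples.

    frames: list of T feature vectors, each of length C
            (stand-in for the output of the ConvNeXt stack).
    weight: hop_size x C matrix given as a list of rows
            (stand-in for a trained, bias-free linear layer; an assumption).
    Returns a flat list of T * hop_size samples: each frame contributes
    hop_size consecutive samples, so no STFT spectrum or overlap-add is used.
    """
    waveform = []
    for feat in frames:
        for row in weight:
            # One output sample is the dot product of a weight row and the frame.
            waveform.append(sum(w * x for w, x in zip(row, feat)))
    return waveform


# Hypothetical toy sizes chosen only for illustration.
num_frames, channels, hop_size = 4, 8, 6
rng = random.Random(0)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(channels)]
          for _ in range(num_frames)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(channels)]
          for _ in range(hop_size)]

waveform = linear_waveform_head(frames, weight)
print(len(waveform))  # num_frames * hop_size samples
```

Under this assumption the output length is always the number of frames times the hop size, which is why such a head can stand in for the inverse STFT at the end of the network without changing the frame rate of the layers before it.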

The PyTorch source code used in the experiments, based on ESPnet2-TTS, is available here.



Demo samples

Analysis-synthesis condition (female with fs=24 kHz)
Ground truth
HiFi-GAN V1, HiFi-GAN V2, MS-iSTFT-HiFi-GAN, MS-FC-HiFi-GAN
Vocos, WaveNeXt (proposed)

Analysis-synthesis condition (male with fs=24 kHz)
Ground truth
HiFi-GAN V1, HiFi-GAN V2, MS-iSTFT-HiFi-GAN, MS-FC-HiFi-GAN
Vocos, WaveNeXt (proposed)

End-to-end text-to-speech condition (female with fs=24 kHz)
Ground truth
HiFi-GAN V1, MS-iSTFT-HiFi-GAN, MS-FC-HiFi-GAN
Vocos, WaveNeXt (proposed)

End-to-end text-to-speech condition (male with fs=24 kHz)
Ground truth
HiFi-GAN V1, MS-iSTFT-HiFi-GAN, MS-FC-HiFi-GAN
Vocos, WaveNeXt (proposed)

Full-band end-to-end text-to-speech condition (female with fs=48 kHz)
Ground truth
HiFi-GAN V1, MS-iSTFT-HiFi-GAN, MS-FC-HiFi-GAN
Vocos, WaveNeXt (proposed)