FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter


Yamato Ohtani1, Takuma Okamoto1, Tomoki Toda1,2, and Hisashi Kawai1
1 National Institute of Information and Communications Technology, Japan
2 Information Technology Center, Nagoya University, Japan



Abstract

Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.
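The filtering step described in the abstract, convolving an excitation signal with FIR coefficients that change over time, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `time_variant_fir`, the frame-wise coefficient layout, and the use of a single filter per frame are assumptions made for illustration, and the coefficients here are fixed numbers rather than network predictions.

```python
import numpy as np

def time_variant_fir(excitation, coeffs, hop):
    """Filter `excitation` with a different causal FIR filter in each frame.

    excitation: array of shape (num_frames * hop,)
    coeffs:     array of shape (num_frames, taps), one filter per frame
                (hypothetical layout; in FIRNet these would be predicted)
    hop:        number of samples per frame
    """
    num_frames, taps = coeffs.shape
    # Left-pad with zeros so every output sample only depends on past input.
    padded = np.concatenate([np.zeros(taps - 1), excitation])
    out = np.zeros(num_frames * hop)
    for t in range(num_frames):
        start = t * hop
        # Current frame plus the (taps - 1) samples of history it needs.
        seg = padded[start:start + hop + taps - 1]
        # 'valid' mode returns exactly `hop` samples for this frame.
        out[start:start + hop] = np.convolve(seg, coeffs[t], mode="valid")
    return out

# Toy input: one unit impulse at the start of each of two frames.
x = np.zeros(8)
x[0] = 1.0
x[4] = 1.0
h = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = time_variant_fir(x, h, hop=4)
print(y)
```

With this toy input, each frame of the output reproduces the coefficients of the filter assigned to that frame, which shows that the filter switches at the frame boundary. When every frame uses the same coefficients, the result matches an ordinary (time-invariant) convolution truncated to the input length.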



Preprint

Camera-ready version is available here.



Samples

Waveforms from FIRNet (please beware of loud volume)

                   Ground truth   FIRNet
Mixed excitation   N/A            ext
Residual signal    gt_res         pred_res
Speech waveform    gt_speech      pred_speech


Comparison to conventional methods

Note: samples in blue columns (listed as follows) are not included in our ICASSP paper.
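Each f0 scaling condition in this comparison multiplies the f0 contour by a constant factor before synthesis. A minimal sketch of that operation follows; the function name `scale_f0` and the convention that unvoiced frames are stored as 0 (as in WORLD-style f0 contours) are assumptions, not details taken from the paper.

```python
import numpy as np

def scale_f0(f0, factor):
    """Scale voiced f0 values by `factor`; unvoiced frames (f0 == 0) stay 0."""
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 > 0, f0 * factor, 0.0)

# Hypothetical frame-level f0 contour in Hz (0 marks unvoiced frames).
f0 = np.array([0.0, 110.0, 120.0, 0.0, 130.0])
f0_up = scale_f0(f0, 2.0)    # one octave up
f0_down = scale_f0(f0, 0.5)  # one octave down
```

The scaled contour would then replace the original f0 input of the vocoder, with all other acoustic features left unchanged.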

f0 scaling condition: × 1.00

Original
WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 0.00

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 0.25

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 0.50

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 2.00

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 4.00

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)

f0 scaling condition: × 8.00

WORLD
SiFi-GAN (trained with 1000 utt.)
FIRNet (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 1000 utt.)
FIRNet w/ UnivNet disc. (trained with 18858 utt.)