(ICASSP 2020) Transformer-based neural text-to-speech with weighted forced attention

https://ieeexplore.ieee.org/document/9053915

Demo samples

Synthesized speech samples used in experiments
Original
STRAIGHT (analysis-synthesis)
WaveGlow (analysis-synthesis)
(A) Tacotron 2 + WaveGlow
(B) Transformer (FNN) + WaveGlow
(C) Transformer (Conv1D) + WaveGlow
(D) BLSTM + WaveGlow
(E) BLSTM+Taco2dec + WaveGlow
(F) Proposed (0.2) + WaveGlow
(G) Proposed (0.5) + WaveGlow
(H) Proposed (0.7) + WaveGlow
(I) Proposed (1.0) + WaveGlow
(J) FastSpeech (Default) + WaveGlow
(K) FastSpeech (w/o-DP) + WaveGlow
(L) FastSpeech (Simple) + WaveGlow

Synthesized speech samples used in additional experiments (not included in the proceedings)
Original
STRAIGHT (analysis-synthesis)
WaveGlow (analysis-synthesis)
WaveGlow 256 ch (analysis-synthesis)
Parallel WaveGAN (analysis-synthesis)
BLSTM+Taco2dec (only phoneme) + WaveGlow
Transformer (only phoneme) + WaveGlow
BLSTM + WaveGlow
Tacotron 2 + WaveGlow
BLSTM+Taco2dec + WaveGlow
Transformer + WaveGlow
Transformer + WaveGlow 256 ch
Transformer + Parallel WaveGAN
FastSpeech (Default) + WaveGlow
FastSpeech (w/o-DP) + WaveGlow