T. Okamoto, T. Toda, and H. Kawai, "E2E-S2S-VC: End-to-end sequence-to-sequence voice conversion," in Proc. Interspeech, Aug. 2023, pp. 2043–2047. [ISCA Archive][Open access (PDF)]
Highlights
1. We propose end-to-end (E2E) non-autoregressive (non-AR) sequence-to-sequence (S2S) voice conversion (VC) models, in which a single neural network directly converts the voice of a source speaker to that of a target speaker. The models are built on the E2E text-to-speech (TTS) models VITS and JETS.
2. The E2E-S2S-VC models, VITS-VC and JETS-VC, can be trained successfully with stable monotonic alignment search by introducing a reduction factor only for the encoder.
3. The proposed JETS-VC outperforms a pipeline non-AR S2S VC model (CFS2+PWG) in terms of both conversion quality and inference speed.
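To make the reduction factor in highlight 2 concrete, here is a minimal sketch in plain Python. It assumes the common S2S formulation in which a reduction factor r stacks every r consecutive input frames into one longer frame, shortening the encoder input sequence by a factor of r. The function name `apply_reduction_factor`, the toy feature sizes, and the choice to drop incomplete trailing frames are illustrative assumptions and are not taken from the released ESPnet2 code.

```python
from typing import List

Frame = List[float]


def apply_reduction_factor(frames: List[Frame], r: int) -> List[Frame]:
    """Stack every r consecutive frames into one longer frame.

    Trailing frames that do not fill a complete group are dropped.
    This is an assumption of the sketch; padding is an equally
    plausible choice.
    """
    if r < 1:
        raise ValueError("reduction factor must be >= 1")
    usable = len(frames) - len(frames) % r
    reduced: List[Frame] = []
    for start in range(0, usable, r):
        stacked: Frame = []
        for frame in frames[start:start + r]:
            stacked.extend(frame)  # concatenate along the feature axis
        reduced.append(stacked)
    return reduced


# Toy input (hypothetical sizes): 7 frames with 3 feature dimensions each.
source = [[float(t)] * 3 for t in range(7)]
reduced = apply_reduction_factor(source, r=2)
print(len(source), "->", len(reduced), "frames;", len(reduced[0]), "dims per frame")
```

With r=2, the 7 toy frames become 3 frames of 6 dimensions each. The "re=2" and "re=3" tags in the sample labels below presumably refer to this encoder reduction factor.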
The PyTorch source code of the E2E-S2S-VC models for ESPnet2, including a recipe for the Hi-Fi-CAPTAIN datasets and CMU ARCTIC databases, is open-sourced here.
[Converted speech samples of CMU ARCTIC databases]
Male to female conversion (Hi-Fi-CAPTAIN for Japanese)
Text (in Japanese): わたしはバスは時間が不安なので歩いています. (English: "I walk because I am worried about the bus timing.")
Source
Target
CFS2+PWG
CFS2'+HiFi-GAN (joint finetuning)
CFS2'+HiFi-GAN (joint training from scratch)
VITS-VC (melspc, re=2)
JETS-VC (melspc, re=2)
JETS-VC (melspc, re=3)
JETS-VC (linear, re=2)
Text (in Japanese): 残念ながら,日本で最も人気のスポーツとは言えないですね. (English: "Unfortunately, it cannot be called the most popular sport in Japan.")
Source
Target
CFS2+PWG
CFS2'+HiFi-GAN (joint finetuning)
CFS2'+HiFi-GAN (joint training from scratch)
VITS-VC (melspc, re=2)
JETS-VC (melspc, re=2)
JETS-VC (melspc, re=3)
JETS-VC (linear, re=2)
Text (in Japanese): 鈴木さん,あそこまで歩けますか? (English: "Suzuki-san, can you walk that far?")
Source
Target
CFS2+PWG
CFS2'+HiFi-GAN (joint finetuning)
CFS2'+HiFi-GAN (joint training from scratch)
VITS-VC (melspc, re=2)
JETS-VC (melspc, re=2)
JETS-VC (melspc, re=3)
JETS-VC (linear, re=2)
Female to male conversion (Hi-Fi-CAPTAIN for Japanese)