Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus with FIRNet Source-Filter Neural Vocoder

T. Okamoto, Y. Ohtani, S. Shimizu, T. Toda, and H. Kawai, "Challenge of singing voice synthesis using only text-to-speech corpus with FIRNet source-filter neural vocoder," in Proc. Interspeech, Sept. 2024, pp. 1870–1874. [ISCA Archive]

[Musical scores used in experiments (csv)]


Text-to-speech (TTS)

Text (in Japanese): 水をマレーシアから買わなくてはならないのです. (English: "We have to buy water from Malaysia.")
Original
CFS2 + HiFi-GAN (melspc) | CFS2 + HiFi-GAN (WORLD) | PESC + HiFi-GAN (WORLD)
CFS2 + FIRNet (WORLD) | PESC + FIRNet (WORLD)

Singing voice synthesis (SVS)

Lyrics (in Japanese): でんでんむしむしかたつむり (English: "Snail, snail, little snail")
Original
CFS2 + HiFi-GAN (melspc) | CFS2 + HiFi-GAN (WORLD) | PESC + HiFi-GAN (WORLD) | PESC + HiFi-GAN + Input fo shift (WORLD)
CFS2 + FIRNet (WORLD) | PESC + FIRNet (WORLD) | PESC + FIRNet + Input fo shift (WORLD)
PESC + FIRNet + Input fo shift + Direct fo input (WORLD) (not included in experiments)
T x 0.5: HiFi-GAN (WORLD) | T x 0.5: FIRNet (WORLD)
fo x 0.5: HiFi-GAN (WORLD) | fo x 0.5: FIRNet (WORLD)
fo x 2.0: HiFi-GAN (WORLD) | fo x 2.0: FIRNet (WORLD)


Lyrics (in Japanese): げんこつやまのたぬきさん (English: "The raccoon dog of Genkotsu Mountain")
Original
CFS2 + HiFi-GAN (melspc) | CFS2 + HiFi-GAN (WORLD) | PESC + HiFi-GAN (WORLD) | PESC + HiFi-GAN + Input fo shift (WORLD)
CFS2 + FIRNet (WORLD) | PESC + FIRNet (WORLD) | PESC + FIRNet + Input fo shift (WORLD)
PESC + FIRNet + Input fo shift + Direct fo input (WORLD) (not included in experiments)
T x 0.5: HiFi-GAN (WORLD) | T x 0.5: FIRNet (WORLD)
fo x 0.5: HiFi-GAN (WORLD) | fo x 0.5: FIRNet (WORLD)
fo x 2.0: HiFi-GAN (WORLD) | fo x 2.0: FIRNet (WORLD)