Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus with FIRNet Source-Filter Neural Vocoder
T. Okamoto, Y. Ohtani, S. Shimizu, T. Toda, and H. Kawai, "Challenge of singing voice synthesis using only text-to-speech corpus with FIRNet source-filter neural vocoder," in Proc. Interspeech, Sept. 2024, pp. 1870–1874. [ISCA Archive] [Musical scores used in experiments (csv)]
Text-to-speech (TTS)
Text (in Japanese): 水をマレーシアから買わなくてはならないのです. (English: "We have to buy water from Malaysia.")
Original
CFS2 + HiFi-GAN (melspc)
CFS2 + HiFi-GAN (WORLD)
PESC + HiFi-GAN (WORLD)
CFS2 + FIRNet (WORLD)
PESC + FIRNet (WORLD)
Singing voice synthesis (SVS)
Lyrics (in Japanese): でんでんむしむしかたつむり (romanized: "Denden mushi-mushi katatsumuri"; roughly, "Snail, snail, little snail")
Original
CFS2 + HiFi-GAN (melspc)
CFS2 + HiFi-GAN (WORLD)
PESC + HiFi-GAN (WORLD)
PESC + HiFi-GAN + Input fo shift (WORLD)
CFS2 + FIRNet (WORLD)
PESC + FIRNet (WORLD)
PESC + FIRNet + Input fo shift (WORLD)
PESC + FIRNet + Input fo shift + Direct fo input (WORLD; not included in experiments)
T x 0.5: HiFi-GAN (WORLD)
T x 0.5: FIRNet (WORLD)
fo x 0.5: HiFi-GAN (WORLD)
fo x 0.5: FIRNet (WORLD)
fo x 2.0: HiFi-GAN (WORLD)
fo x 2.0: FIRNet (WORLD)
Lyrics (in Japanese): げんこつやまのたぬきさん (romanized: "Genkotsu-yama no tanuki-san"; roughly, "The raccoon dog of Genkotsu Mountain")
Original
CFS2 + HiFi-GAN (melspc)
CFS2 + HiFi-GAN (WORLD)
PESC + HiFi-GAN (WORLD)
PESC + HiFi-GAN + Input fo shift (WORLD)
CFS2 + FIRNet (WORLD)
PESC + FIRNet (WORLD)
PESC + FIRNet + Input fo shift (WORLD)
PESC + FIRNet + Input fo shift + Direct fo input (WORLD; not included in experiments)