Katakana-based TTS Model Training: Initially trains a Japanese TTS acoustic model using katakana sequence input, without relying on accentual labels.
Mora-level \(f_{\mathrm{o}}\) Extraction: Uses forced alignment to extract mora-level fundamental frequency (\(f_{\mathrm{o}}\)) values from the training data, capturing prosodic information (a minimal extraction sketch follows this list).
BERT Fine-tuning for \(f_{\mathrm{o}}\) Prediction: Fine-tunes a pre-trained Japanese BERT model to predict mora-level \(f_{\mathrm{o}}\) values, using word sequences (including kanji) and their katakana representations as input (see the fine-tuning sketch below).
Integration of Predicted \(f_{\mathrm{o}}\) in TTS: During inference, feeds the BERT-predicted mora-level \(f_{\mathrm{o}}\) together with the katakana sequence into the TTS acoustic model, enabling prosodically correct synthesis (see the inference sketch below).
Label-Free Prosody Prediction: Achieves high-quality prosody prediction without requiring explicit accentual labels, reducing reliance on costly manual annotations.
Competitive Performance: Demonstrates synthesis quality and accent correctness comparable to or surpassing conventional neural TTS models that use explicit accentual labels.
Potential for Scalability: Shows promise for application to larger datasets and adaptation to multi-speaker TTS systems, with possible extensions to other languages and prosodic features.
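To make the \(f_{\mathrm{o}}\) extraction step concrete, here is a minimal sketch of mora-level \(f_{\mathrm{o}}\) extraction, assuming forced alignment has already produced per-mora time intervals. The interval format, the use of the WORLD analyzer (pyworld), and the log-averaging are illustrative choices, not the authors' exact pipeline.

```python
# A minimal sketch: average log-f_o over each mora's aligned span.
# Assumes `intervals` is a list of (mora, start_sec, end_sec) tuples
# produced by a forced aligner; this format is an assumption.
import numpy as np
import pyworld
import soundfile as sf

def extract_mora_fo(wav_path, intervals, frame_period_ms=5.0):
    """Return one log-f_o value per mora, averaged over its voiced frames."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)                       # WORLD expects float64
    f0, t = pyworld.dio(x, fs, frame_period=frame_period_ms)
    f0 = pyworld.stonemask(x, f0, t, fs)           # refine the coarse DIO estimate

    mora_fo = []
    for mora, start, end in intervals:
        # frames whose timestamps fall inside this mora's aligned span
        frames = f0[(t >= start) & (t < end)]
        voiced = frames[frames > 0]                # 0 marks unvoiced frames in WORLD
        # unvoiced morae get NaN; the regression loss can mask these out later
        mora_fo.append(np.log(voiced).mean() if len(voiced) else np.nan)
    return mora_fo

# e.g. extract_mora_fo("sample.wav", [("コ", 0.10, 0.19), ("ン", 0.19, 0.27)])
```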
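For the fine-tuning step, a minimal sketch: a pre-trained Japanese BERT (the cl-tohoku/bert-base-japanese checkpoint is assumed here; its tokenizer needs the fugashi and ipadic packages) with a linear regression head emitting one log-\(f_{\mathrm{o}}\) value per token. Mapping subword tokens to morae is reduced to a placeholder mask; the paper's actual input combines word sequences with their katakana representations.

```python
# A minimal sketch of BERT fine-tuning for per-token log-f_o regression.
# The checkpoint name and the one-value-per-token simplification are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertFoPredictor(nn.Module):
    def __init__(self, name="cl-tohoku/bert-base-japanese"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden).squeeze(-1)       # (batch, seq_len) log-f_o

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
model = BertFoPredictor()
batch = tokenizer(["今日は良い天気です"], return_tensors="pt")
pred = model(batch["input_ids"], batch["attention_mask"])

# MSE on voiced positions only; real targets/masks come from the extraction step.
target = torch.zeros_like(pred)                    # placeholder log-f_o targets
voiced_mask = torch.ones_like(pred).bool()         # placeholder voiced mask
loss = nn.functional.mse_loss(pred[voiced_mask], target[voiced_mask])
loss.backward()
```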
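And for inference-time integration, a short sketch of how the predicted contour might condition synthesis; `acoustic_model` and `vocoder` are hypothetical stand-ins for the trained TTS components, since the real conditioning interface depends on the acoustic model architecture.

```python
# A hypothetical sketch of inference: predict mora-level f_o with the
# fine-tuned BERT, then condition the katakana-based acoustic model on it.
import torch

def synthesize(text, katakana, tokenizer, fo_model, acoustic_model, vocoder):
    batch = tokenizer([text], return_tensors="pt")
    with torch.no_grad():
        mora_fo = fo_model(batch["input_ids"], batch["attention_mask"])
    mel = acoustic_model(katakana, mora_fo)        # hypothetical interface
    return vocoder(mel)
```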
Speech Synthesis Comparison
Note on Text Highlighting
In the audio samples below, text highlighted in red marks portions where the prosody (intonation, stress, or rhythm) of the synthesized speech differs from the prosody expected in natural speech.
Kanji Homophones: Accent Disambiguation
Examples of kanji homophones: words written with different kanji that share the same reading but differ in meaning and, often, in pitch accent.
In this sample, the word 'ズキンズキン' (zukinzukin, an onomatopoeia for throbbing pain) does not appear in the training data, which limits the Katakana-BERT model's ability to handle it and may reduce the accuracy of accent prediction for this expression.