NICT-Tib1: A public speech corpus of Lhasa dialect for benchmarking Tibetan language speech recognition systems developed by NICT
What's new
- 2024/8/27: The corpus has been released.
Overview
This dataset is released by NICT under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It is a speech corpus of Tibetan, a low-resource language, consisting of recorded audio and the corresponding transcriptions. It contains 33.5 hours of audio of male and female speakers (8 men and 12 women, 15-30 years old) reading Tibetan news manuscripts aloud. The speakers are native speakers who regularly use the Lhasa dialect of Tibetan. Research and development of speech recognition requires both a test set for evaluating performance and training data for model learning. This corpus can be used for both purposes.
Download
Extracted directory
The files are compressed in 'zip' format (~3.0 GB). The extracted directory should look like the following.
-------------------------------------------------------------------------------------------
Tibetan/
    data/
        speaker-id/
            speaker-session-id/
                wave-files
    wav.scp (kaldi format)
    label.txt (kaldi format)
    README
-------------------------------------------------------------------------------------------
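As a rough illustration of how the Kaldi-format index files might be read, the following Python sketch pairs each utterance with its audio entry and transcription. It assumes that wav.scp holds one `<utterance-id> <audio path or command>` entry per line, that label.txt follows the layout of a Kaldi `text` file (`<utterance-id> <transcription>`), and that both files sit somewhere under the extracted directory. The function names `read_kaldi_table` and `load_corpus` are hypothetical helpers, not part of the release; check the README in the archive for the actual conventions.

```python
from pathlib import Path


def read_kaldi_table(path):
    """Parse a Kaldi-style table: one '<key> <value...>' entry per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split at the first whitespace run only; the value may contain spaces.
            parts = line.split(maxsplit=1)
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            table[key] = value
    return table


def load_corpus(root):
    """Map each utterance ID to (audio entry, transcription).

    Assumption: wav.scp and label.txt each exist exactly once under `root`.
    """
    root = Path(root)
    # Search recursively, because the exact location of the index files is assumed.
    wav_scp = next(root.rglob("wav.scp"))
    label_txt = next(root.rglob("label.txt"))
    wavs = read_kaldi_table(wav_scp)
    labels = read_kaldi_table(label_txt)
    # Keep only utterances that appear in both files.
    return {utt: (wavs[utt], labels[utt]) for utt in wavs if utt in labels}
```

A call such as `load_corpus("Tibetan")` would then return a dictionary keyed by utterance ID. In Kaldi, the value in wav.scp may be a relative path or a piped command rather than an absolute path, so it may need further resolution before the audio is opened.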
Citation
Please cite the following when using the corpus.
-------------------------------------------------------------------------------------------
@INPROCEEDINGS{nict-tib1,
  author={Soky, Kak and Gong, Zhuo and Li, Sheng},
  booktitle={Proc. O-COCOSDA},
  title={Nict-Tib1: A Public Speech Corpus Of Lhasa Dialect For Benchmarking Tibetan Language Speech Recognition Systems},
  year={2022},
  pages={1-5},
  doi={10.1109/O-COCOSDA202257103.2022.9997917}}
-------------------------------------------------------------------------------------------