Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset. (arXiv:2201.02419v1 [cs.CL])
cs.CL updates on arXiv.org arxiv.org
Automatic speech recognition (ASR) for low-resource languages improves the access
of linguistic minorities to the technological advantages provided by Artificial
Intelligence (AI). In this paper, we address the problem of data scarcity for Hong
Kong Cantonese by creating a new Cantonese dataset. Our dataset,
Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read
speech paired with transcripts, collected from Cantonese audiobooks from Hong
Kong. It combines philosophy, politics, education, culture, lifestyle and
family domains, covering a wide range …