Jan. 10, 2022, 2:10 a.m. | Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma,

cs.CL updates on arXiv.org arxiv.org

Automatic speech recognition (ASR) on low resource languages improves access
of linguistic minorities to technological advantages provided by Artificial
Intelligence (AI). In this paper, we address a problem of data scarcity of Hong
Kong Cantonese language by creating a new Cantonese dataset. Our dataset,
Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read
speech paired with transcripts, collected from Cantonese audiobooks from Hong
Kong. It combines philosophy, politics, education, culture, lifestyle and
family domains, covering a wide range …

arxiv dataset datasets language speech speech recognition survey

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Social Insights & Data Analyst (Freelance)

@ Media.Monks | Jakarta

Cloud Data Engineer

@ Arkatechture | Portland, ME, USA