all AI news
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
June 14, 2024, 4:42 a.m. | Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng
cs.CL updates on arXiv.org arxiv.org
Abstract: Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and …
abstract accessibility arxiv challenge china cs.cl current data language language models languages large language large language models low multilingual nlp pre-training replace training training data transparent type understanding
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Senior Data Engineer
@ Displate | Warsaw
Senior Principal Software Engineer
@ Oracle | Columbia, MD, United States
Software Engineer for Manta Systems
@ PXGEO | Linköping, Östergötland County, Sweden
DevOps Engineer
@ Teradyne | Odense, DK
LIDAR System Engineer Trainee
@ Valeo | PRAGUE - PRA2
Business Applications Administrator
@ Allegro | Poznań, Poland