June 14, 2024, 4:42 a.m. | Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng

cs.CL updates on arXiv.org

arXiv:2311.08348v2 Announce Type: replace
Abstract: Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and …

