MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

June 14, 2024, 4:42 a.m. | Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng

cs.CL updates on arXiv.org

arXiv:2311.08348v2 Announce Type: replace
Abstract: Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and …


