April 19, 2024, 4:47 a.m. | Geyu Lin, Bin Wang, Zhengyuan Liu, Nancy F. Chen

cs.CL updates on arXiv.org

arXiv:2404.11932v1 Announce Type: new
Abstract: Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared …
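The abstract is truncated, so the exact CrossIn data recipe is not reproduced here. As a hedged illustration of what a "mixed composition of cross-lingual instruction tuning data" could look like, the Python sketch below builds a dataset that combines monolingual examples with cross-lingual ones (instruction in one language, response in another). The function name `make_mixed_examples`, its input schema, and the particular mix of pair types are hypothetical assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of a mixed cross-lingual instruction-tuning corpus.
# The exact CrossIn composition is not given in the truncated abstract;
# the mix below (monolingual + cross-lingual pairs) is an assumption.
import random

def make_mixed_examples(parallel_examples, target_langs, seed=0):
    """Compose instruction/response examples across languages.

    parallel_examples: list of dicts mapping a language code to a
        translated version of the same example, e.g.
        {"en": {"instruction": ..., "response": ...},
         "zh": {"instruction": ..., "response": ...}}
    target_langs: language codes to draw from.
    """
    rng = random.Random(seed)
    mixed = []
    for ex in parallel_examples:
        langs = [l for l in target_langs if l in ex]
        if len(langs) < 2:
            continue
        src, tgt = rng.sample(langs, 2)
        # 1) Monolingual example, possibly in a non-English language.
        mixed.append({"instruction": ex[src]["instruction"],
                      "response": ex[src]["response"]})
        # 2) Cross-lingual example: instruction in one language,
        #    response in another, encouraging the model to align
        #    shared knowledge across languages.
        mixed.append({"instruction": ex[src]["instruction"],
                      "response": ex[tgt]["response"]})
    return mixed
```

For instance, feeding this function a parallel English/Chinese instruction set would yield both same-language and mixed-language training pairs, which is one plausible way to expose a model to cross-lingual alignment signal during instruction tuning.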

