Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
March 21, 2024, 4:48 a.m. | Meet Doshi, Raj Dabre, Pushpak Bhattacharyya
cs.CL updates on arXiv.org
Abstract: In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train …