Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
April 17, 2024, 4:46 a.m. | Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer
cs.CL updates on arXiv.org
Abstract: There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop SWIM-IR, a synthetic retrieval training dataset containing 33 (high to very-low resource) languages for fine-tuning multilingual dense retrievers without requiring any …