all AI news
A New Massive Multilingual Dataset for High-Performance Language Technologies
March 22, 2024, 4:47 a.m. | Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Ba\~n\'on, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ram\'ire
cs.CL updates on arXiv.org arxiv.org
Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word …
abstract acquisition arxiv bilingual cs.cl data dataset internet language language resources management massive multilingual performance processing resources technologies type web
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US
Research Engineer
@ Allora Labs | Remote
Ecosystem Manager
@ Allora Labs | Remote
Founding AI Engineer, Agents
@ Occam AI | New York
AI Engineer Intern, Agents
@ Occam AI | US