April 8, 2024, 8:56 a.m. | /u/SeawaterFlows

Machine Learning www.reddit.com

**Paper**: [https://arxiv.org/abs/2403.14009](https://arxiv.org/abs/2403.14009)

**Project page**: [https://hplt-project.org/](https://hplt-project.org/)

**Datasets**: [https://hplt-project.org/datasets/v1.2](https://hplt-project.org/datasets/v1.2)

**GitHub**: [https://github.com/hplt-project](https://github.com/hplt-project)

**Abstract**:

>We present the **HPLT** (**High Performance Language Technologies**) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a …

