[R] A New Massive Multilingual Dataset for High-Performance Language Technologies
April 8, 2024, 8:56 a.m. | /u/SeawaterFlows
Machine Learning www.reddit.com
**Project page**: [https://hplt-project.org/](https://hplt-project.org/)
**Datasets**: [https://hplt-project.org/datasets/v1.2](https://hplt-project.org/datasets/v1.2)
**GitHub**: [https://github.com/hplt-project](https://github.com/hplt-project)
**Abstract**:
>We present the **HPLT** (**High Performance Language Technologies**) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a …