Multilingual Open Text 1.0: Public Domain News in 44 Languages. (arXiv:2201.05609v1 [cs.CL])
Jan. 17, 2022, 2:10 a.m. | Chester Palen-Michel, June Kim, Constantine Lignos
cs.CL updates on arXiv.org
We present a new multilingual corpus containing text in 44 languages, many of
which have relatively few existing resources for natural language processing.
The first release of the corpus contains over 2.7 million news articles and 1
million shorter passages published between 2001 and 2021, collected from Voice of
America news websites. We describe our process for collecting, filtering, and
processing the data. The source material is in the public domain, and our
collection is licensed using a Creative Commons license (CC BY …