[R] CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages - 6.3 trillion tokens
Sept. 19, 2023, 8:10 a.m. | /u/InterviewIntrepid889
Machine Learning www.reddit.com
Hugging Face datasets: [https://huggingface.co/datasets/uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
Abstract:
>The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing …