NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
March 5, 2024, 2:52 p.m. | Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo
Source: cs.CL updates on arXiv.org (arxiv.org)
Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous …