March 5, 2024, 2:52 p.m. | Wilson Wongso, David Samuel Setiawan, Steven Limcorn, Ananto Joyoadikusumo

cs.CL updates on arXiv.org arxiv.org

arXiv:2403.01817v1 Announce Type: new
Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous …

abstract arxiv challenges code cs.cl diverse diversity indonesia landscape language language models languages low making modern multilingual practice regional teaching type world

Data Engineer

@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania

Artificial Intelligence – Bioinformatic Expert

@ University of Texas Medical Branch | Galveston, TX

Lead Developer (AI)

@ Cere Network | San Francisco, US

Research Engineer

@ Allora Labs | Remote

Ecosystem Manager

@ Allora Labs | Remote

Founding AI Engineer, Agents

@ Occam AI | New York