March 28, 2024, 4:48 a.m. | Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira

cs.CL updates on arXiv.org

arXiv:2403.18140v1 Announce Type: new
Abstract: The high computational cost associated with pretraining large language models limits their research. Two strategies have emerged to address this issue: domain specialization and pretraining with high-quality data. To explore these strategies, we specialized the Sabi\'a-2 Small model with 1.9 billion unique tokens from reputable Brazilian legal sources and conducted few-shot evaluations on legal and general knowledge exams. Our model, Juru, demonstrates the benefits of domain specialization with a reduced amount of pretraining data. However, …
