May 16, 2024, 4:46 a.m. | Mihailo Škorić, Nikola Janković

cs.CL updates on arXiv.org

arXiv:2405.09250v1 Announce Type: new
Abstract: This paper will present textual corpora for Serbian (and Serbo-Croatian) that are usable for training large language models and publicly available at one of several notable online repositories. Each corpus will be classified using multiple methods, and its characteristics will be detailed. Additionally, the paper will introduce three new corpora: a new umbrella web corpus of Serbo-Croatian, a new high-quality corpus based on the doctoral dissertations stored within the National Repository of Doctoral Dissertations from …

