March 19, 2024, 4:44 a.m. | Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze

cs.LG updates on arXiv.org

arXiv:2310.08754v4 Announce Type: replace
Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and …

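The abstract is truncated above, but the core idea of ablating tokenizer choices can be illustrated with a small sketch. The snippet below is not from the paper; it assumes the Hugging Face transformers library and uses two publicly available tokenizers (gpt2 and xlm-roberta-base) as stand-in examples to compare fertility (subword tokens per whitespace-separated word), one common intrinsic metric in tokenizer comparisons, on a couple of illustrative sentences.

# Minimal sketch: comparing tokenizer fertility across languages.
# The tokenizers and sample sentences are illustrative stand-ins,
# not the configurations studied in the paper.
from transformers import AutoTokenizer

SAMPLES = {
    "en": "Large language models are trained on vast text corpora.",
    "de": "Große Sprachmodelle werden auf riesigen Textkorpora trainiert.",
}

# An English-centric BPE tokenizer vs. a multilingual one (assumed choices).
TOKENIZERS = ["gpt2", "xlm-roberta-base"]

for name in TOKENIZERS:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in SAMPLES.items():
        n_tokens = len(tok.tokenize(text))   # subword tokens produced
        n_words = len(text.split())          # rough whitespace word count
        print(f"{name:18s} {lang}: {n_tokens / n_words:.2f} tokens/word")

Lower fertility means the tokenizer packs more of a language's text into each token, which is one of the properties such tokenizer ablations tend to compare before committing to a full pretraining run.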
