all AI news
Tokenizer Choice For LLM Training: Negligible or Crucial?
March 19, 2024, 4:44 a.m. | Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max L\"ubbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze
cs.LG updates on arXiv.org arxiv.org
Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and …
abstract architectures arxiv blind blind spot cs.lg dataset influence language language models large language large language models light llm llms pretraining scaling spot success training type
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Software Engineer for AI Training Data (School Specific)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Python)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Tier 2)
@ G2i Inc | Remote
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US