Tokenizer Choice For LLM Training: Negligible or Crucial?

March 19, 2024, 4:44 a.m. | Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze

cs.LG updates on arXiv.org

arXiv:2310.08754v4 Announce Type: replace
Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and …
