Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Feb. 26, 2024, 5:42 a.m. | Aaditya K. Singh, DJ Strouse
cs.LG updates on arXiv.org
Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and …
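To make the idea of single-digit tokenization concrete, here is a minimal sketch. It is illustrative only and is not the tokenizer used by LLaMa, PaLM, or any other model: the function name `single_digit_tokenize` and the regular expression are assumptions chosen for this example. Every digit becomes its own token, while other runs of non-space characters are kept whole.

```python
import re


def single_digit_tokenize(text: str) -> list[str]:
    """Hypothetical helper: split text so that each digit is its own token.

    Runs of non-digit, non-whitespace characters stay together as one token.
    Whitespace is dropped. Real LLM tokenizers are far more involved.
    """
    return re.findall(r"\d|[^\d\s]+", text)


if __name__ == "__main__":
    # Each digit of every number is emitted as a separate token.
    print(single_digit_tokenize("1234 + 56 = 1290"))
    # ['1', '2', '3', '4', '+', '5', '6', '=', '1', '2', '9', '0']
```

Under a scheme like this, a number's token count equals its digit count, so digits in the same position of two operands line up one token per digit.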