all AI news
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Feb. 26, 2024, 5:42 a.m. | Aaditya K. Singh, DJ Strouse
cs.LG updates on arXiv.org arxiv.org
Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and …
abstract arxiv biases cs.cl cs.lg encoding impact inductive language language model large language large language model llm llms pipeline text tokenization tokens type
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne
Senior Machine Learning Engineer (MLOps)
@ Promaton | Remote, Europe
Intern Large Language Models Planning (f/m/x)
@ BMW Group | Munich, DE
Data Engineer Analytics
@ Meta | Menlo Park, CA | Remote, US