Feb. 26, 2024, 5:42 a.m. | Aaditya K. Singh, DJ Strouse

cs.LG updates on arXiv.org

arXiv:2402.14903v1 Announce Type: cross
Abstract: Tokenization, the division of input text into input tokens, is an often overlooked aspect of the large language model (LLM) pipeline and could be the source of useful or harmful inductive biases. Historically, LLMs have relied on byte pair encoding, without care to specific input domains. With the increased use of LLMs for reasoning, various number-specific tokenization schemes have been adopted, with popular models like LLaMa and PaLM opting for single-digit tokenization while GPT-3.5 and …
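To make the single-digit scheme mentioned in the abstract concrete, here is a minimal sketch in Python. It is an illustration only, not the tokenizer used by LLaMa, PaLM, or any other model: the function name `single_digit_tokenize` and the regular expression are assumptions introduced here, and real tokenizers operate on subword vocabularies rather than on a simple regex.

```python
import re


def single_digit_tokenize(text: str) -> list[str]:
    """Hypothetical helper: split text so that every digit is its own token.

    This only illustrates the idea of single-digit tokenization; it is not
    the implementation used by any particular model.
    """
    # \d          -> exactly one digit per token
    # [^\W\d]+    -> a run of letters or underscores (word characters that are not digits)
    # [^\w\s]     -> a single punctuation character
    # Whitespace matches none of the alternatives and is dropped.
    return re.findall(r"\d|[^\W\d]+|[^\w\s]", text)


tokens = single_digit_tokenize("Add 1234 and 56.")
print(tokens)  # ['Add', '1', '2', '3', '4', 'and', '5', '6', '.']
```

Under this scheme a number with n digits always becomes n tokens, so the token count grows with the length of the number rather than depending on which digit groups happen to be in a learned vocabulary.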

