Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
March 4, 2024, 5:47 a.m. | Jinbiao Yang
cs.CL updates on arXiv.org
Abstract: Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from the word level to the subword level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they struggle with non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers, more …
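The truncated abstract leans on Byte Pair Encoding as its running example. As background, the sketch below shows the greedy merge loop at the heart of BPE training, following the well-known toy setup from Sennrich et al. (2016); the corpus, the `</w>` end-of-word marker, and the ten-merge budget are illustrative choices, not anything specified by this paper.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        syms = word.split()
        for pair in zip(syms, syms[1:]):
            counts[pair] += freq
    return counts

def merge(pair, corpus):
    """Rewrite each word with the given pair fused into one symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        syms = word.split()
        out, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                out.append(syms[i] + syms[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(syms[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Toy corpus: space-separated characters plus an end-of-word marker.
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

for step in range(10):  # merge budget stands in for the vocabulary-size cap
    counts = pair_counts(corpus)
    if not counts:
        break
    best = max(counts, key=counts.get)
    corpus = merge(best, corpus)
    print(f"merge {step + 1}: {best}")
```

Each merge fuses the most frequent adjacent symbol pair into a new vocabulary type, which is exactly the token/type trade-off the abstract describes: more merges yield longer, rarer types and shorter token sequences.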