all AI news
Zero-Shot Tokenizer Transfer
May 14, 2024, 4:50 a.m. | Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vuli\'c
cs.CL updates on arXiv.org arxiv.org
Abstract: Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in …
abstract arxiv cs.cl efficiency english example flexibility language language models languages lms maps natural programming programming languages raw text tokens transfer type zero-shot
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Senior Machine Learning Engineer
@ GPTZero | Toronto, Canada
Customer Data Analyst with Spanish
@ Michelin | Voluntari
HC Data Analyst - Senior
@ Leidos | 1662 Intelligence Community Campus - Bethesda MD
Healthcare Research & Data Analyst- Infectious, Niche, Rare Disease
@ Clarivate | Remote (121- Massachusetts)
Data Analyst (maternity leave cover)
@ Clarivate | R155-Belgrade
Sales Enablement Data Analyst (Remote)
@ CrowdStrike | USA TX Remote