May 14, 2024, 4:50 a.m. | Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

cs.CL updates on arXiv.org

arXiv:2405.07883v1 Announce Type: new
Abstract: Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in …
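
To make the efficiency point concrete, here is a minimal sketch (using the Hugging Face `transformers` library; the tokenizer names and the German example sentence are illustrative choices, not from the paper) that counts how many tokens an English-centric tokenizer produces for non-English text compared with a tokenizer trained on multilingual data:

```python
# Sketch: compare tokenization efficiency of an English-centric tokenizer
# vs. a multilingual one on non-English text. Tokenizer choices ("gpt2",
# "xlm-roberta-base") and the sample sentence are assumptions for
# illustration only.
from transformers import AutoTokenizer

english_centric = AutoTokenizer.from_pretrained("gpt2")           # English-centric BPE
multilingual = AutoTokenizer.from_pretrained("xlm-roberta-base")  # multilingual SentencePiece

text = "Die Katze sitzt auf der Fensterbank und beobachtet die Vögel."

for name, tok in [("gpt2", english_centric), ("xlm-roberta-base", multilingual)]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```

Typically the English-centric tokenizer fragments non-English words into many more subword pieces, producing longer sequences and higher inference cost; this is the inefficiency that motivates swapping in a tokenizer better suited to the target text.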

