May 14, 2024, 4:50 a.m. | Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

cs.CL updates on arXiv.org

arXiv:2405.07883v1 Announce Type: new
Abstract: Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in …
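
To make the efficiency point concrete, here is a minimal sketch (using the Hugging Face `transformers` library; the tokenizer names and the German example sentence are illustrative choices, not from the paper) that counts how many tokens an English-centric tokenizer produces for non-English text compared with a tokenizer trained on multilingual data:

```python
# Sketch: compare tokenization efficiency of an English-centric tokenizer
# vs. a multilingual one on non-English text. Tokenizer choices ("gpt2",
# "xlm-roberta-base") and the sample sentence are assumptions for
# illustration only.
from transformers import AutoTokenizer

english_centric = AutoTokenizer.from_pretrained("gpt2")           # English-centric BPE
multilingual = AutoTokenizer.from_pretrained("xlm-roberta-base")  # multilingual SentencePiece

text = "Die Katze sitzt auf der Fensterbank und beobachtet die Vögel."

for name, tok in [("gpt2", english_centric), ("xlm-roberta-base", multilingual)]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```

Typically the English-centric tokenizer fragments non-English words into many more subword pieces, producing longer sequences and higher inference cost; this is the inefficiency that motivates swapping in a tokenizer better suited to the target text.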

