Multi-Word Tokenization for Sequence Compression
Feb. 16, 2024, 5:43 a.m. | Leonidas Gee, Leonardo Rigutini, Marco Ernandes, Andrea Zugarini
cs.LG updates on arXiv.org
Abstract: Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a …
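The core idea of the abstract can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (the function names and the greedy bigram merging are assumptions, not the paper's actual MWT implementation, which operates on top of a subword tokenizer): it counts the most frequent word bigrams in a corpus and merges each occurrence into a single token, shortening the tokenized sequence.

```python
from collections import Counter

def build_mwt_vocab(corpus, top_k):
    """Collect the top_k most frequent word bigrams as multi-word tokens.
    (Illustrative only; the paper's MWT learns multi-word expressions
    over a pretrained subword vocabulary.)"""
    counts = Counter()
    for text in corpus:
        words = text.split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, _ in counts.most_common(top_k)}

def mwt_tokenize(text, mwt_vocab):
    """Greedy left-to-right tokenization: whenever the next two words form
    a known multi-word expression, emit them as one merged token."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in mwt_vocab:
            tokens.append(words[i] + "_" + words[i + 1])  # single token
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

corpus = ["large language models are large language models"]
vocab = build_mwt_vocab(corpus, top_k=2)
print(mwt_tokenize("large language models rock", vocab))
# 4 words compress to 3 tokens: ['large_language', 'models', 'rock']
```

A shorter token sequence directly reduces the quadratic attention cost of a Transformer, which is the efficiency benefit the abstract refers to.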