Two Counterexamples to "Tokenization and the Noiseless Channel"
Feb. 23, 2024, 5:48 a.m. | Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki
cs.CL updates on arXiv.org (arxiv.org)
Abstract: In "Tokenization and the Noiseless Channel" (Zouhar et al., 2023), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer that yields the highest Rényi efficiency of the unigram distribution should be chosen. Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task) without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of …
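As a rough illustration of the metric the abstract refers to, the following is a minimal Python sketch, assuming (per Zouhar et al., 2023) that Rényi efficiency is the order-alpha Rényi entropy of a tokenizer's unigram token distribution normalized by the log of the observed vocabulary size; the alpha value and the toy tokenizations below are illustrative assumptions, not part of the paper.

    # Sketch: Rényi efficiency of a tokenizer's unigram distribution.
    # Assumption: efficiency = H_alpha(p) / log|V|, where H_alpha is the
    # Rényi entropy of the unigram token distribution over a corpus.
    import math
    from collections import Counter

    def renyi_efficiency(tokens, alpha=2.5):
        """Rényi efficiency of the unigram distribution over `tokens`."""
        counts = Counter(tokens)
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        vocab_size = len(counts)
        if vocab_size <= 1:
            return 0.0
        if alpha == 1.0:
            # Shannon entropy as the alpha -> 1 limit.
            entropy = -sum(p * math.log(p) for p in probs)
        else:
            entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
        return entropy / math.log(vocab_size)

    # Hypothetical example: two tokenizations of the same text; under this
    # criterion the higher-efficiency tokenizer would be preferred.
    tok_a = "the cat sat on the mat".split()   # word-level tokens
    tok_b = list("the cat sat on the mat")     # character-level tokens
    print(renyi_efficiency(tok_a), renyi_efficiency(tok_b))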