On the Effect of (Near) Duplicate Subwords in Language Modelling
April 10, 2024, 4:42 a.m. | Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel
cs.LG updates on arXiv.org (arxiv.org)
Abstract: Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: because it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as "now" and "Now". We refer to such subwords as near duplicates. In this paper, …
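The near-duplicate problem described above can be sketched with a toy example (a minimal illustration, not the paper's code or any real tokeniser): two subwords differing only in case get distinct, arbitrary indices, so the LM sees no character-level link between them.

```python
# Hypothetical toy vocabulary: "now"/"Now" and "house"/"House" are near
# duplicates -- separate entries whose indices carry no relation at all.
vocab = {"now": 17, "Now": 4821, "house": 93, "House": 2750}

def tokenise(words):
    """Map each word to its arbitrary vocabulary index."""
    return [vocab[w] for w in words]

# The resulting ids give the LM no hint that the two subwords are similar.
print(tokenise(["Now", "now"]))
```

Because the indices are unrelated, whatever the model learns about `now` must be re-learned for `Now` from scratch, which is the sample-efficiency cost the abstract points to.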