April 10, 2024, 4:42 a.m. | Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

cs.LG updates on arXiv.org

arXiv:2404.06508v1 Announce Type: cross
Abstract: Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: because it removes character-level information, it can make it harder for LMs to generalise across similar subwords, such as "now" and "Now". We refer to such subwords as near duplicates. In this paper, …
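The following is a minimal sketch of the idea described in the abstract, not the paper's code: a toy subword vocabulary with greedy longest-match encoding, showing how tokenisation maps near-duplicate subwords to arbitrary, unrelated indices. The vocabulary, index values, and the `encode` helper are all invented here for illustration.

```python
# Toy subword vocabulary; the entries and IDs are hypothetical.
TOY_VOCAB = {
    "now": 17,
    "Now": 4203,   # near duplicate of "now", yet an unrelated index
    "here": 58,
    "Here": 912,
    " ": 3,
    ",": 7,
}

def encode(text: str) -> list[int]:
    """Greedy longest-match tokenisation over TOY_VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i : i + length]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no subword covers {text[i]!r}")
    return ids

print(encode("now here"))  # [17, 3, 58]
print(encode("Now here"))  # [4203, 3, 58]
# The two inputs differ only in casing, but the LM receives
# unrelated indices: the character-level similarity is gone.
```

Once subwords are replaced by opaque indices like these, the model must learn any relationship between "now" and "Now" purely from co-occurrence statistics, which is the sample-efficiency concern the abstract raises.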
