April 9, 2024, 4:44 a.m. | Yilin Wang, Xinyi Hu, Matthew R. Gormley

cs.LG updates on arXiv.org arxiv.org

arXiv:2311.07853v2 Announce Type: replace-cross
Abstract: Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by …
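To make the notion of "granularities" in the abstract concrete, below is a minimal, self-contained sketch (not the paper's entanglement model) showing the same string viewed at character, subword, and word level. The tiny vocabulary and greedy longest-match segmenter are hypothetical stand-ins for a real subword scheme such as BPE or WordPiece.

```python
# Toy illustration of the three text granularities mentioned in the abstract.
# The vocabulary and segmenter below are illustrative only.

def toy_subword_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # character-level fallback for unseen pieces
            i += 1
    return pieces


text = "unbreakable tokenization"
vocab = {"un", "break", "able", "token", "ization"}  # hypothetical subword vocab

char_tokens = list(text)          # character granularity
word_tokens = text.split()        # word granularity
subword_tokens = [p for w in word_tokens for p in toy_subword_tokenize(w, vocab)]

print(char_tokens)     # ['u', 'n', 'b', 'r', ...]
print(word_tokens)     # ['unbreakable', 'tokenization']
print(subword_tokens)  # ['un', 'break', 'able', 'token', 'ization']
```

Each view exposes different information (spelling vs. morphology vs. whole-word identity), which is the motivation the abstract gives for combining character and subword language models.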

