CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. (arXiv:2103.06874v4 [cs.CL] UPDATED)
cs.LG updates on arXiv.org
Pipelined NLP systems have largely been superseded by end-to-end neural
modeling, yet nearly all commonly-used models still require an explicit
tokenization step. While recent tokenization approaches based on data-derived
subword lexicons are less brittle than manually engineered tokenizers, these
techniques are not equally suited to all languages, and the use of any fixed
vocabulary may limit a model's ability to adapt. In this paper, we present
CANINE, a neural encoder that operates directly on character sequences, without
explicit tokenization or …
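To illustrate what "operating directly on character sequences" can look like in practice, here is a minimal sketch (an assumption for illustration, not the paper's actual pipeline): a tokenization-free model can take raw Unicode code points as input ids, with no learned vocabulary at all.

```python
# Sketch: map text to a sequence of Unicode code points, the kind of
# vocabulary-free input a character-level encoder like CANINE consumes.
# Illustrative only; the paper's real model adds hashing and downsampling.

def to_codepoints(text: str) -> list[int]:
    """Return each character's Unicode code point; no tokenizer or lexicon needed."""
    return [ord(ch) for ch in text]

ids = to_codepoints("CANINE")
print(ids)  # [67, 65, 78, 73, 78, 69]
```

Because every string maps deterministically to code points, this input scheme works identically across languages and scripts, sidestepping the fixed-vocabulary limitation the abstract describes.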