June 11, 2024, 4:42 a.m. | Avihu Dekel, Raul Fernandez


arXiv:2406.05547v1 Announce Type: cross
Abstract: Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three …
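To make the idea of merging base units into larger, variable-rate units concrete, here is a minimal sketch of byte-pair-encoding-style merging applied to phoneme sequences. It is an illustration only: the excerpt above does not say which tokenization algorithm the paper uses, and the toy corpus, the "+" joiner, and the function names (most_frequent_pair, merge_pair, learn_merges) are all assumptions made for this example. The same code would apply to DAUs if each unit were written as a string ID.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent pair of units, or None if there are no pairs."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Break ties on the pair itself so the result is deterministic.
    return max(counts, key=lambda pair: (counts[pair], pair))


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged unit."""
    merged = pair[0] + "+" + pair[1]  # "+" joiner is an arbitrary choice for readability
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Greedily learn up to `num_merges` merges; return the merges and the re-tokenized corpus."""
    sequences = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


# Hypothetical toy corpus of phoneme sequences (not data from the paper).
corpus = [
    ["DH", "AH", "K", "AE", "T"],
    ["DH", "AH", "D", "AO", "G"],
    ["DH", "AH", "K", "AE", "B"],
]
merges, tokenized = learn_merges(corpus, num_merges=2)
```

Each merge shortens the sequences in which the pair occurs, so frequent unit combinations end up as single tokens while rare ones stay at the base-unit rate.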


