Web: http://arxiv.org/abs/2206.11332

June 24, 2022, 1:12 a.m. | Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

Source: cs.CL updates on arXiv.org

Finding word boundaries in continuous speech is challenging as there is
little or no equivalent of a 'space' delimiter between words. Popular Bayesian
non-parametric models for text segmentation use a Dirichlet process to jointly
segment sentences and build a lexicon of word types. We introduce DP-Parse,
which uses similar principles but only relies on an instance lexicon of word
tokens, avoiding the clustering errors that arise with a lexicon of word types.
On the Zero Resource Speech Benchmark 2017, our …
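
To illustrate the kind of Dirichlet process model over a lexicon of word types that the abstract refers to for text segmentation, here is a minimal sketch of how such a model scores a candidate segmentation. It is not DP-Parse and not the authors' code. The function names, the concentration parameter `alpha`, and the base distribution (geometric word length, uniform over `n_symbols` characters) are all assumptions chosen for illustration.

```python
from collections import Counter


def crp_word_prob(word, counts, total, alpha=1.0, n_symbols=26, p_stop=0.5):
    """Predictive probability of `word` given the words generated so far.

    Assumed base distribution: geometric word length with stop
    probability `p_stop`, and characters drawn uniformly from an
    alphabet of `n_symbols` symbols.
    """
    base = p_stop * (1 - p_stop) ** (len(word) - 1) * (1.0 / n_symbols) ** len(word)
    # Chinese-restaurant-process rule: reuse a known word type in
    # proportion to its count, or fall back to the base distribution
    # with weight alpha.
    return (counts[word] + alpha * base) / (total + alpha)


def segmentation_prob(words, alpha=1.0):
    """Probability of a word sequence, growing the lexicon word by word."""
    counts, total, prob = Counter(), 0, 1.0
    for w in words:
        prob *= crp_word_prob(w, counts, total, alpha)
        counts[w] += 1
        total += 1
    return prob


if __name__ == "__main__":
    # Compare a sequence that reuses word types with one that does not.
    print(segmentation_prob(["the", "dog", "the", "dog"]))
    print(segmentation_prob(["the", "dog", "cat", "pig"]))
```

In this sketch, a sequence that reuses entries already in the lexicon scores higher than one of the same shape made only of novel words, which is the pressure that drives joint segmentation and lexicon building. A full segmenter would also need a search or sampling procedure over boundary positions, which is omitted here.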


