Web: http://arxiv.org/abs/2206.11332

June 24, 2022, 1:12 a.m. | Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

cs.CL updates on arXiv.org

Finding word boundaries in continuous speech is challenging as there is
little or no equivalent of a 'space' delimiter between words. Popular Bayesian
non-parametric models for text segmentation use a Dirichlet process to jointly
segment sentences and build a lexicon of word types. We introduce DP-Parse,
which uses similar principles but only relies on an instance lexicon of word
tokens, avoiding the clustering errors that arise with a lexicon of word types.
On the Zero Resource Speech Benchmark 2017, our …
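For readers unfamiliar with the Dirichlet process approach to text segmentation mentioned above, the following is a minimal sketch of the standard unigram predictive rule with a lexicon of word types. It is not DP-Parse and is not taken from the paper. The function names, the concentration parameter `alpha`, and the base distribution (geometric word length, uniform over 26 symbols) are all assumptions chosen for illustration.

```python
from collections import Counter


def base_prob(word, n_symbols=26, p_stop=0.5):
    """Assumed base distribution P0: geometric length, uniform symbols."""
    length_prob = p_stop * (1 - p_stop) ** (len(word) - 1)
    return length_prob * (1.0 / n_symbols) ** len(word)


def crp_word_prob(word, counts, total, alpha=1.0):
    """Predictive probability of `word` under a Dirichlet process
    (Chinese restaurant process view), given counts of earlier words."""
    return (counts[word] + alpha * base_prob(word)) / (total + alpha)


def segmentation_prob(words, alpha=1.0):
    """Probability of a candidate segmentation, scoring words left to
    right and adding each one to the lexicon after it is scored."""
    counts = Counter()
    total = 0
    prob = 1.0
    for w in words:
        prob *= crp_word_prob(w, counts, total, alpha)
        counts[w] += 1
        total += 1
    return prob


# A segmentation that reuses a word already in the lexicon scores higher
# than one that introduces a new word of the same length.
reused = segmentation_prob(["ab", "ab"])
novel = segmentation_prob(["ab", "cd"])
print(reused > novel)
```

The "rich get richer" term `counts[word]` is what ties segmentation to lexicon building in this family of models: a candidate word becomes more probable each time it has already been used. This sketch omits sentence-boundary probabilities and any sampling procedure.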

