June 11, 2024, 4:42 a.m. | Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

cs.CL updates on arXiv.org arxiv.org

arXiv:2406.06371v1 Announce Type: new
Abstract: We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment over the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations and with only 95M parameters, mHuBERT-147 outperforms larger models trained on substantially more data. We rank …
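
The label-assignment step mentioned in the abstract can be pictured as a nearest-centroid lookup. In HuBERT-style training, each frame feature is mapped to its closest k-means centroid, and that index becomes the training target. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses pure Python with a brute-force search where the paper uses faiss, and the function name `assign_labels` and the toy values are hypothetical.

```python
from math import inf


def assign_labels(frames, centroids):
    """Return, for each frame, the index of its nearest centroid (squared L2).

    Brute-force stand-in for an indexed nearest-neighbour search; names and
    data are illustrative assumptions, not taken from the paper.
    """
    labels = []
    for frame in frames:
        best_index, best_dist = -1, inf
        for index, centroid in enumerate(centroids):
            # Squared Euclidean distance between the frame and this centroid.
            dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
            if dist < best_dist:
                best_index, best_dist = index, dist
        labels.append(best_index)
    return labels


# Toy data: three 2-D "frames" and two centroids (illustrative values only).
centroids = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.4]]
labels = assign_labels(frames, centroids)
print(labels)
```

The brute-force loop costs one distance computation per frame per centroid, which is the kind of cost an indexed search library is meant to reduce at scale.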

