March 22, 2024, 4:46 a.m. | Samuel Pegg, Kai Li, Xiaolin Hu

cs.CV updates on arXiv.org

arXiv:2309.17189v4 Announce Type: replace-cross
Abstract: Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger, more computationally intensive models to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network …
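The abstract's distinction between time-domain and time-frequency-domain separation can be illustrated with a minimal sketch. Time-frequency methods operate on the complex STFT of the waveform rather than on raw samples, predicting a mask over the frequency-by-time grid and inverting the result. The sketch below is purely illustrative and is not the paper's RTFS-Net architecture; the signal, mask, and STFT parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

# Illustrative time-frequency pipeline (not the paper's model):
# 1) transform the mixture to the T-F domain, 2) apply a mask,
# 3) invert back to the time domain.
fs = 16000                                 # 16 kHz sampling, common for speech
t = np.arange(fs) / fs                     # 1 second of audio
mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 330 * t)

# Forward STFT: complex spectrogram of shape (freq bins, time frames)
f, frames, Z = stft(mixture, fs=fs, nperseg=512)

# A separation network would predict this mask from audio-visual
# features; here an identity mask stands in as a placeholder.
mask = np.ones_like(Z)
separated = Z * mask

# Inverse STFT reconstructs the time-domain estimate
_, recon = istft(separated, fs=fs, nperseg=512)
```

With the default Hann window and 50% overlap, the STFT/ISTFT pair satisfies the constant-overlap-add condition, so the identity mask reconstructs the mixture nearly exactly; a real model replaces the mask with learned, per-source estimates.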

