Feb. 8, 2024, 5:46 a.m. | Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu

cs.CL updates on arXiv.org

Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), much less ground-truth clean data is available: most audio-visual datasets are collected in real-world environments with background noise and reverberation, which hampers the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based AVSE approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual …

cs.cl cs.sd eess.as

