Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT. (arXiv:2205.07180v2 [eess.AS] UPDATED)
July 18, 2022, 1:12 a.m. | Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
cs.CV updates on arXiv.org arxiv.org
This paper investigates self-supervised pre-training for audio-visual speaker
representation learning where a visual stream showing the speaker's mouth area
is used alongside speech as inputs. Our study focuses on the Audio-Visual
Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose
audio-visual speech pre-training framework. We conducted extensive experiments
probing the effectiveness of pre-training and visual modality. Experimental
results suggest that AV-HuBERT generalizes decently to speaker-related
downstream tasks, improving label efficiency by roughly tenfold for both
audio-only and audio-visual …