Web: http://arxiv.org/abs/2206.07684

June 16, 2022, 1:13 a.m. | Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

cs.CV updates on arXiv.org arxiv.org

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR
that incorporates visual cues, often from the movements of a speaker's mouth.
Unlike works that focus solely on lip motion, we investigate the
contribution of entire visual frames (visual actions, objects, background,
etc.). This is particularly useful for unconstrained videos, where the speaker
is not necessarily visible. To solve this task, we propose a new
sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR), which is trained
end-to-end from spectrograms and full-frame RGB. …

