Web: http://arxiv.org/abs/2201.10439

Sept. 15, 2022, 1:13 a.m. | Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

cs.CV updates on arXiv.org

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition
by introducing the video modality as an additional source of information. In
this work, the information contained in the motion of the speaker's mouth is
used to augment the audio features. The video modality is traditionally
processed with a 3D convolutional neural network (e.g., a 3D version of VGG).
Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability
to extract rich visual features for image classification tasks. Here, we
propose to replace the 3D …


