May 16, 2024, 4:42 a.m. | Samir Sadok, Simon Leglaive, Renaud Séguier

arXiv:2305.03568v2 Announce Type: replace-cross
Abstract: The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned …

