Web: http://arxiv.org/abs/2110.06650

May 5, 2022, 1:12 a.m. | Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, Stephan Huber, Florian Eyben, Björn W. Schuller

cs.LG updates on arXiv.org

In this contribution, we investigate the effectiveness of deep fusion of text
and audio features for categorical and dimensional speech emotion recognition
(SER). We propose a novel, multistage fusion method where the two information
streams are integrated in several layers of a deep neural network (DNN), and
contrast it with a single-stage method where the streams are merged at a single
point. Both methods depend on extracting summary linguistic embeddings from a
pre-trained BERT model, and conditioning one or more …
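
To make the contrast between the two fusion schemes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract is truncated before it describes the conditioning mechanism, so the sketch assumes simple concatenation of a fixed-size text embedding with the audio stream. All names (`single_stage`, `multistage`, `demo`), the layer sizes, and the three-value output are hypothetical, and the random vectors stand in for real audio features and BERT embeddings.

```python
import random

def linear(x, w, b):
    """Affine map; w is a list of rows (n_out x n_in), x and b are vectors."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init(rng, n_out, n_in):
    """Small random weights and zero biases for one layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def single_stage(audio, text, layers, head):
    """Audio-only hidden layers; the text embedding is merged once, before the head."""
    h = audio
    for w, b in layers:
        h = relu(linear(h, w, b))
    return linear(h + text, *head)  # list concatenation = the single fusion point

def multistage(audio, text, layers, head):
    """The text embedding is concatenated to the input of every hidden layer."""
    h = audio
    for w, b in layers:
        h = relu(linear(h + text, w, b))  # fusion repeated at each stage
    return linear(h, *head)

def demo(seed=0):
    rng = random.Random(seed)
    A, T, H, OUT = 8, 4, 6, 3  # hypothetical sizes: audio, text, hidden, output
    audio = [rng.random() for _ in range(A)]  # stand-in for audio features
    text = [rng.random() for _ in range(T)]   # stand-in for a summary text embedding
    single_layers = [init(rng, H, A), init(rng, H, H)]
    single_head = init(rng, OUT, H + T)
    multi_layers = [init(rng, H, A + T), init(rng, H, H + T)]
    multi_head = init(rng, OUT, H)
    return (single_stage(audio, text, single_layers, single_head),
            multistage(audio, text, multi_layers, multi_head))

if __name__ == "__main__":
    single_out, multi_out = demo()
    print(len(single_out), len(multi_out))
```

The only structural difference between the two functions is where `h + text` appears: once before the output head in the single-stage variant, and at the input of every hidden layer in the multistage variant.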


