Jan. 26, 2022 | Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang

The recurrent neural network transducer (RNN-T) has recently become the
mainstream end-to-end approach for streaming automatic speech recognition
(ASR). To estimate the output distributions over subword units, RNN-T uses a
fully connected layer as the joint network, which fuses the acoustic
representations extracted by the acoustic encoder with the text representations
that the prediction network derives from the previous subword units. In this
paper, we propose to use gating, bilinear pooling, and a combination of the two in
the joint …
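
To make the three fusion styles named above concrete, here is a minimal pure-Python sketch of a joint network for a single (frame, label) position. It is an illustration under stated assumptions, not the paper's formulation, which the truncated abstract does not give: the baseline is written as a sum of projected features followed by tanh, the gate as an elementwise sigmoid over the concatenated inputs, and the bilinear variant as a low-rank (Hadamard product) form. All names (`make_params`, `joint`, `W_enc`, `W_pred`, `W_gate`, `W_out`) and all dimensions are hypothetical, and the weights are random stand-ins for trained parameters.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_params(enc_dim, pred_dim, hidden, vocab, seed=0):
    """Random weights standing in for trained parameters (hypothetical shapes)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_enc": mat(hidden, enc_dim),              # projects acoustic features
        "W_pred": mat(hidden, pred_dim),            # projects text features
        "W_gate": mat(hidden, enc_dim + pred_dim),  # gate over [f; g]
        "W_out": mat(vocab, hidden),                # maps fused vector to logits
    }


def joint(f, g, params, mode="additive"):
    """Fuse encoder output f and prediction-network output g into a
    distribution over subword units (assumed forms, see lead-in)."""
    a = matvec(params["W_enc"], f)
    t = matvec(params["W_pred"], g)
    if mode == "additive":
        # Baseline: one fully connected layer on [f; g], then tanh.
        h = [math.tanh(ai + ti) for ai, ti in zip(a, t)]
    elif mode == "gated":
        # Gating: an elementwise gate interpolates between the two streams.
        gates = [sigmoid(v) for v in matvec(params["W_gate"], f + g)]
        h = [
            gt * math.tanh(ai) + (1.0 - gt) * math.tanh(ti)
            for gt, ai, ti in zip(gates, a, t)
        ]
    elif mode == "bilinear":
        # Low-rank bilinear pooling: product of the two projections.
        h = [math.tanh(ai) * math.tanh(ti) for ai, ti in zip(a, t)]
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return softmax(matvec(params["W_out"], h))


if __name__ == "__main__":
    params = make_params(enc_dim=4, pred_dim=3, hidden=5, vocab=6)
    f = [0.5, -1.0, 0.25, 2.0]  # toy acoustic representation
    g = [1.0, 0.0, -0.5]        # toy text representation
    for mode in ("additive", "gated", "bilinear"):
        probs = joint(f, g, params, mode)
        print(mode, [round(p, 4) for p in probs])
```

In a full RNN-T this function would be evaluated for every pair of acoustic frame and label position to build the output lattice; the sketch handles one pair only so that the difference between the fusion rules stays visible.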

