Multi-Modal Pre-Training for Automated Speech Recognition. (arXiv:2110.09890v2 [eess.AS] UPDATED)
cs.LG updates on arXiv.org
Traditionally, research in automated speech recognition has focused on
local-first encoding of audio representations to predict the spoken phonemes in
an utterance. Unfortunately, approaches relying on such hyper-local information
tend to be vulnerable to both local-level corruption (such as audio-frame drops
or loud noises) and global-level noise (such as environmental or background
noise) that has not been seen during training. In this work, we
introduce a novel approach that leverages a self-supervised learning technique
based on masked language modeling …
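To illustrate the masked-modeling idea mentioned above as applied to audio, here is a minimal sketch of how contiguous spans of feature frames might be masked before pre-training. All function names, parameters (e.g. `mask_prob`, `span`), and the zero-embedding replacement are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: span masking of audio feature frames, in the spirit of
# masked language modeling. Assumptions, not the paper's method.
import numpy as np

def mask_frames(frames, mask_prob=0.15, span=5, seed=0):
    """Zero out random contiguous spans of frames.

    frames: (T, D) array of audio feature frames.
    Returns the masked copy and a boolean mask of masked positions.
    """
    rng = np.random.default_rng(seed)
    T = frames.shape[0]
    mask = np.zeros(T, dtype=bool)
    # Each frame is a candidate span start with probability mask_prob.
    starts = rng.random(T) < mask_prob
    for t in np.flatnonzero(starts):
        mask[t:t + span] = True  # spans may overlap or clip at the end
    masked = frames.copy()
    masked[mask] = 0.0  # replace masked frames with a zero embedding
    return masked, mask

# A pre-training objective would then ask the model to reconstruct
# (or contrastively identify) the original content at masked positions.
frames = np.random.default_rng(1).standard_normal((100, 80))
masked, mask = mask_frames(frames)
```

The model never sees the masked frames, so it must infer them from surrounding context, which is what encourages representations that go beyond hyper-local information.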