Web: http://arxiv.org/abs/2206.07882

June 17, 2022, 1:12 a.m. | Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

cs.CL updates on arXiv.org

We report on aggressive quantization strategies that greatly accelerate
inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit
integer representation for both weights and activations and apply Quantization
Aware Training (QAT) to retrain the full model (acoustic encoder and language
model) and achieve near-iso-accuracy. We show that customized quantization
schemes that are tailored to the local properties of the network are essential
to achieve good performance while limiting the computational overhead of QAT.

Density ratio Language Model …
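As a rough illustration of what a 4-bit integer representation involves, the sketch below applies symmetric "fake quantization" to a small tensor: values are rounded to one of 16 signed integer levels and mapped back to floating point, which is the kind of operation QAT typically places in the forward pass. This is not the paper's scheme. The function names, the per-tensor max-abs scale, and the sample values are all assumptions chosen for illustration, and the customized, locally tailored quantizers the abstract refers to are not reproduced here.

```python
import numpy as np

def max_abs_scale(x):
    """Hypothetical per-tensor scale: map the largest magnitude to level 7."""
    max_abs = np.max(np.abs(x))
    return max_abs / 7.0 if max_abs > 0 else 1.0

def fake_quantize_int4(x, scale):
    """Round to signed 4-bit levels in [-8, 7], then map back to floats."""
    levels = np.clip(np.round(x / scale), -8, 7)
    return levels * scale

# Illustrative values only, not taken from the paper.
weights = np.array([-0.9, -0.31, 0.0, 0.12, 0.7])
scale = max_abs_scale(weights)
quantized = fake_quantize_int4(weights, scale)

# Recover the integer levels that a 4-bit kernel would actually store.
levels = np.round(quantized / scale).astype(int)
print(levels)
print(quantized)
```

During training, rounding has zero gradient almost everywhere, so QAT implementations commonly pass gradients through the rounding step unchanged (a straight-through estimator). Whether the paper does exactly this is not stated in the excerpt above.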