Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization. (arXiv:2206.07882v1 [cs.CL])
Web: http://arxiv.org/abs/2206.07882
cs.CL updates on arXiv.org
We report on aggressive quantization strategies that greatly accelerate
inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit
integer representation for both weights and activations and apply
Quantization-Aware Training (QAT) to retrain the full model (acoustic encoder
and language model), achieving near-iso-accuracy. We show that customized
quantization schemes tailored to the local properties of the network are
essential to achieving good performance while limiting the computational
overhead of QAT.
Density ratio Language Model …
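
The abstract does not include code, but as a rough illustration of the kind of 4-bit QAT it describes, here is a minimal PyTorch sketch of symmetric 4-bit fake quantization with a straight-through estimator. The function name, the per-channel option (a loose stand-in for the paper's locally tailored quantization schemes), and all parameters are assumptions for illustration, not the authors' implementation.

```python
import torch

def fake_quant_4bit(x: torch.Tensor, per_channel: bool = False) -> torch.Tensor:
    """Symmetric 4-bit fake quantization with a straight-through estimator.

    Values are mapped into the signed 4-bit integer range [-8, 7] and back,
    so the forward pass sees quantization error while gradients flow through
    unchanged (the standard QAT trick). This is an illustrative sketch, not
    the paper's scheme.
    """
    qmin, qmax = -8, 7
    if per_channel and x.dim() > 1:
        # One scale per output channel (dim 0); a hypothetical stand-in for
        # quantization tailored to local properties of the network.
        flat = x.detach().abs().reshape(x.shape[0], -1)
        max_abs = flat.amax(dim=1).reshape(-1, *([1] * (x.dim() - 1)))
    else:
        # Per-tensor scale, e.g. for activations.
        max_abs = x.detach().abs().max()
    scale = (max_abs / qmax).clamp(min=1e-8)
    q = (x / scale).round().clamp(qmin, qmax)   # simulated 4-bit integers
    x_q = q * scale                              # dequantize back to float
    # Straight-through estimator: forward uses x_q, gradient passes as identity.
    return x + (x_q - x).detach()

# Example: quantize a linear layer's weights and its input activations.
w = torch.randn(16, 32, requires_grad=True)
x = torch.randn(4, 32)
y = fake_quant_4bit(x) @ fake_quant_4bit(w, per_channel=True).t()
y.sum().backward()      # gradients reach w via the straight-through path
print(w.grad.shape)     # torch.Size([16, 32])
```

In a full RNN-T retraining setup such a fake-quantization step would wrap every weight and activation in both the acoustic encoder and the language model, which is what lets QAT recover near-iso-accuracy at 4 bits.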