Patch-wise Mixed-Precision Quantization of Vision Transformer. (arXiv:2305.06559v1 [cs.CV])
cs.CV updates on arXiv.org
As emerging hardware begins to support mixed bit-width arithmetic
computation, mixed-precision quantization is widely used to reduce the
complexity of neural networks. However, Vision Transformers (ViTs) rely on
complex self-attention computation to learn powerful feature representations,
which makes mixed-precision quantization of ViTs challenging. In this paper,
we propose a novel patch-wise mixed-precision
quantization (PMQ) for efficient inference of ViTs. Specifically, we design a
lightweight global metric, which is faster than existing methods, to measure
the sensitivity of …
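For readers unfamiliar with the idea, the sketch below illustrates patch-wise mixed-precision quantization in general terms: each patch (token) is assigned a bit-width from a small budget according to a sensitivity score, and then quantized at that precision. The variance-based `patch_sensitivity` proxy, the `quantize` helper, and the 4/8-bit budget are illustrative assumptions only and do not reproduce the paper's actual PMQ metric or procedure.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def patch_sensitivity(patches):
    """Hypothetical per-patch sensitivity proxy: variance of each patch's
    features (a stand-in for the paper's lightweight global metric)."""
    return patches.var(axis=1)

def patchwise_mixed_precision(patches, bit_budget=(4, 8)):
    """Give the more sensitive half of the patches the higher bit-width,
    the rest the lower one, then quantize each patch accordingly."""
    low, high = bit_budget
    sens = patch_sensitivity(patches)
    threshold = np.median(sens)
    out = np.empty_like(patches)
    for i, patch in enumerate(patches):
        bits = high if sens[i] >= threshold else low
        out[i] = quantize(patch, bits)
    return out

# Example: 196 patch tokens of a ViT, each with 768 features
patches = np.random.randn(196, 768).astype(np.float32)
quantized = patchwise_mixed_precision(patches)
```

In practice, the interesting part is how the sensitivity score is computed; the abstract states that the paper's global metric is faster to evaluate than existing sensitivity measures, whereas the variance proxy above is purely for illustration.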