Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention. (arXiv:2204.03479v1 [cs.CL])
cs.LG updates on arXiv.org
Multi-head self-attention forms the core of Transformer networks. However,
its complexity grows quadratically with the input sequence length, which
impedes deployment on resource-constrained edge devices. We address this
challenge by proposing a dynamic pruning method that exploits the temporal
stability of data across tokens to reduce inference cost. The threshold-based
method retains only significant differences between subsequent tokens,
effectively reducing the number of multiply-accumulate operations as well as
the internal tensor data sizes. The approach is evaluated on …
arxiv attention delta edge head self-attention transformer transformers
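To make the delta idea concrete, here is a minimal sketch of threshold-based delta pruning for a single linear projection (e.g. the query projection inside self-attention), written against the abstract only: the threshold value, shapes, and function names are illustrative assumptions, not taken from the paper. Each token is projected by updating the previous output with only the input deltas whose magnitude crosses the threshold, so the multiply-accumulates for unchanged features are skipped.

```python
# Sketch of threshold-based delta pruning across a token stream.
# Assumptions (not from the paper): single linear projection, fixed
# scalar threshold, per-feature "committed" reference state.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_head = 64, 64
W = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)  # projection weights

threshold = 0.05           # prune deltas with magnitude below this value
x_ref = np.zeros(d_model)  # last committed input state, per feature
y_ref = np.zeros(d_head)   # output corresponding to the committed state

def project_delta(x_t):
    """Project token x_t by updating the previous output with only the
    significant input deltas, skipping MACs for the remaining features."""
    global x_ref, y_ref
    delta = x_t - x_ref
    keep = np.abs(delta) >= threshold            # mask of significant features
    # Only kept features contribute MACs: y_t = y_ref + delta[keep] @ W[keep]
    y_t = y_ref + delta[keep] @ W[keep]
    # Commit only the kept features, so small changes accumulate until they
    # eventually cross the threshold instead of being dropped forever.
    x_ref = np.where(keep, x_t, x_ref)
    y_ref = y_t
    return y_t, int(keep.sum())

# Feed a slowly varying (temporally stable) token sequence through it.
tokens = np.cumsum(0.02 * rng.standard_normal((16, d_model)), axis=0)
for t, x_t in enumerate(tokens):
    y_t, n_macs = project_delta(x_t)
    dense = x_t @ W
    print(f"t={t:2d}  kept {n_macs:2d}/{d_model} features, "
          f"max abs error vs dense = {np.max(np.abs(y_t - dense)):.4f}")
```

Committing only the features that crossed the threshold keeps the approximation error bounded by the threshold per feature; how the actual method propagates these deltas through the full multi-head attention block is detailed in the paper itself.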