SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design. (arXiv:2401.16456v1 [cs.CV])
cs.CV updates on arXiv.org
Recently, efficient Vision Transformers have shown great performance with low
latency on resource-constrained devices. Conventionally, they use 4x4 patch
embeddings and a 4-stage structure at the macro level, while utilizing
sophisticated attention with a multi-head configuration at the micro level. This
paper aims to address computational redundancy at all design levels in a
memory-efficient manner. We discover that using a larger-stride patchify stem not
only reduces memory access costs but also achieves competitive performance by
leveraging token representations with reduced spatial redundancy …
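To see why a larger-stride patchify stem cuts memory access costs, it helps to count tokens: stride determines how many patch tokens the transformer must attend over, and attention cost grows with that count. A minimal sketch, assuming a 224x224 input and a 16x16 stride as the illustrative "larger" choice (the exact stride is not specified in this excerpt):

```python
def token_count(image_size: int, patch_stride: int) -> int:
    # Non-overlapping patchify: tokens per axis, squared.
    return (image_size // patch_stride) ** 2

# Conventional 4x4 patch embedding on a 224x224 input:
tokens_4x4 = token_count(224, 4)     # 56 * 56 = 3136 tokens
# Illustrative larger-stride 16x16 patchify stem:
tokens_16x16 = token_count(224, 16)  # 14 * 14 = 196 tokens

print(tokens_4x4, tokens_16x16, tokens_4x4 // tokens_16x16)
```

Quadrupling the stride shrinks the token grid 16x, which in turn shrinks attention's quadratic cost and the memory traffic for reading and writing token representations.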