April 8, 2022, 1:11 a.m. | Zuzana Jelčicová, Marian Verhelst

cs.LG updates on arXiv.org arxiv.org

Multi-head self-attention forms the core of Transformer networks. However, its
complexity grows quadratically with the input sequence length, which impedes
deployment on resource-constrained edge devices. We
address this challenge by proposing a dynamic pruning method, which exploits
the temporal stability of data across tokens to reduce inference cost. The
threshold-based method retains only the significant differences between
consecutive tokens, effectively reducing the number of multiply-accumulates, as
well as the internal tensor data sizes. The approach is evaluated on …
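Only the abstract is reproduced here, so the paper's exact thresholding rule and the tensors it is applied to are not specified. The sketch below is a generic illustration of the delta-processing idea described above: differences between consecutive token activations are pruned against a threshold, and a linear projection is then updated using only the surviving deltas, skipping multiply-accumulates for the pruned entries. All function names and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def delta_threshold_prune(x_ref, x_curr, threshold):
    """Keep only significant changes between consecutive token activations.

    Elements whose change relative to the reference (previous effective
    input) is below `threshold` are treated as unchanged, so downstream
    multiply-accumulates for those positions can be skipped.
    Names and signature are hypothetical, for illustration only.
    """
    delta = x_curr - x_ref
    mask = np.abs(delta) >= threshold            # significant changes only
    sparse_delta = np.where(mask, delta, 0.0)    # prune small deltas
    x_effective = x_ref + sparse_delta           # input actually processed
    return sparse_delta, x_effective, mask

def delta_projection_update(y_prev, W, sparse_delta):
    """Incrementally update a linear projection from the pruned delta.

    Since y = W @ x = W @ (x_ref + delta) = y_prev + W @ delta,
    only the nonzero delta entries contribute new multiply-accumulates.
    """
    return y_prev + W @ sparse_delta

# Toy usage: temporally stable inputs, so most deltas fall below the threshold.
rng = np.random.default_rng(0)
d_model = 8
W = rng.standard_normal((d_model, d_model))

x_ref = rng.standard_normal(d_model)                     # previous token (reference)
x_curr = x_ref + 0.01 * rng.standard_normal(d_model)     # new token, small changes

y_prev = W @ x_ref
sparse_delta, x_eff, mask = delta_threshold_prune(x_ref, x_curr, threshold=0.05)
y_curr = delta_projection_update(y_prev, W, sparse_delta)
print("retained elements:", int(mask.sum()), "of", d_model)
```

Note that the reference state carried to the next token should be the effective input `x_eff` (and the corresponding output `y_curr`), so that pruning errors do not silently accumulate against the true activations.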

arxiv attention delta edge head self-attention transformer transformers
