Nov. 21, 2023, 4:29 a.m. | /u/APaperADay

Machine Learning

**Paper**: []()

**Code**: []()


>This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of …
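The substitution the abstract describes can be sketched in miniature: treat a fixed scaled dot-product self-attention layer as the "teacher" and train a shallow feed-forward "student" to reproduce its outputs via a distillation-style regression loss. This is a hedged toy illustration, not the paper's actual setup: the dimensions, the single-hidden-layer student, the plain MSE objective, and the hand-rolled SGD are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, hidden = 4, 8, 64  # toy sequence length, model dim, student width (assumed)

def attention(X):
    # "teacher": scaled dot-product self-attention, no learned projections
    scores = X @ X.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

# student: one-hidden-layer feed-forward net on the flattened sequence
W1 = rng.normal(0, 0.1, (L * d, hidden))
W2 = rng.normal(0, 0.1, (hidden, L * d))

def student(x):
    h = np.maximum(0.0, x @ W1)  # ReLU hidden layer
    return h @ W2, h

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

lr = 0.05
losses = []
for step in range(2000):
    X = rng.normal(size=(L, d))
    x, y = X.reshape(-1), attention(X).reshape(-1)  # distill teacher output
    pred, h = student(x)
    losses.append(mse(pred, y))
    # hand-computed gradients of the MSE loss
    g = 2.0 * (pred - y) / pred.size
    gW2 = np.outer(h, g)
    gh = (g @ W2.T) * (h > 0)
    gW1 = np.outer(x, gh)
    W1 -= lr * gW1
    W2 -= lr * gW2

early, late = np.mean(losses[:100]), np.mean(losses[-100:])
print(early, late)
```

Because the student sees the whole flattened sequence at once, it can only mimic attention for a fixed sequence length; handling variable lengths is one of the practical limitations such a substitution has to confront.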

