Nov. 13, 2023, 7:27 p.m. | /u/APaperADay

Machine Learning | www.reddit.com

**Paper**: [https://arxiv.org/abs/2311.01906](https://arxiv.org/abs/2311.01906)

**GitHub**: [https://github.com/bobby-he/simplified\_transformers](https://github.com/bobby-he/simplified_transformers)

**Abstract**:

>A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable.
>In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical …
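
For context on the arrangement the abstract describes, here is a minimal sketch of a standard pre-LN transformer block in PyTorch: attention and MLP sub-blocks, each preceded by normalisation and wrapped in a skip connection. This is the conventional baseline the paper sets out to simplify, not the paper's proposed block; the layer sizes and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Standard pre-LN transformer block: attention and MLP sub-blocks,
    each preceded by LayerNorm and wrapped in a skip connection."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block with skip connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block with skip connection
        x = x + self.mlp(self.norm2(x))
        return x


# Example: one block applied to a small batch of token embeddings
block = PreLNTransformerBlock(d_model=64, n_heads=4, d_ff=256)
tokens = torch.randn(2, 10, 64)  # (batch, sequence, d_model)
out = block(tokens)
print(out.shape)  # torch.Size([2, 10, 64])
```

The paper's question is which of these components (skip connections, normalisation layers, and parts of the attention sub-block) can be removed without hurting trainability; see the linked paper and repository for the simplified variants.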

