Simplifying Transformer Blocks
June 3, 2024, 4:44 a.m. | Bobby He, Thomas Hofmann
cs.LG updates on arXiv.org
Abstract: A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections and normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed or render models untrainable.
In this work, we ask: to what extent can the standard transformer block be simplified? Combining signal propagation theory and empirical …
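For reference, here is a minimal sketch (in PyTorch, not the authors' code) of the standard pre-LN transformer block the abstract describes: attention and MLP sub-blocks, each wrapped in a skip connection and a normalisation layer. Module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class StandardTransformerBlock(nn.Module):
    """Illustrative standard pre-LN block: the baseline the paper simplifies."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_mlp: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp),
            nn.GELU(),
            nn.Linear(d_mlp, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block: normalise, self-attend, add skip connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block: normalise, transform, add skip connection.
        return x + self.mlp(self.norm2(x))

The skip connections and normalisation layers visible here are among the components whose removal the paper investigates.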