[D] Why do GLUs (Gated Linear Units) work?
March 4, 2024, 5:18 p.m. | /u/cofapie
r/MachineLearning · www.reddit.com
The paper "GLU Variants Improve Transformer" ([https://arxiv.org/pdf/2002.05202.pdf](https://arxiv.org/pdf/2002.05202.pdf)) just says: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
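For anyone who hasn't seen the construction, here is a minimal PyTorch sketch of the SwiGLU variant from that paper. The formula FFN_SwiGLU(x) = (Swish(xW) ⊙ xV) W2 is from the paper; the module name `SwiGLUFeedForward` and the `d_model`/`d_ff` parameter names are my own labels, not from the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Transformer FFN using the SwiGLU variant:
        FFN_SwiGLU(x) = (Swish(x W) * (x V)) W2
    An element-wise product of an activated projection and a linear
    "gate" projection replaces the usual max(0, x W1) W2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # activated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # linear gate branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1, as used in the paper
        return self.w2(F.silu(self.w(x)) * self.v(x))
```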
I also found the explanation in the original GLU paper unsatisfactory. They said the gating gives a cleaner gradient, but I thought that vanishing-gradient issue was already solved by residual connections.
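For context, the "cleaner gradient" claim refers, as I recall it, to the gradient decomposition in the original paper (Dauphin et al., 2016, "Language Modeling with Gated Convolutional Networks"): for a GLU h(X) = X ⊗ σ(X),

```latex
\nabla\,[\,X \otimes \sigma(X)\,] = \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X
```

The first term passes ∇X through scaled only by σ(X), with no vanishing activation derivative in the path, which is the same linear-gradient-path property residual connections provide; hence the question.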
Does anyone have …