March 4, 2024, 5:18 p.m. | /u/cofapie

r/MachineLearning (www.reddit.com)

Today, GLU variants such as SwiGLU are widely used in the feed-forward layers of LLMs.

But the paper "GLU Variants Improve Transformer" (https://arxiv.org/pdf/2002.05202.pdf) just says: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
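For reference, here is a minimal sketch of the SwiGLU feed-forward block that paper describes, i.e. FFN_SwiGLU(x) = (Swish(xW) ⊙ xV) W2 with no biases. The class name and dimensions are just illustrative, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Illustrative SwiGLU feed-forward block: (Swish(x W) * (x V)) W2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)    # gate path
        self.v = nn.Linear(d_model, d_hidden, bias=False)    # value path
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W) gates the linear path x V elementwise
        return self.w2(F.silu(self.w(x)) * self.v(x))
```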

I also found the explanation in the original GLU paper unsatisfactory. They argue that the gating gives gradients a cleaner, more linear path, but I thought that problem was already solved by residual connections.
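For context, my rough reading of the original GLU paper's argument (Dauphin et al., "Language Modeling with Gated Convolutional Networks") is that the gated unit's gradient contains a term with no activation-derivative downscaling:

```latex
\nabla\big[\mathbf{X} \otimes \sigma(\mathbf{X})\big]
  = \nabla\mathbf{X} \otimes \sigma(\mathbf{X})
  + \mathbf{X} \otimes \sigma'(\mathbf{X})\,\nabla\mathbf{X}
```

The first term scales the gradient only by the gate σ(X), without the σ'(X) factor that a plain sigmoid or tanh unit would introduce. But a residual connection already provides an identity path for the gradient, which is why that explanation doesn't fully satisfy me.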

Does anyone have …
