Dec. 29, 2023, 11:12 a.m. | /u/alagagbar

Machine Learning www.reddit.com

According to the [GLU Variants Improve Transformer paper](https://arxiv.org/pdf/2002.05202.pdf), the best-performing gated linear unit on average is SwiGLU, the same GLU variant used in the LLaMA and PaLM architectures.
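For reference, the bias-free FFN variant from that paper (which the code below implements) can be written as:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = \left(\mathrm{Swish}_1(xW) \otimes xV\right) W_2,$$

where $\mathrm{Swish}_1(x) = x \cdot \sigma(x)$ (i.e. SiLU) and $\otimes$ is elementwise multiplication.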

In my language modeling experiments, I have been using this PaLM-like SwiGLU FFN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNSwiGLU(nn.Module):
    def __init__(self, d_model: int) -> None:
        super().__init__()
        # One projection produces both the gate and the value branch;
        # forward() chunks it into two halves of width 2 * d_model each.
        self.fc1 = nn.Linear(d_model, d_model * 4, bias=False)
        self.fc2 = nn.Linear(d_model * 2, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.fc1(x).chunk(2, dim=-1)
        x = F.silu(x1) * x2   # SwiGLU: Swish(xW) * xV
        x = self.fc2(x)       # project back to d_model (inferred completion of the truncated line)
        return x
```
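As a quick sanity check (a minimal sketch; `d_model` and the batch/sequence sizes below are arbitrary), the module leaves the model dimension unchanged:

```python
# Assumes the FFNSwiGLU class and imports defined above.
ffn = FFNSwiGLU(d_model=512)
x = torch.randn(8, 128, 512)   # (batch, seq_len, d_model)
out = ffn(x)
print(out.shape)               # torch.Size([8, 128, 512])
```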

