April 25, 2024, 2:16 p.m. | /u/kiockete

r/MachineLearning | www.reddit.com

It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of which transformer layer/block it is. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) + ...)
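To make the unrolling concrete, here is a minimal PyTorch sketch; the blocks L1–L3 are hypothetical stand-in linear layers (norms and attention omitted), not real transformer blocks. It checks that the nested residual wiring expands to exactly this running sum:

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(d)

# Hypothetical stand-ins for transformer blocks, named L1..L3
# to match the expression above.
L1, L2, L3 = (torch.nn.Linear(d, d) for _ in range(3))

# Residual wiring, block by block:
h1 = x + L1(x)
h2 = h1 + L2(h1)
h3 = h2 + L3(h2)

# Unrolled, h3 is exactly the running sum from the expression above:
h3_unrolled = x + L1(x) + L2(x + L1(x)) + L3(x + L1(x) + L2(x + L1(x)))
print(torch.allclose(h3, h3_unrolled))  # True
```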

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that arrives at layer L1 through that sum is the same as the gradient that arrives at L2 or L3.
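A quick way to check the sum-node part of this claim: in autograd, an addition passes the upstream gradient unchanged to every operand, so each contribution to the final sum receives the identical gradient. A minimal sketch, with ProjectionAndCost replaced by a hypothetical squared-norm loss and the contributions treated as independent tensors:

```python
import torch

d = 8
# Treat the embeddings and each block's output as independent contributions
# to the residual stream (a toy setup, not the full network).
parts = [torch.randn(d, requires_grad=True) for _ in range(4)]  # x, o1, o2, o3
s = parts[0] + parts[1] + parts[2] + parts[3]  # input to the projection

loss = (s ** 2).sum()  # hypothetical stand-in for ProjectionAndCost
loss.backward()

# Addition routes the upstream gradient unchanged to every operand,
# so every contribution gets the same gradient:
print(all(torch.equal(p.grad, parts[0].grad) for p in parts))  # True
```

Note that in the actual network the contributions are not independent (L2's input contains L1's output), so earlier layers also receive additional gradient through the later blocks; what is identical across layers is the direct path through the residual sum.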

So …
