April 25, 2024, 2:16 p.m. | /u/kiockete

r/MachineLearning | www.reddit.com

It seems to me that, thanks to the residual path, the gradient that flows into each layer is the same regardless of which transformer layer/block it is. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) + ...)
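To make the unrolling concrete, here is a minimal PyTorch sketch; the blocks L1–L3 are hypothetical stand-in linear layers (norms and attention omitted), not real transformer blocks. It checks that the nested residual wiring expands to exactly this running sum:

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(d)

# Hypothetical stand-ins for transformer blocks, named L1..L3
# to match the expression above.
L1, L2, L3 = (torch.nn.Linear(d, d) for _ in range(3))

# Residual wiring, block by block:
h1 = x + L1(x)
h2 = h1 + L2(h1)
h3 = h2 + L3(h2)

# Unrolled, h3 is exactly the running sum from the expression above:
h3_unrolled = x + L1(x) + L2(x + L1(x)) + L3(x + L1(x) + L2(x + L1(x)))
print(torch.allclose(h3, h3_unrolled))  # True
```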

Since the input to ProjectionAndCost is just the sum of the outputs of all layers plus the initial embeddings, the gradient that arrives at layer L1 through that sum is the same as the gradient that arrives at L2 or L3.
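A quick way to check the sum-node part of this claim: in autograd, an addition passes the upstream gradient unchanged to every operand, so each contribution to the final sum receives the identical gradient. A minimal sketch, with ProjectionAndCost replaced by a hypothetical squared-norm loss and the contributions treated as independent tensors:

```python
import torch

d = 8
# Treat the embeddings and each block's output as independent contributions
# to the residual stream (a toy setup, not the full network).
parts = [torch.randn(d, requires_grad=True) for _ in range(4)]  # x, o1, o2, o3
s = parts[0] + parts[1] + parts[2] + parts[3]  # input to the projection

loss = (s ** 2).sum()  # hypothetical stand-in for ProjectionAndCost
loss.backward()

# Addition routes the upstream gradient unchanged to every operand,
# so every contribution gets the same gradient:
print(all(torch.equal(p.grad, parts[0].grad) for p in parts))  # True
```

Note that in the actual network the contributions are not independent (L2's input contains L1's output), so earlier layers also receive additional gradient through the later blocks; what is identical across layers is the direct path through the residual sum.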

So …
