March 17, 2024, 2:22 a.m. | /u/Primary-Try8050

Machine Learning www.reddit.com

I don't understand how backprop works on sparsely gated MoE

In the context of LLMs, say you have n experts and you choose the top k for each token.

During training, the gate network could be completely wrong and leave the correct expert out of the chosen k. However, since the correct expert was not used, the gate gets no signal that would push it to increase that expert's weight.

In other words, during backprop, only part of the …
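To make the routing concrete, here is a minimal PyTorch sketch of a top-k gated MoE layer (the layer sizes, names, and the softmax-over-the-top-k-logits choice are illustrative assumptions, not any specific paper's exact formulation). Because unselected experts get exactly zero gating weight, neither their parameters nor their gate logits receive any gradient for that token, which is the situation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Sparsely gated MoE layer: each token is routed to its top-k experts."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        logits = self.gate(x)                                # (n_tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)  # (n_tokens, k)

        # Softmax over the k selected logits only: unselected experts get
        # exactly zero weight, so during backprop neither their parameters
        # nor their gate logits receive any gradient for this token.
        topk_w = F.softmax(topk_logits, dim=-1)              # (n_tokens, k)

        # Scatter the k weights back into a dense (n_tokens, n_experts) matrix.
        dense_w = torch.zeros_like(logits).scatter(-1, topk_idx, topk_w)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = (dense_w[:, e] > 0).nonzero(as_tuple=True)[0]  # token indices
            if routed.numel() > 0:
                # Only selected experts are evaluated, so only they appear in
                # the autograd graph for this batch of tokens.
                out = out.index_add(
                    0, routed,
                    dense_w[routed, e].unsqueeze(-1) * expert(x[routed]),
                )
        return out
```

(Real implementations typically also add noise to the gate logits and an auxiliary load-balancing loss to keep some exploration across experts; the sketch omits both and only shows the basic top-k path the question is about.)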

