April 6, 2024, 3:42 p.m. | /u/RepresentativeWay0

Machine Learning | www.reddit.com

Looking at the code for current mixture-of-experts models, they seem to use argmax with k=1 (picking only the top expert) to select the router's choice. Since argmax is non-differentiable, the gradient cannot flow to the other experts, so it seems that only the weights of the selected expert will be updated, even if it performs poorly. However, it could be the case that a different expert was in fact a better choice for the given input, …
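A minimal PyTorch sketch of this routing pattern illustrates the point (the class name and the switch-style scaling by the router probability are illustrative assumptions, not taken from any particular implementation): unselected experts are never invoked in the forward pass, so autograd records no path through them and their parameters receive no gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy top-1 (k=1) mixture-of-experts layer with hard argmax routing."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Router scores -> probability over experts for each input row.
        probs = F.softmax(self.router(x), dim=-1)   # (batch, num_experts)
        top_p, top_idx = probs.max(dim=-1)          # hard argmax choice
        # Each row is processed by exactly one expert; the others are never
        # called, so autograd builds no graph through them.
        out = torch.stack([self.experts[int(i)](xi) for xi, i in zip(x, top_idx)])
        # Scaling by the selected probability keeps a gradient path to the
        # router weights; the argmax itself is non-differentiable.
        return top_p.unsqueeze(-1) * out

moe = Top1MoE(dim=16, num_experts=4)
loss = moe(torch.randn(8, 16)).sum()
loss.backward()
# Experts that were never selected for any row keep .grad == None:
for i, expert in enumerate(moe.experts):
    print(i, expert.weight.grad is None)
```

Note that the softmax still couples the router logits, so the router itself gets some signal about other experts; it is the unselected expert networks whose weights go untouched, which is exactly the situation the question describes.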

