April 4, 2024, 4:42 a.m. | Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang

cs.LG updates on arXiv.org

arXiv:2404.02852v1 Announce Type: new
Abstract: Mixture-of-Experts (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth in training cost of dense transformers. As with dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation of model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between the model …
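To make the allocation question concrete, below is a minimal sketch of the standard Chinchilla-style compute-optimal allocation for dense models. The parametric loss form and the coefficients are illustrative assumptions (roughly the published dense-model fit), not the MoE scaling law fitted in this paper, whose form is cut off in the abstract above.

```python
# Sketch: compute-optimal (N, D) allocation under a fixed training budget,
# assuming a Chinchilla-style parametric loss L(N, D) = E + A/N**alpha + B/D**beta
# and the usual approximation that training cost is ~6*N*D FLOPs.
# Coefficients are illustrative, not the paper's fitted MoE values.
import numpy as np

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, grid=200):
    """Grid-search model sizes N, with D = C / (6N) implied by the budget,
    and return the (N, D) pair minimizing the assumed loss."""
    Ns = np.logspace(8, 12, grid)   # 100M to 1T parameters
    Ds = C / (6.0 * Ns)             # tokens affordable at each model size
    losses = loss(Ns, Ds)
    i = int(np.argmin(losses))
    return Ns[i], Ds[i], losses[i]

if __name__ == "__main__":
    C = 1e23  # training budget in FLOPs
    N, D, L = optimal_allocation(C)
    print(f"budget {C:.1e} FLOPs -> ~{N:.2e} params, ~{D:.2e} tokens, loss {L:.3f}")
```

Under these assumptions, the search simply trades parameters against tokens along the fixed-budget curve; the paper's contribution is to redo this kind of analysis for MoE models, where the trade-off additionally involves expert count and inference cost.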

