Feb. 13, 2024, 5:43 a.m. | Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kam…

cs.LG updates on arXiv.org

Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the …
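To make the granularity hyperparameter concrete, the sketch below shows one plausible reading of it: the factor by which each expert's hidden dimension is reduced relative to a standard dense feed-forward layer, with the expert count scaled up so the total expert parameter budget stays fixed. This is an illustrative assumption about the definition; the function name and parameters are hypothetical and not taken from the paper.

# Minimal sketch, assuming granularity G divides each expert's hidden size
# by G and multiplies the expert count by G, keeping total expert parameters fixed.
def fine_grained_moe_config(d_model: int, d_ff: int, n_experts: int, granularity: int):
    """Return (expert hidden size, expert count, total expert params) for granularity G."""
    assert d_ff % granularity == 0, "granularity must divide the FFN hidden size"
    expert_hidden = d_ff // granularity      # smaller experts as G grows
    expert_count = n_experts * granularity   # proportionally more experts
    # Two projection matrices (up and down) per expert, biases ignored.
    total_expert_params = expert_count * 2 * d_model * expert_hidden
    return expert_hidden, expert_count, total_expert_params

if __name__ == "__main__":
    # G = 1 corresponds to a conventional MoE layer; larger G gives finer-grained experts.
    for g in (1, 2, 4, 8):
        h, e, p = fine_grained_moe_config(d_model=1024, d_ff=4096, n_experts=8, granularity=g)
        print(f"G={g}: expert_hidden={h}, experts={e}, expert_params={p}")

Under this reading, sweeping G changes expert size without changing the parameter count, which is what lets granularity enter the scaling laws as an independent variable alongside model size and training tokens.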

