Feb. 10, 2024, 3:39 p.m. | /u/ashz8888

Machine Learning www.reddit.com

MoE models like Mixtral 8x7B replace each dense feed-forward block with 8 distinct experts; while processing a token, a router network selects two of them and combines their weighted outputs.
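For illustration, here is a minimal sketch of that top-2 routing in PyTorch. The dimensions and the plain two-layer expert MLPs are placeholders for readability, not Mixtral's actual SwiGLU experts or sizes.

```python
# Minimal top-2 MoE block: a router scores the experts per token,
# the two best are run, and their outputs are combined by softmax weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # router network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # score every expert for every token
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # combine the selected experts' outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 64)
print(Top2MoE()(x).shape)  # torch.Size([4, 64])
```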

Since only the two selected experts per layer need to be loaded into memory, while the others can remain offloaded, the model uses about 12.9B of its 46.7B total parameters for any given token.
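As a rough sanity check of those figures, the back-of-envelope count below assumes Mixtral 8x7B's published config (32 layers, hidden size 4096, expert FFN size 14336, 8 experts, top-2 routing, 32k vocab) and only approximates the attention and embedding parameters.

```python
# Rough parameter count for Mixtral 8x7B; an estimate, not an exact
# accounting of the checkpoint.
n_layers, d_model, d_ff, n_experts, top_k = 32, 4096, 14336, 8, 2
vocab = 32000

expert = 3 * d_model * d_ff                         # gate, up and down projections
attn = 2 * d_model * d_model + 2 * d_model * 1024   # q/o plus grouped-query k/v
per_layer_shared = attn + n_experts * d_model       # attention + router
shared = n_layers * per_layer_shared + 2 * vocab * d_model  # + embeddings/lm_head

total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert
print(f"total  ~ {total / 1e9:.1f}B")   # ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # ~ 12.9B
```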

I'm wondering how to bring the parameters down to almost the same level …

