Jan. 30, 2024, 12:40 a.m. | /u/Tiny_Cut_8440

Machine Learning www.reddit.com

Hi everyone,

Recently experimented with deploying the Mixtral-8x7B model and wanted to share key findings for those interested:

**Best Performance**: With the 8-bit quantized model running on PyTorch (nightly), I got an average token generation rate of 52.03 tokens/sec on an A100, an average inference time of 4.94 seconds, and a cold start of 11.48 seconds (which matters when deployed in a serverless environment).

[Mixtral Experiments](https://preview.redd.it/i7mbjzl74hfc1.png?width=1600&format=png&auto=webp&s=1bb27c889d3b76a50b33cd549a7156702b5b4ae3)
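For anyone reproducing these numbers, a minimal sketch of how such metrics are typically computed from per-request timings (helper names here are hypothetical, not from my actual benchmark harness):

```python
# Sketch: summarize throughput (tokens/sec), mean latency, and cold start
# from recorded per-request wall-clock timings.
from dataclasses import dataclass

@dataclass
class InferenceStats:
    tokens_per_sec: float   # aggregate generation rate across all requests
    avg_latency_s: float    # mean wall-clock time per request
    cold_start_s: float     # first-request latency (includes model load)

def summarize(latencies_s, tokens_generated, cold_start_s):
    """latencies_s[i] is the wall-clock time of request i;
    tokens_generated[i] is the number of new tokens it produced."""
    total_tokens = sum(tokens_generated)
    total_time = sum(latencies_s)
    return InferenceStats(
        tokens_per_sec=total_tokens / total_time,
        avg_latency_s=total_time / len(latencies_s),
        cold_start_s=cold_start_s,
    )
```

Note that tokens/sec measured this way is end-to-end (prompt processing plus decode), so it will read lower than a pure decode-rate figure.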

**Other Libraries Tested:** vLLM, AutoGPTQ, HQQ

Keen to hear your experiences and learnings in similar deployments!
