Experiments with Mixtral-8x7B using Multiple Libraries - Got max 52 tokens/sec. Thoughts?
Feb. 1, 2024, 1:10 a.m. | /u/Tiny_Cut_8440
machinelearningnews www.reddit.com
I recently experimented with deploying the Mixtral-8x7B model and wanted to share the key findings for those interested:
**Best Performance**: The 8-bit quantized model running on PyTorch (nightly) reached an average token generation rate of 52.03 tokens/sec on an A100, with an average inference time of 4.94 seconds and a cold start of 11.48 seconds (which matters when deploying in a serverless environment).
https://preview.redd.it/93l5oydhjvfc1.png?width=1600&format=png&auto=webp&s=300e6d690d3de995db86fedf633bec25d149b935
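For anyone wanting to reproduce numbers like these on their own setup, here is a minimal sketch of how an average tokens/sec and an average inference time can be measured. It is not the tutorial's benchmark code: the `benchmark` helper, its `generate` callable, and the `clock` parameter are hypothetical names made up for illustration, and `generate` is assumed to return only the newly generated token IDs for a prompt. As a rough sanity check, 52.03 tokens/sec over 4.94 seconds works out to about 257 generated tokens per request, assuming both averages describe the same runs (the post does not say so).

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence


def benchmark(
    generate: Callable[[str], Sequence[int]],
    prompts: List[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time each generation call and report average latency and tokens/sec.

    `generate` is a stand-in for whatever backend is being tested
    (assumption: it returns only the newly generated token IDs).
    """
    latencies = []
    rates = []
    for prompt in prompts:
        start = clock()
        new_tokens = generate(prompt)
        elapsed = clock() - start
        latencies.append(elapsed)
        # Per-request throughput: generated tokens divided by wall-clock time.
        rates.append(len(new_tokens) / elapsed)
    return {
        "avg_latency_s": mean(latencies),
        "avg_tokens_per_s": mean(rates),
    }


if __name__ == "__main__":
    def fake_generate(prompt: str) -> List[int]:
        # Dummy backend so the sketch runs without a GPU or model weights.
        time.sleep(0.05)
        return list(range(32))

    print(benchmark(fake_generate, ["Hello", "Explain mixture of experts"]))
```

Swap `fake_generate` for a wrapper around the real model call. Cold-start time is a separate measurement (model load plus first request) and is deliberately not covered by this helper.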
**Other Libraries Tested:** vLLM, AutoGPTQ, HQQ
Here is the link to the tutorial - [https://tutorials.inferless.com/deploy-mixtral-8x7b-for-52-tokens-sec-on-a-single-gpu](https://tutorials.inferless.com/deploy-mixtral-8x7b-for-52-tokens-sec-on-a-single-gpu)
Keen to hear your experiences and learnings in similar deployments!