[D] Experiments with Mixtral-8x7B using Multiple Libraries - Got max 52 tokens/sec. Thoughts?
Jan. 30, 2024, 12:40 a.m. | /u/Tiny_Cut_8440
Machine Learning www.reddit.com
I recently experimented with deploying the Mixtral-8x7B model and wanted to share the key findings for those interested:
**Best Performance**: With the 8-bit quantized model using PyTorch (nightly), I got an average generation rate of 52.03 tokens/sec on an A100, an average inference time of 4.94 seconds, and a cold start of 11.48 seconds (cold start matters when deploying in a serverless environment).
[Mixtral Experiments](https://preview.redd.it/i7mbjzl74hfc1.png?width=1600&format=png&auto=webp&s=1bb27c889d3b76a50b33cd549a7156702b5b4ae3)
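For reference, here is a minimal sketch of how numbers like the ones above (tokens/sec, average inference time, cold start) can be measured. The timing logic is generic; `generate_fn` is a hypothetical stand-in for a real call such as `model.generate(...)` from transformers, and is not part of the original post:

```python
import time

def measure_throughput(generate_fn, prompt, n_runs=5):
    """Benchmark a text-generation callable.

    generate_fn(prompt) is assumed to return a sequence of generated
    token ids. Returns cold-start time, average per-call latency,
    and aggregate tokens/sec over n_runs warm calls.
    """
    # Cold start: the first call typically includes weight loading,
    # CUDA context init, and kernel compilation.
    t0 = time.perf_counter()
    generate_fn(prompt)
    cold_start = time.perf_counter() - t0

    total_tokens, total_time = 0, 0.0
    for _ in range(n_runs):
        t0 = time.perf_counter()
        tokens = generate_fn(prompt)
        total_time += time.perf_counter() - t0
        total_tokens += len(tokens)

    return {
        "cold_start_s": cold_start,
        "avg_inference_s": total_time / n_runs,
        "tokens_per_sec": total_tokens / total_time,
    }
```

Averaging over several warm runs (rather than timing a single call) smooths out GPU scheduling jitter, which otherwise makes tokens/sec comparisons between libraries noisy.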
**Other Libraries Tested:** vLLM, AutoGPTQ, HQQ
Keen to hear your experiences and learnings in similar deployments!