Jan. 4, 2024, 1:11 p.m. | /u/Tiny_Cut_8440

Machine Learning www.reddit.com

Hello everyone,

Been working on optimizing upstart.ai SOLAR-10.7B-Instruct-v1.0 model and wanted to share our insights:

🚀 **Our Approach:** Quantized the model using Auto-GPTQ, then deployed with vLLM.

Results: In a serverless setup, we saw 1.37 sec inference, 111.54 tokens/sec, and an 11.69 sec cold start on Nvidia A100 GPU.

https://preview.redd.it/kel8cn5dafac1.png?width=1600&format=png&auto=webp&s=5bca8b5e4a48f5f7a709f44bc431844746c61a77

Other Methods Tested: Although Auto-GPTQ was an option, our experience suggests that vLLM is the superior choice for deployment.

Looking forward to hearing about your experiences with similar projects!

a100 a100 gpu auto cold start gpu hello inference insights machinelearning nvidia nvidia a100 nvidia a100 gpu sec serverless setup solar tokens

Software Engineer for AI Training Data (School Specific)

@ G2i Inc | Remote

Software Engineer for AI Training Data (Python)

@ G2i Inc | Remote

Software Engineer for AI Training Data (Tier 2)

@ G2i Inc | Remote

Data Engineer

@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania

Artificial Intelligence – Bioinformatic Expert

@ University of Texas Medical Branch | Galveston, TX

Lead Developer (AI)

@ Cere Network | San Francisco, US