May 7, 2024, 5:36 a.m. | /u/Fit-Flow-4180

Machine Learning | www.reddit.com

I’ve read that inference speed for models like Llama-2 70B tops out at around 10 t/s. So I’ve been wondering how extremely large models like GPT-4 (rumored ~1T params?) manage their fast ~20 t/s inference. With 10x the params, they’ve got to have at least 3x the layers(?), and since every generated token has to pass through every layer sequentially, that should make inference much slower. Am I missing anything? What further improvements might these companies be making to power their fast APIs?

Edit: I must mention that you …
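As a sanity check on the numbers in the question: single-stream decoding is usually memory-bandwidth bound, so a rough estimate is tokens/sec ≈ effective memory bandwidth ÷ bytes of weights read per token. Below is a minimal back-of-envelope sketch in Python. The hardware bandwidth figures are public specs; the GPT-4 figures (a mixture-of-experts with ~280B *active* params per token, sharded over 8 GPUs) are unconfirmed rumors, used purely as assumptions for illustration.

```python
# Rough model: every decoded token must stream all *active* weights from
# memory, so tokens/sec ~= effective memory bandwidth / bytes per token.
# GPU bandwidth specs are public; the GPT-4 figures are rumors, not facts.

def tokens_per_sec(active_params_b, bytes_per_param, mem_bw_gbs,
                   num_gpus=1, efficiency=0.5):
    """Back-of-envelope decode speed for bandwidth-bound generation."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    effective_bw = mem_bw_gbs * 1e9 * num_gpus * efficiency
    return effective_bw / bytes_per_token

# Llama-2 70B in fp16 on one A100 80GB (~2 TB/s HBM): ~7 t/s, in line
# with the ~10 t/s figure above.
print(f"Llama-2 70B: {tokens_per_sec(70, 2.0, 2000):.1f} t/s")

# Rumored GPT-4: a mixture-of-experts where only ~280B of ~1.8T params
# are active per token (unverified), sharded over 8 H100s (~3.35 TB/s each).
print(f"GPT-4 (rumored MoE): {tokens_per_sec(280, 2.0, 3350, num_gpus=8):.1f} t/s")
```

The takeaway: if the MoE rumor is true, GPT-4 only reads a fraction of its total weights per token, and sharding across many GPUs multiplies effective bandwidth, so ~20 t/s becomes plausible despite the 10x parameter count. Batching, quantization, and speculative decoding would push throughput further, but the bandwidth arithmetic above is the first-order effect.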

