April 20, 2024, 2:48 p.m. | Eduardo Alvarez


Created with Nightcafe — Image property of Author

Learn how to reduce model latency when deploying Meta Llama 3 on CPUs

The much-anticipated release of Meta’s third generation of Llama models is here, and I want to ensure you know how to deploy this state-of-the-art (SoTA) LLM optimally. In this tutorial, we will focus on performing weight-only quantization (WOQ) to compress the 8B parameter model and improve inference latency. But first, let’s discuss Meta Llama 3.
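
To make WOQ concrete before we get there, here is a minimal sketch of the flow this tutorial builds toward. It assumes Intel Extension for PyTorch (ipex) and Hugging Face Transformers are installed and that you have access to the meta-llama/Meta-Llama-3-8B-Instruct checkpoint; the exact qconfig arguments vary by ipex version, so treat this as an illustrative sketch rather than the definitive recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import intel_extension_for_pytorch as ipex

# Illustrative model ID; any Llama 3 8B checkpoint you have access to works.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# WOQ config: weights are stored in int8 while activations stay in higher
# precision, cutting memory traffic (the usual CPU inference bottleneck).
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.qint8,
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,
)

# ipex rewrites the model with weight-only-quantized kernels for CPU.
model = ipex.llm.optimize(model.eval(), dtype=torch.bfloat16, quantization_config=qconfig)

# Generate with the compressed model as usual.
inputs = tokenizer("Explain weight-only quantization in one sentence.", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Because only the weights are quantized, no calibration dataset is needed, which is a big part of what makes WOQ attractive for quick CPU deployments.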

Llama 3

To date, the Llama …
