April 20, 2024, 2:48 p.m. | Eduardo Alvarez


Created with NightCafe (image property of the author)

Learn how to reduce model latency when deploying Meta Llama 3 on CPUs

The much-anticipated release of Meta's third generation of Llama models is here, and I want to ensure you know how to deploy this state-of-the-art (SoTA) LLM optimally. In this tutorial, we will focus on performing weight-only quantization (WOQ) to compress the 8B-parameter model and improve inference latency, but first, let's discuss Meta Llama 3.
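As a preview of the recipe we will build toward, here is a minimal sketch of WOQ with Intel Extension for PyTorch. It assumes intel-extension-for-pytorch 2.x and access to the gated meta-llama/Meta-Llama-3-8B-Instruct checkpoint on Hugging Face; the exact quantization arguments can vary by IPEX version, so treat this as an illustration rather than the final code.

```python
# Sketch: weight-only quantization (WOQ) of Llama 3 8B on CPU with IPEX.
# Assumes: pip install torch intel-extension-for-pytorch transformers
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; requires HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# WOQ config: weights are stored in int8 (use torch.quint4x2 for 4-bit),
# while activations stay in the model's higher-precision compute dtype.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.qint8,                     # int8 weight storage
    lowp_mode=ipex.quantization.WoqLowpMode.NONE, # or FP16 / BF16 / INT8
)

# IPEX swaps in quantized, CPU-optimized attention and MLP kernels.
model = ipex.llm.optimize(model, quantization_config=qconfig)

# Quick smoke test of the quantized model.
inputs = tokenizer("The three main benefits of quantization are", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The design trade-off here is that compressing only the weights shrinks the memory footprint and memory-bandwidth pressure, which typically dominates CPU inference latency, while keeping activations in bfloat16 limits accuracy loss.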

Llama 3

To date, the Llama …
