Nov. 20, 2023, 3:17 a.m. | Maxime Labonne

Towards Data Science (towardsdatascience.com)

Quantize and run EXL2 models


Quantizing Large Language Models (LLMs) is the most popular approach for reducing their size and speeding up inference. Among these techniques, GPTQ delivers excellent performance on GPUs: compared to unquantized models, it uses almost three times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it was recently integrated directly into the transformers library.
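The "almost three times less VRAM" figure can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative, not a measurement from the article: the 7B parameter count and the ~5.5 effective bits per weight (4-bit quantized weights plus storage for scales and zero points) are assumptions for the sake of the example.

```python
# Rough VRAM estimate for model weights alone (illustrative assumptions,
# not measured figures: 7B parameters, ~5.5 effective bits/weight for GPTQ).
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory footprint of the weights, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a 7B-parameter model, for illustration

fp16 = weight_memory_gb(n_params, 16)    # unquantized half precision
gptq = weight_memory_gb(n_params, 5.5)   # 4-bit GPTQ plus metadata overhead

print(f"FP16: {fp16:.1f} GB, 4-bit GPTQ: {gptq:.1f} GB, ratio ~{fp16 / gptq:.1f}x")
```

With these assumptions the ratio comes out near 2.9x, consistent with the "almost three times less VRAM" claim; actual savings depend on the quantization config and on activation memory, which this sketch ignores.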

ExLlamaV2 is a library …

artificial intelligence · data science · large language models · programming · quantization
