ExLlamaV2: The Fastest Library to Run LLMs
Quantize and run EXL2 models
Quantizing Large Language Models (LLMs) is the most popular approach to reduce their size and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs: compared to unquantized models, it uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it was recently integrated directly into the transformers library.
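To see roughly where that VRAM figure comes from, here is a back-of-envelope sketch for a hypothetical 7B-parameter model (weights only; the parameter count and byte sizes are illustrative assumptions, not measurements from any specific model):

```python
# Rough weights-only VRAM estimate for a hypothetical 7B-parameter model.
params = 7e9

fp16_gb = params * 2.0 / 1e9    # 16-bit weights: 2 bytes per parameter
gptq4_gb = params * 0.5 / 1e9   # 4-bit GPTQ weights: 0.5 bytes per parameter

print(f"FP16:   {fp16_gb:.1f} GB")   # 14.0 GB
print(f"GPTQ-4: {gptq4_gb:.1f} GB")  # 3.5 GB
print(f"ratio:  {fp16_gb / gptq4_gb:.1f}x")  # 4.0x, weights only
```

The weights-only ratio is 4x; in practice activations, the KV cache, and quantization metadata are kept at higher precision, which is why the overall saving observed in practice is closer to the "almost 3 times" mentioned above.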
ExLlamaV2 is a library …