QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models. (arXiv:2310.09259v2 [cs.LG] UPDATED)
cs.LG updates on arXiv.org
Large Language Models (LLMs) from the GPT family have become extremely
popular, leading to a race towards reducing their inference costs to allow for
efficient local computation. Yet, the vast majority of existing work focuses on
weight-only quantization, which can reduce runtime costs in the memory-bound
one-token-at-a-time generative setting, but does not address them in
compute-bound scenarios, such as batched inference or prompt processing. In
this paper, we address the general quantization problem, where both weights and
activations should be …
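As background for the weight-only vs. weight-and-activation distinction above: quantizing both operands lets the matrix multiplies themselves run in integer arithmetic, which is what helps in compute-bound settings. Below is a minimal NumPy sketch of symmetric per-tensor 4-bit quantization applied to both weights and activations; it is illustrative only and does not reproduce QUIK's actual scheme (which handles outliers and per-channel scales, among other details).

```python
import numpy as np

def quantize_sym_4bit(x):
    # Symmetric per-tensor 4-bit quantization: map floats to integers in [-8, 7]
    # using a single scale derived from the largest absolute value.
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the integer codes.
    return q.astype(np.float32) * scale

# Toy weight matrix and a small activation batch (shapes are arbitrary).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
X = rng.standard_normal((8, 64)).astype(np.float32)

qW, sW = quantize_sym_4bit(W)
qX, sX = quantize_sym_4bit(X)

# With both sides quantized, the matmul runs entirely in integer arithmetic,
# followed by a single floating-point rescale of the accumulated result.
# A weight-only scheme would instead dequantize W and multiply in float.
Y_int = qX.astype(np.int32) @ qW.T.astype(np.int32)
Y = Y_int.astype(np.float32) * (sX * sW)
```

Per-tensor 4-bit scales are deliberately crude here; real schemes use finer-grained scales precisely because a single scale wastes most of the 16 available levels on rare large values.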