GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
March 11, 2024, 4:41 a.m. | Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
cs.LG updates on arXiv.org
Abstract: Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation error when representing the compressed matrices. The autoregressive decoding process further compounds the error of …
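The abstract's two claims, that the cache grows linearly with sequence length and that uniform quantization of all entries can be error-prone, are easy to make concrete. The sketch below is illustrative only: the layer/head/dimension sizes are assumptions for a 7B-class model (not taken from the paper), and quantize_uniform is a generic per-tensor scheme of the kind the abstract critiques, not GEAR's method.

```python
import numpy as np

# Assumed 7B-class shapes; illustrative, not from the paper.
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_fp16 = 2

def kv_cache_bytes(seq_len: int) -> int:
    # K and V tensors, each of shape [layers, heads, seq_len, head_dim].
    return 2 * n_layers * n_heads * seq_len * head_dim * bytes_per_fp16

for seq_len in (1024, 8192, 32768):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"seq_len={seq_len:>6}: KV cache ~ {gib:.1f} GiB")  # grows linearly

def quantize_uniform(x: np.ndarray, bits: int = 4) -> np.ndarray:
    # Per-tensor uniform quantization: spread 2**bits levels over the range,
    # then return the dequantized reconstruction.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, head_dim)).astype(np.float32)
kv[0, 0] = 20.0  # a single outlier stretches the range all levels must cover
err = np.abs(quantize_uniform(kv) - kv).mean()
print(f"mean abs error, 4-bit uniform with one outlier: {err:.3f}")
```

Because decoding is autoregressive, a per-step reconstruction error like the one above feeds into every subsequent attention computation, which is the compounding effect the abstract points to.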