SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
April 9, 2024, 4:41 a.m. | Zihao Wang, Shaoduo Gan
cs.LG updates on arXiv.org
Abstract: Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is considered critical to reducing the cost of inference. Most existing KV-cache compression algorithms attempt to sparsify the sequence of tokens by exploiting the varying importance of individual tokens. In this work, we find that by also identifying the importance of attention layers, the KV-cache can be optimized jointly along two dimensions. Based on our observations regarding layer-wise importance in inference, …
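The abstract only sketches the two-dimensional idea, so the following Python snippet is a minimal, hypothetical illustration of a layer-wise KV-cache budget: a total token budget is split across attention layers in proportion to an importance score, and a sequence-wise eviction policy is then applied within each layer's budget. The function names, the importance heuristic, and the token-eviction rule here are assumptions for illustration, not the paper's actual algorithm.

```python
# Minimal sketch of 2D KV-cache budgeting (layer-wise + sequence-wise).
# All names and heuristics below are hypothetical stand-ins.
import numpy as np


def allocate_layer_budgets(layer_importance, total_budget, min_budget=32):
    """Split a total KV-cache token budget across layers in proportion to
    per-layer importance scores, with a floor so no layer is starved.
    (The paper derives its own layer-importance measure; this is a stand-in.)"""
    importance = np.asarray(layer_importance, dtype=np.float64)
    weights = importance / importance.sum()
    budgets = np.maximum(min_budget, np.floor(weights * total_budget)).astype(int)
    return budgets


def evict_tokens(kv_cache, token_scores, budget):
    """Within one layer, keep only the `budget` highest-scoring tokens --
    a placeholder for any sequence-wise compression policy."""
    if len(token_scores) <= budget:
        return kv_cache
    keep = np.argsort(token_scores)[-budget:]
    keep.sort()  # preserve original token order after eviction
    return [kv_cache[i] for i in keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, seq_len, total_budget = 8, 256, 1024

    # Hypothetical per-layer importance scores (e.g., how strongly attention
    # changes the hidden state in each layer).
    layer_importance = rng.random(n_layers) + 0.1
    budgets = allocate_layer_budgets(layer_importance, total_budget)
    print("per-layer budgets:", budgets, "sum:", budgets.sum())

    # Apply a token-level eviction policy inside each layer's budget.
    for budget in budgets:
        kv_cache = list(range(seq_len))        # placeholder KV entries
        token_scores = rng.random(seq_len)     # placeholder token importance
        compressed = evict_tokens(kv_cache, token_scores, budget)
        assert len(compressed) == min(budget, seq_len)
```

Because of the per-layer floor and rounding, the allocated budgets need not sum exactly to the total; a real implementation would redistribute the remainder, but the proportional split is what conveys the layer-wise dimension.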