Feb. 12, 2024, 5:46 a.m. | Siyu Ren, Kenny Q. Zhu

cs.CL updates on arXiv.org arxiv.org

Despite the recent success associated with Large Language Models (LLMs), they are notably cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for maintaining the overhead of the key-value cache under a given budget. This paper embarks on the efficacy of existing eviction policies in …
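To make the linear growth concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function name `kv_cache_bytes`, the `budget` parameter, and the model dimensions in the example are illustrative assumptions. It assumes a standard decoder-only transformer that stores one key tensor and one value tensor per layer.

```python
from typing import Optional


def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,       # assumption: fp16/bf16 storage
    budget: Optional[int] = None,  # hypothetical per-sequence token budget
) -> int:
    """Estimate key-value cache size in bytes.

    If `budget` is given, an eviction policy is assumed to keep at most
    `budget` cached tokens per sequence, so the cache stops growing there.
    """
    cached_tokens = seq_len if budget is None else min(seq_len, budget)
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # keys + values
    return batch_size * cached_tokens * per_token


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, fp16.
full = kv_cache_bytes(1, 4096, 32, 32, 128)
capped = kv_cache_bytes(1, 4096, 32, 32, 128, budget=1024)
print(full / 2**30, "GiB without eviction")
print(capped / 2**30, "GiB with a 1024-token budget")
```

Doubling either the batch size or the sequence length doubles the estimate, which is the linear growth the abstract describes. Which tokens an eviction policy keeps within the budget is the question the paper studies and is not modeled here.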

cs.ai cs.cl

