Feb. 12, 2024, 5:42 a.m. | Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi

cs.LG updates on arXiv.org

Despite the significant success of large language models (LLMs), their extensive memory requirements make them challenging to deploy for long-context token generation. The substantial memory footprint of LLM decoders arises from the need to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, we focus on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building …
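To make the memory claim concrete, the following is a minimal back-of-the-envelope sketch of how an uncompressed KV cache grows with context length. It is not taken from the paper: the function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) are hypothetical, chosen only for illustration. The sketch assumes a standard decoder that stores one key vector and one value vector per token, per head, per layer.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    Assumes an uncompressed cache: every layer keeps one key and one
    value vector of size `head_dim` per KV head for every token.
    """
    # Factor 2: one key vector plus one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * num_tokens * per_token


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
    for n in (1_024, 32_768, 131_072):
        gib = kv_cache_bytes(n, **shape) / 2**30
        print(f"{n:>7} tokens -> {gib:.2f} GiB")
```

Because the cost per token is constant, the cache size is linear in the number of tokens generated, which is why long contexts dominate decoder memory and why compressing the cache is attractive.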

