Feb. 12, 2024, 5:42 a.m. | Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi

cs.LG updates on arXiv.org

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the need to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, we focus on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency among key embeddings in the attention module. Building …
