Feb. 12, 2024, 5:42 a.m. | Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi

cs.LG updates on arXiv.org

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building …
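To make the memory pressure and the clustering observation concrete, here is a minimal sketch, not the authors' method: it estimates the KV-cache footprint of a decoder and then applies a toy k-means compression of key embeddings, where each token's key is replaced by its nearest centroid. The model dimensions, the `kv_cache_size` helper, and the plain k-means routine are illustrative assumptions, not details from the paper.

```python
import numpy as np

def kv_cache_size(num_layers, num_heads, head_dim, seq_len, bytes_per_val=2):
    """Bytes needed to cache keys and values for one sequence (illustrative)."""
    # factor of 2: one tensor for keys, one for values
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_val

def compress_keys_by_clustering(keys, num_clusters, num_iters=10, seed=0):
    """Toy lossy compression: cluster key embeddings with plain k-means and
    keep only the centroids plus a per-token centroid index."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # assign each key to its nearest centroid
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned keys
        for c in range(num_clusters):
            members = keys[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assign

if __name__ == "__main__":
    # hypothetical 7B-scale decoder: 32 layers, 32 heads, head_dim 128, fp16 values
    gib = kv_cache_size(32, 32, 128, 32_768) / 2**30
    print(f"KV cache at 32k tokens: {gib:.1f} GiB")

    keys = np.random.randn(2048, 128).astype(np.float32)  # stand-in key embeddings
    centroids, assign = compress_keys_by_clustering(keys, num_clusters=128)
    print("stored key vectors:", len(centroids), "instead of", len(keys))
```

Storing one small centroid index per token instead of a full key vector is what turns a clustering tendency in the keys into memory savings; the quality of the approximation then hinges on how tightly the keys actually cluster.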

