May 8, 2024, 4:41 a.m. | Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

cs.LG updates on arXiv.org

arXiv:2405.03917v1 Announce Type: new
Abstract: Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct …
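To make the compression idea concrete, below is a minimal sketch (not the paper's method, which the truncated abstract does not fully describe) of generic low-bit uniform quantization of a KV cache tensor. The function names, the per-channel grouping, and the 4-bit setting are illustrative assumptions.

```python
# Minimal sketch of low-bit uniform KV cache quantization (illustrative only;
# not the method proposed in arXiv:2405.03917). Assumes per-channel
# asymmetric quantization over the sequence dimension.
import torch

def quantize_kv(x: torch.Tensor, n_bits: int = 4):
    # x: (batch, heads, seq_len, head_dim) slice of the key or value cache.
    qmax = 2 ** n_bits - 1
    # Per-channel min/max computed across the sequence dimension.
    xmin = x.amin(dim=-2, keepdim=True)
    xmax = x.amax(dim=-2, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    # Map values to integer codes in [0, 2^n_bits - 1].
    q = torch.clamp(torch.round((x - xmin) / scale), 0, qmax).to(torch.uint8)
    return q, scale, xmin

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, xmin: torch.Tensor):
    # Reconstruct an approximation of the original cache for attention.
    return q.float() * scale + xmin
```

At 4 bits this reduces cache storage roughly 4x relative to FP16 (plus the per-channel scale/offset overhead); the abstract notes that schemes of this kind still degrade at very low bit widths, which is the regime the paper targets.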

