March 15, 2024, 4:48 a.m. | Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

cs.CL updates on arXiv.org arxiv.org

arXiv:2403.09636v1 Announce Type: new
Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads …
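The abstract only states that the KV cache grows linearly with sequence length and that DMC learns per-head compression rates; the exact mechanism is not given here. As a rough illustration of the idea, below is a minimal sketch of a per-head cache that either appends a new key-value pair or folds it into the last cached slot so the cache stops growing. The merge-or-append rule, the 0.5 averaging weights, and the per-head merge probabilities are assumptions made for this sketch, not the paper's actual method.

import numpy as np

class CompressedKVCache:
    """Per-head KV cache: each step either appends a new (key, value) pair
    or merges it into the last cached slot, shortening the effective cache."""

    def __init__(self, num_heads: int, head_dim: int):
        self.head_dim = head_dim
        self.keys = [[] for _ in range(num_heads)]    # list of key vectors per head
        self.values = [[] for _ in range(num_heads)]  # list of value vectors per head

    def update(self, head: int, key: np.ndarray, value: np.ndarray, merge: bool) -> None:
        # If merging (and the cache is non-empty), average into the last slot
        # so this head's cache length does not increase; otherwise append.
        if merge and self.keys[head]:
            self.keys[head][-1] = 0.5 * (self.keys[head][-1] + key)
            self.values[head][-1] = 0.5 * (self.values[head][-1] + value)
        else:
            self.keys[head].append(key)
            self.values[head].append(value)

    def length(self, head: int) -> int:
        return len(self.keys[head])

# Hypothetical usage: a head with a higher merge probability ends up with a
# shorter cache, i.e. a higher compression rate for that head.
rng = np.random.default_rng(0)
cache = CompressedKVCache(num_heads=2, head_dim=4)
for t in range(16):
    for head, merge_prob in enumerate([0.1, 0.7]):  # assumed per-head aggressiveness
        k, v = rng.normal(size=4), rng.normal(size=4)
        cache.update(head, k, v, merge=rng.random() < merge_prob)
print([cache.length(h) for h in range(2)])  # head 1 keeps far fewer slots than head 0

In this toy setting the second head stores only a fraction of the tokens it has seen, which is the memory saving DMC targets, but how the real model decides when to compress is learned, not fixed as it is here.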
