[R] Efficient Memory Management for Large Language Model Serving with PagedAttention - UC Berkeley et al., 2023 - 2-4x higher throughput than HuggingFace Transformers without requiring any model architecture changes!
Sept. 13, 2023, 7:55 p.m. | /u/Singularian2501
Machine Learning www.reddit.com
Github: [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
Blog: [https://vllm.ai/](https://vllm.ai/)
Abstract:
>High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging …
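The abstract's key idea is to manage the KV cache the way an OS manages virtual memory: split each request's cache into fixed-size blocks and map logical blocks to physical ones on demand, so memory is never reserved for tokens that haven't been generated yet. The sketch below is an illustrative toy, not vLLM's implementation; the names (`BlockAllocator`, `BLOCK_SIZE`) and the exact bookkeeping are assumptions made for clarity.

```python
# Toy block-table allocator in the spirit of PagedAttention
# (hypothetical sketch, NOT the vLLM implementation).
BLOCK_SIZE = 16  # tokens stored per physical KV-cache block (assumed value)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> number of tokens cached

    def append_token(self, req):
        """Grab a fresh physical block only when the current one fills up."""
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:  # first token, or last block is full
            if not self.free:
                raise MemoryError("out of KV-cache blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=8)
for _ in range(17):              # 17 tokens -> spills into a second block
    alloc.append_token("req-0")
print(len(alloc.tables["req-0"]))  # 2 blocks mapped, not a huge contiguous slab
alloc.release("req-0")
print(len(alloc.free))             # all 8 blocks free again
```

Because blocks are allocated lazily and freed as soon as a request finishes, internal fragmentation is bounded by one partially filled block per sequence, which is what lets the server pack larger batches into the same GPU memory.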