Prompt Cache: Modular Attention Reuse for Low-Latency Inference
April 26, 2024, 4:47 a.m. | In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
cs.CL updates on arXiv.org
Abstract: We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt …
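To make the core idea concrete, here is a minimal sketch of precomputing and reusing attention states (the KV cache) for a shared prompt segment, assuming Hugging Face transformers and PyTorch. It shows only the simpler case of a shared prefix; the paper's Prompt Cache additionally uses a schema of "prompt modules" so that reused segments need not sit at the start of the prompt. `MODEL_NAME`, `SYSTEM_PROMPT`, and `answer` are illustrative placeholders, not part of the paper's implementation.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# A text segment shared across many prompts (e.g., a system message).
SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely.\n"

# Precompute the segment's attention states (the KV cache) once, at load time.
prefix_ids = tok(SYSTEM_PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    cached_states = model(prefix_ids, use_cache=True).past_key_values

def answer(user_text: str) -> str:
    """Serve a request, reusing the precomputed states for the shared prefix."""
    user_ids = tok(user_text, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, user_ids], dim=-1)
    with torch.no_grad():
        out = model.generate(
            input_ids,
            # Deep-copy so generation does not extend the stored cache in place.
            past_key_values=copy.deepcopy(cached_states),
            max_new_tokens=32,
            pad_token_id=tok.eos_token_id,
        )
    return tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What is attention-state reuse?"))
```

Because the cached segment's tokens never re-enter the forward pass, per-request prefill work shrinks roughly in proportion to the segment's length, which is where the latency reduction comes from.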