April 26, 2024, 4:47 a.m. | In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

cs.CL updates on arXiv.org

arXiv:2311.04934v2 Announce Type: replace
Abstract: We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt …
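The core idea, precomputing the attention key/value states of a recurring text segment once and reusing them whenever that segment reappears, can be illustrated with a minimal prefix-reuse sketch using the Hugging Face `transformers` API. This covers only the simplest case of a shared prompt prefix, not the paper's full modular scheme; the model name, cache dictionary, and helper functions below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of attention-state (KV) reuse for a shared prompt prefix.
# Assumptions: "gpt2" as a stand-in causal LM; kv_cache, precompute_segment,
# and generate_with_cached_prefix are illustrative names, not the paper's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM with KV caching works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

kv_cache = {}  # segment text -> (token ids, precomputed attention states)

def precompute_segment(segment: str):
    """Run the model once over a frequently used segment and store its KV states."""
    ids = tokenizer(segment, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    kv_cache[segment] = (ids, out.past_key_values)

def generate_with_cached_prefix(segment: str, user_text: str, max_new_tokens: int = 32):
    """Reuse the cached prefix states; only the new user text needs prefill compute."""
    prefix_ids, past = kv_cache[segment]
    user_ids = tokenizer(user_text, return_tensors="pt").input_ids
    full_ids = torch.cat([prefix_ids, user_ids], dim=1)
    with torch.no_grad():
        # Depending on the transformers version, past_key_values may need to be
        # wrapped in a Cache object (e.g. DynamicCache.from_legacy_cache).
        out = model.generate(full_ids, past_key_values=past,
                             max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

precompute_segment("You are a helpful assistant. Answer concisely.\n")
print(generate_with_cached_prefix("You are a helpful assistant. Answer concisely.\n",
                                  "What is attention caching?"))
```

A prefix is the easy case because its token positions never change; reusing a segment at an arbitrary position in a prompt also has to account for position encodings, which is what the paper's modular scheme addresses and what this sketch leaves out.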

