RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Feb. 23, 2024, 5:48 a.m. | Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
cs.CL updates on arXiv.org
Abstract: Practical large language model (LLM) services may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across numerous requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires …
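The abstract is cut off mid-observation, but the setup it describes (a long system prompt whose KV cache is identical across many concurrent requests) suggests the general recipe of computing attention over the shared prefix separately from attention over each request's own tokens and then merging the two partial results. Below is a minimal NumPy sketch of that split-and-merge idea under my own assumptions: the function names and toy shapes are illustrative, not the paper's API, and the actual RelayAttention implementation works on batched GPU kernels rather than per-query NumPy code. The log-sum-exp bookkeeping is what makes the merged result exactly equal to attention over the concatenated sequence.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Plain attention for one query vector.
    Returns the attention output and the log-sum-exp of the scores,
    which is needed later to merge partial results."""
    scores = k @ q / np.sqrt(q.shape[-1])      # (seq_len,)
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())                  # log-sum-exp of the scores
    out = (w / w.sum()) @ v                    # (head_dim,)
    return out, lse

def merge_partial_attention(out_sys, lse_sys, out_req, lse_req):
    """Combine attention over the shared system-prompt KV with attention
    over request-specific KV by rescaling each part with its softmax mass."""
    m = max(lse_sys, lse_req)
    w_sys = np.exp(lse_sys - m)
    w_req = np.exp(lse_req - m)
    return (w_sys * out_sys + w_req * out_req) / (w_sys + w_req)

# Toy example: one decode-step query, a shared system prompt of 32 tokens
# (reused by every request) and 8 request-specific tokens.
rng = np.random.default_rng(0)
d = 64
k_sys, v_sys = rng.normal(size=(32, d)), rng.normal(size=(32, d))
k_req, v_req = rng.normal(size=(8, d)), rng.normal(size=(8, d))
q = rng.normal(size=d)

out_sys, lse_sys = softmax_attention(q, k_sys, v_sys)
out_req, lse_req = softmax_attention(q, k_req, v_req)
merged = merge_partial_attention(out_sys, lse_sys, out_req, lse_req)

# Sanity check: the merged result matches attention over the full sequence.
full, _ = softmax_attention(q,
                            np.concatenate([k_sys, k_req]),
                            np.concatenate([v_sys, v_req]))
assert np.allclose(merged, full)
print("partial-attention merge matches full attention")
```

The point of the split, as I read the abstract's motivation, is that the system-prompt portion can then be served from a single copy of its KV cache for the whole batch instead of being re-read once per request, which is where the memory-access savings for long system prompts would come from.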