Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
April 16, 2024, 4:42 a.m. | Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover
cs.LG updates on arXiv.org
Abstract: During inference for transformer-based large language models (LLMs), prefilling is the computation of the key-value (KV) cache for the input tokens in the prompt, prior to autoregressive generation. For longer input prompts, prefilling incurs a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing prompts of highly varying lengths, significant computation is wasted by the standard practice of padding every sequence to the maximum length. As LLMs increasingly …
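To make the padding pitfall concrete, here is a minimal sketch (not the paper's implementation) that counts the prefill tokens processed under max-length padding versus a simple greedy packing of prompts into shared sequences, which is the intuition behind prepacking. The batch lengths and the first-fit-decreasing packing heuristic are illustrative assumptions.

```python
# Illustrative sketch: how much prefill computation does padding waste?
# A real prepacking implementation would also build block-diagonal
# attention masks and per-prompt position ids so packed prompts
# cannot attend to one another; this toy only counts tokens.

def padded_tokens(lengths):
    """Tokens processed when every prompt is padded to the batch max."""
    return max(lengths) * len(lengths)

def packed_tokens(lengths, max_len=None):
    """Greedily pack prompts (first-fit decreasing) into sequences of
    capacity max_len (default: the longest prompt), then count tokens."""
    cap = max_len or max(lengths)
    bins = []  # remaining capacity of each packed sequence
    for n in sorted(lengths, reverse=True):
        for i, free in enumerate(bins):
            if free >= n:
                bins[i] -= n  # fits into an existing sequence
                break
        else:
            bins.append(cap - n)  # open a new sequence
    return cap * len(bins)

# Hypothetical batch with highly varying prompt lengths.
lengths = [512, 32, 48, 500, 16, 64]
print(padded_tokens(lengths))  # 3072: 6 prompts padded to 512
print(packed_tokens(lengths))  # 1536: 3 packed sequences of 512
```

With this batch, packing halves the tokens the prefill pass must process; the saving grows with the spread of prompt lengths in the batch.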