Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
April 16, 2024, 4:42 a.m. | Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover
cs.LG updates on arXiv.org
Abstract: During inference for transformer-based large language models (LLMs), prefilling is the computation of the key-value (KV) cache for the input tokens in the prompt prior to autoregressive generation. For longer input prompts, prefilling incurs a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing prompts of highly varying lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly …
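The padding waste the abstract describes, and the packing idea in the paper's title, can be illustrated with a minimal sketch. This is not the paper's implementation; the greedy first-fit packer and the function names here are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): how much prefill compute is
# wasted when a batch of prompts is padded to the maximum length, and how
# packing short prompts together into fewer sequences reduces that waste.

def padding_waste(prompt_lengths):
    """Fraction of prefill token positions that are padding."""
    max_len = max(prompt_lengths)
    total_positions = max_len * len(prompt_lengths)
    real_tokens = sum(prompt_lengths)
    return (total_positions - real_tokens) / total_positions

def pack_prompts(prompt_lengths, bin_size):
    """Greedy first-fit bin packing: group prompts into sequences whose
    combined length fits in bin_size, so they share one batch row."""
    bins = []
    for length in sorted(prompt_lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= bin_size:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

# A batch with highly varying prompt lengths wastes most of the prefill:
batch = [512, 32, 48, 16]
print(f"padded positions wasted: {padding_waste(batch):.0%}")  # prints "70%"

# Packing the three short prompts into one row shrinks the batch from
# four padded rows of length 512 down to two rows.
print(pack_prompts(batch, bin_size=512))  # [[512], [48, 32, 16]]
```

In a real implementation, packed sequences also need an adjusted attention mask and position IDs so prompts sharing a row do not attend to each other; that bookkeeping is omitted here.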