April 16, 2024, 4:42 a.m. | Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover

cs.LG updates on arXiv.org arxiv.org

arXiv:2404.09529v1 Announce Type: new
Abstract: During inference for transformer-based large language models (LLM), prefilling is the computation of the key-value (KV) cache for input tokens in the prompt prior to autoregressive generation. For longer input prompt lengths, prefilling will incur a significant overhead on decoding time. In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs increasingly …
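As a rough illustration of the pitfall the abstract describes, here is a minimal sketch (not from the paper; the prompt lengths and the helper function are hypothetical) that counts how many prefill token positions are spent on padding when a batch is padded to its longest prompt:

```python
# Minimal sketch of the padding pitfall during batched prefill: every sequence is
# padded to the maximum prompt length in the batch, so computation is spent on
# padding positions that contribute nothing to the KV cache needed for generation.
# The prompt lengths below are hypothetical.

from typing import List


def padded_prefill_waste(prompt_lengths: List[int]) -> float:
    """Return the fraction of prefill token positions that are padding
    when every sequence in the batch is padded to the maximum length."""
    max_len = max(prompt_lengths)
    padded_positions = max_len * len(prompt_lengths)  # positions actually processed
    useful_positions = sum(prompt_lengths)            # positions that fill the KV cache
    return 1.0 - useful_positions / padded_positions


# Example: one long prompt forces heavy padding on the rest of the batch.
lengths = [512, 32, 48, 64]  # hypothetical prompt lengths in tokens
print(f"wasted prefill fraction: {padded_prefill_waste(lengths):.2%}")
# -> roughly 68% of prefill positions are padding in this batch
```

Note that this only counts token positions; since self-attention cost grows superlinearly with sequence length, the actual wasted compute in the attention layers is even larger than this fraction suggests.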

