[D] How is it that the latency to decode 1 new token with an LLM is constant independent of total sequence length, when caching KV?
Nov. 24, 2023, 6:06 p.m. | /u/CatfishJones96
Machine Learning www.reddit.com
**How is it that the latency to decode 1 new token is constant independent of total sequence length (input+output)?**
Let’s assume batch size 1 and simple multi-head attention. At each step t, even though we avoid recomputing K and V for the entire sequence, we still have to compute attention between the current input’s Q and a growing KV cache, which means more FLOPs …
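For readers who want to see the computation the question describes, here is a minimal sketch (not from the original post) of one decode step against a KV cache. It assumes a single head and batch size 1 to match the post's setup; the hypothetical `decode_step` function simply appends the new token's K/V to the cache and lets the new Q attend over everything cached, so the per-step work is O(t·d) and does grow with the cached length t, exactly as the poster observes.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One decode step for a single head with a KV cache.

    q_new, k_new, v_new: (d,) projections of the newly generated token.
    k_cache, v_cache:    (t, d) keys/values of all previous tokens.
    Returns the attention output (d,) and the updated caches (t+1, d).
    """
    d = q_new.shape[-1]

    # Append the new token's K and V to the cache (old K/V are never recomputed).
    k_cache = np.concatenate([k_cache, k_new[None, :]], axis=0)  # (t+1, d)
    v_cache = np.concatenate([v_cache, v_new[None, :]], axis=0)  # (t+1, d)

    # The new query attends over every cached position: O(t * d) work.
    scores = k_cache @ q_new / np.sqrt(d)           # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over t+1 positions

    out = weights @ v_cache                          # (d,) weighted sum of values
    return out, k_cache, v_cache

# Toy usage: the matmul against the cache grows linearly with sequence length.
d = 64
k_cache = np.zeros((0, d))
v_cache = np.zeros((0, d))
for t in range(4):
    q, k, v = (np.random.randn(d) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
    print(t, k_cache.shape)  # cache length grows by 1 each step
```

A multi-head version would just repeat this computation per head; the point of the sketch is only to make concrete which part of the per-token cost scales with the cached sequence length.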