Nov. 24, 2023, 6:06 p.m. | /u/CatfishJones96

r/MachineLearning (www.reddit.com)

I’m struggling to understand something about [transformer inference arithmetic](https://kipp.ly/transformer-inference-arithmetic/) with KV caching, and how it squares with some benchmarking results.

**How is it that the latency to decode 1 new token is constant, independent of the total sequence length (input + output)?**

Let’s assume batch size 1 and simple multi-head attention. At each step t, even though we avoid recomputing K and V for the entire sequence, we still have to compute attention between the current token’s Q and a growing KV cache, which means more FLOPs …
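
To make the question concrete, here is the rough per-step arithmetic I have in mind. The model sizes (`d_model`, `n_layers`, `n_params`) are made-up placeholders for a ~13B-class model, and this ignores softmax and other small terms:

```python
# Back-of-the-envelope FLOPs for decoding ONE new token at step t with a KV cache.
# Model sizes below are assumptions for illustration only.

d_model = 5120       # hidden size (assumed)
n_layers = 40        # number of transformer blocks (assumed)
n_params = 13e9      # total parameter count (assumed)

def decode_flops(t: int) -> tuple[float, float]:
    """FLOPs to generate one token when the KV cache already holds t tokens."""
    # Matmuls against the weights (QKV/output projections + MLP):
    # ~2 FLOPs per parameter per token, independent of t.
    weight_flops = 2 * n_params
    # Attention against the cache: the QK^T scores plus the weighted sum over V
    # cost ~2 * 2 * d_model FLOPs per cached token per layer, so this grows with t.
    attention_flops = 4 * n_layers * d_model * t
    return weight_flops, attention_flops

for t in (128, 1024, 8192):
    w, a = decode_flops(t)
    print(f"t={t:5d}  weights: {w:.2e}  attention-over-cache: {a:.2e}  ratio: {a / w:.3f}")
```

If that arithmetic is right, the part that grows with t stays a small fraction of the fixed per-token weight work until the context gets quite long, and my understanding is that the decode step tends to be bound by reading the weights from memory anyway; but that is exactly the part I’d like someone to sanity-check against the benchmarks.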

