Feb. 12, 2024, 4 p.m. | /u/choHZ

r/MachineLearning | www.reddit.com

Batch inference is a common practice for efficient LLM serving (and one of the main reasons services like ChatGPT can have an initial delay). Batching is motivated by the fact that decoding latency is dominated by the I/O cost of streaming the model weights from memory rather than by the actual compute, so serving multiple requests together adds only a tolerable latency increase while bringing massive savings in cost per token. However, one issue of batched …
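To make the memory-bound argument concrete, here is a rough back-of-envelope sketch (all hardware numbers below are illustrative assumptions, not measurements): if each decoding step has to stream the full weights from GPU memory, the step time is roughly weights / bandwidth no matter how many requests share that step, so per-token cost falls almost linearly with batch size until compute (or KV-cache memory) becomes the bottleneck.

```python
# Back-of-envelope model of why batching helps memory-bound LLM decoding.
# All constants are illustrative assumptions, not measurements.

WEIGHT_BYTES = 14e9          # assumed: ~7B-parameter model in fp16 (~14 GB)
MEM_BANDWIDTH = 2.0e12       # assumed: ~2 TB/s of HBM bandwidth
COMPUTE_FLOPS = 300e12       # assumed: ~300 TFLOP/s of usable fp16 compute
FLOPS_PER_TOKEN = 2 * 7e9    # ~2 FLOPs per parameter per generated token

def step_time(batch_size: int) -> float:
    """One decoding step: the larger of weight-streaming time and compute time."""
    io_time = WEIGHT_BYTES / MEM_BANDWIDTH                      # paid once, shared by the whole batch
    compute_time = batch_size * FLOPS_PER_TOKEN / COMPUTE_FLOPS # grows with batch size
    return max(io_time, compute_time)

for b in (1, 8, 32, 128):
    t = step_time(b)
    print(f"batch={b:4d}  step={t * 1e3:6.2f} ms  per-token={t / b * 1e3:6.3f} ms")
```

Under these assumed numbers the step time stays pinned at the weight-streaming time up to batch sizes of roughly 100+, which is why per-token cost drops so sharply with batching; the KV cache (the subject of this post) is what eventually eats the memory headroom that makes such large batches possible.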

