[R][P] KV Cache is huge and bottlenecks LLM inference. We quantize them to 2bit in a finetuning-free + plug-and-play fashion. | allainews.com

Feb. 12, 2024, 4 p.m. | /u/choHZ

Machine Learning www.reddit.com

It is well known that batched inference is common practice for efficient LLM serving (which is one primary reason why services like ChatGPT have an initial delay). Batching is motivated by the fact that inference latency is limited mostly by the I/O cost of loading the model rather than by the actual compute, so serving multiple requests in a batch adds a tolerable latency increase while bringing massive savings in cost per token. However, one issue of batched …
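To make the batching argument concrete, here is a rough back-of-the-envelope sketch, not anything taken from the post itself. It assumes a decode step costs one full read of the weights plus per-request compute, with no overlap between the two. All hardware and model numbers are hypothetical placeholders. The second helper estimates KV cache size for the title's claim that the cache is huge; it ignores any quantization metadata (scales, zero-points), so the 2-bit figure is an optimistic lower bound.

```python
def decode_step_latency_s(batch_size, weight_bytes, mem_bandwidth_bytes_per_s,
                          flops_per_token, peak_flops_per_s):
    """Toy latency model: weight-loading time plus compute time (no overlap assumed)."""
    load_time = weight_bytes / mem_bandwidth_bytes_per_s        # paid once per step
    compute_time = batch_size * flops_per_token / peak_flops_per_s  # grows with batch
    return load_time + compute_time


def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bits_per_value):
    """Raw KV cache size: one key and one value vector per token, layer, and head."""
    num_values = 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim
    return num_values * bits_per_value / 8


# Hypothetical setup: 7e9 parameters in 16-bit, 1 TB/s memory, 100 TFLOP/s peak.
WEIGHT_BYTES = 7e9 * 2
BANDWIDTH = 1e12
FLOPS_PER_TOKEN = 2 * 7e9   # rough rule of thumb: about 2 FLOPs per parameter
PEAK_FLOPS = 100e12

for batch in (1, 32):
    step = decode_step_latency_s(batch, WEIGHT_BYTES, BANDWIDTH,
                                 FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"batch={batch:3d}  step latency={step * 1e3:.2f} ms  "
          f"per-token cost={step / batch * 1e3:.3f} ms")

# Hypothetical model shape: 32 layers, 32 KV heads, head dim 128, 4096-token context.
for bits in (16, 2):
    size = kv_cache_bytes(32, 4096, 32, 32, 128, bits)
    print(f"{bits:2d}-bit KV cache at batch 32: {size / 2**30:.0f} GiB")
```

Under these assumed numbers the step latency grows only modestly from batch 1 to batch 32 while the per-token cost drops by more than an order of magnitude, which is the trade-off the paragraph describes. The KV cache, by contrast, scales linearly with batch size and context length.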


