June 5, 2024, 4:52 a.m. | Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

cs.CL updates on arXiv.org

arXiv:2406.02532v1 Announce Type: new
Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When …

