SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
June 5, 2024, 4:52 a.m. | Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
cs.CL updates on arXiv.org
Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When …
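For context, the mechanism the abstract builds on is draft-then-verify speculative decoding: a small draft model cheaply proposes several tokens, and the large target model verifies them in a single pass, accepting or rejecting each one so the output distribution matches the target model exactly. Below is a minimal toy sketch of that standard accept/reject loop, not SpecExec's massively parallel variant; the toy_dist, draft_dist, and target_dist stand-ins and the tiny vocabulary are illustrative assumptions, not anything from the paper.

```python
import random

# Toy "models": each maps a token sequence to a probability distribution
# over a small vocabulary. Stand-ins for the draft and target LLMs.
VOCAB = list(range(8))

def toy_dist(seq, seed):
    rng = random.Random(hash((tuple(seq), seed)))
    weights = [rng.random() + 0.1 for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def draft_dist(seq):   # small, fast draft model
    return toy_dist(seq, seed=1)

def target_dist(seq):  # large, slow target model
    return toy_dist(seq, seed=2)

def speculative_step(seq, k=4):
    """One round of standard speculative decoding: draft k tokens,
    then verify them against the target model's distribution.
    (The usual bonus token after k acceptances is omitted for brevity.)"""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal = []
    ctx = list(seq)
    for _ in range(k):
        p = draft_dist(ctx)
        tok = random.choices(VOCAB, weights=p)[0]
        proposal.append((tok, p[tok]))
        ctx.append(tok)

    # 2. Target model scores the drafted positions (conceptually one pass).
    accepted = list(seq)
    for tok, q in proposal:
        p = target_dist(accepted)[tok]
        # Accept with probability min(1, p/q); on rejection, resample
        # from the residual distribution max(target - draft, 0) and stop.
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            full = target_dist(accepted)
            resid = [max(full[t] - draft_dist(accepted)[t], 0.0) for t in VOCAB]
            z = sum(resid)
            weights = [r / z for r in resid] if z > 0 else full
            accepted.append(random.choices(VOCAB, weights=weights)[0])
            break
    return accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```

In the offloading setting the abstract describes, each target-model pass is dominated by moving weights from RAM or SSD to the GPU, so verifying many drafted tokens per pass amortizes that transfer cost, which is the lever SpecExec pushes on.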