April 25, 2024, 4:13 p.m. | /u/SeawaterFlows

Machine Learning www.reddit.com

**Paper**: [https://arxiv.org/abs/2402.11131](https://arxiv.org/abs/2402.11131)

**Abstract**:

>Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose **Speculative Streaming**, a single-model speculative decoding method that fuses drafting into the target model by …
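Since the abstract is truncated, here is a minimal sketch of the *baseline* two-model speculative decoding loop that Speculative Streaming aims to replace with a single fused model. The `draft_model` and `target_model` functions below are toy stand-ins (simple arithmetic rules, not real LMs) chosen so the loop is runnable; the accept/verify logic is the standard draft-then-verify pattern, not the paper's method.

```python
import random

# Toy stand-ins for the draft and target models: each maps a token
# sequence to a next token. These are illustrative assumptions,
# not the paper's fused single-model design.
def draft_model(tokens):
    # Cheap draft: next token is (last + 1) mod 10.
    return (tokens[-1] + 1) % 10

def target_model(tokens):
    # "Expensive" target: same rule, but it occasionally disagrees,
    # so some drafted tokens get rejected.
    nxt = (tokens[-1] + 1) % 10
    return nxt if random.random() < 0.9 else (nxt + 1) % 10

def speculative_decode(prompt, n_new, k=4):
    """Classic two-model speculative decoding loop: the draft proposes
    k tokens; the target verifies them and keeps the longest matching
    prefix plus one corrected token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        # 1) Draft k speculative tokens autoregressively.
        spec, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_model(ctx)
            spec.append(t)
            ctx.append(t)
        # 2) Verify: the target checks each position (a single batched
        #    forward pass in a real system; a loop here for clarity).
        accepted = []
        for i in range(k):
            want = target_model(tokens + accepted)
            if spec[i] == want:
                accepted.append(spec[i])
            else:
                accepted.append(want)  # target's correction; stop here
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]

print(speculative_decode([0], 8))
```

The speedup comes from step 2: one target-model pass can validate up to `k` drafted tokens at once, so the expensive model runs far fewer times than in plain autoregressive decoding. Speculative Streaming removes the separate `draft_model` entirely by generating the speculative tokens inside the target model itself.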
