March 18, 2024, 4:41 a.m. | Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh

cs.LG updates on arXiv.org

arXiv:2403.10444v1 Announce Type: new
Abstract: Speculative decoding has been shown to be an effective method for lossless acceleration of large language models (LLMs) during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The large model then verifies the tokens in parallel, and only a subset of them is kept, which guarantees that the final output follows the distribution of the large model. In all of the prior speculative decoding …
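
The draft-then-verify step described in the abstract can be illustrated with a small example. What follows is a minimal sketch of the token-by-token acceptance rule commonly used in speculative decoding, not necessarily the verification procedure this paper studies. The names `verify_draft` and `sample`, and the representation of each model's next-token distribution as a plain dict, are assumptions made for illustration only.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Verify a drafted block token by token (illustrative sketch).

    draft_tokens: tokens proposed by the small model.
    q_probs[i]:   small-model distribution at position i (dict token -> prob).
    p_probs[i]:   large-model distribution at position i; it has one more
                  entry than draft_tokens, for the token after the block.
    Returns the accepted prefix plus exactly one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0), which is
        # what keeps the output distributed according to the large model.
        residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
        out.append(sample(residual, rng))
        return out
    # Every drafted token was accepted: take a bonus token from the large model.
    out.append(sample(p_probs[len(draft_tokens)], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    q = [{"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}]
    p = [{"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}, {"a": 1.0}]
    print(verify_draft(["a", "a"], q, p, rng))
```

In this sketch the large model's distributions `p_probs` are assumed to have been computed for all positions in one parallel pass, which is where the speedup comes from; the loop itself only decides how many drafted tokens to keep.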
