Nov. 5, 2023, 6:44 a.m. | Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

cs.LG updates on arXiv.org

Large Language Models (LLMs) from the GPT family have become extremely
popular, leading to a race towards reducing their inference costs to allow for
efficient local computation. Yet, the vast majority of existing work focuses on
weight-only quantization, which can reduce runtime costs in the memory-bound
one-token-at-a-time generative setting, but does not address them in
compute-bound scenarios, such as batched inference or prompt processing. In
this paper, we address the general quantization problem, where both weights and
activations should be …
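To illustrate the distinction the abstract draws, the sketch below shows joint weight-and-activation quantization in its simplest form: when both operands are reduced to int8, the matrix multiply itself can run in low precision, which is what helps compute-bound batched inference or prompt processing, whereas weight-only quantization still performs the multiply in floating point. This is a minimal sketch assuming symmetric per-tensor int8 quantization; the function and variable names are illustrative and it is not the authors' scheme.

```python
# Minimal sketch of weight-and-activation quantization (illustrative only,
# not the paper's method). Both operands are mapped to int8 so the matmul
# can accumulate in integer arithmetic, with a single dequantization at the end.
import numpy as np

def symmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Per-tensor symmetric quantization: x is approximated by scale * q, q integer."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 64)).astype(np.float32)   # a small batch of tokens
weights = rng.standard_normal((64, 64)).astype(np.float32)      # one linear layer

q_act, s_act = symmetric_quantize(activations)
q_w, s_w = symmetric_quantize(weights)

# Integer matmul (int32 accumulation), then one dequantization step.
y_int = q_act @ q_w
y = y_int.astype(np.float32) * (s_act * s_w)

print("max abs error vs. fp32 matmul:", np.max(np.abs(y - activations @ weights)))
```

In a weight-only scheme, only `weights` would be quantized and the product would still be computed in fp16/fp32, so memory traffic drops but the arithmetic cost does not.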
