Aug. 18, 2022, 5:28 p.m. | /u/Singularian2501


Paper: [https://arxiv.org/abs/2208.07339](https://arxiv.org/abs/2208.07339)

GitHub: [https://github.com/timdettmers/bitsandbytes](https://github.com/timdettmers/bitsandbytes)

Software Blogpost: [https://huggingface.co/blog/hf-bitsandbytes-integration](https://huggingface.co/blog/hf-bitsandbytes-integration)

Emergent Features Blogpost: [https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)

Abstract:

>Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible …
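The truncated abstract points at the core mechanism: quantize activations and weights to Int8 with per-vector absmax scaling, run the matmul in integer arithmetic, then rescale the result back to floating point. Below is a minimal NumPy sketch of that vector-wise absmax idea, assuming row-wise scales for the activations and column-wise scales for the weights; the function names are hypothetical, this is not the bitsandbytes CUDA implementation, and it omits the mixed-precision decomposition for outlier features that the full LLM.int8() method adds on top.

```python
# Minimal sketch of vector-wise absmax Int8 matmul (hypothetical helpers,
# not the bitsandbytes kernels). Omits LLM.int8()'s outlier decomposition.
import numpy as np

def quantize_absmax(a: np.ndarray, axis: int):
    """Scale each vector along `axis` into [-127, 127] and round to int8."""
    scale = 127.0 / np.max(np.abs(a), axis=axis, keepdims=True)
    return np.round(a * scale).astype(np.int8), scale

def int8_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Approximate X @ W with an Int8 matmul plus a float rescale."""
    Xq, sx = quantize_absmax(X, axis=1)   # one scale per row of X
    Wq, sw = quantize_absmax(W, axis=0)   # one scale per column of W
    # Accumulate in int32, then undo both scales: (X*sx) @ (W*sw)
    # equals sx_i * sw_j * (X @ W)_ij, so divide by the outer product.
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    return acc / (sx * sw)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64)).astype(np.float32)
W = rng.standard_normal((64, 8)).astype(np.float32)
print(np.max(np.abs(int8_matmul(X, W) - X @ W)))  # small quantization error
```

In practice you don't implement this yourself: per the software blogpost linked above, the Hugging Face integration exposes the method through `load_in_8bit=True` in `from_pretrained`.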

