Feb. 27, 2024, 2:27 a.m. | /u/programmerChilli

Machine Learning www.reddit.com

Hey folks, we [released gpt-fast](https://www.reddit.com/r/LocalLLaMA/comments/187rfax/gptfast_a_fast_and_hackable_implementation_of/) last December as a hackable "tutorial" implementation of sorts that achieves SOTA decoding performance for text generation.

Since then, we've also added a Mixtral implementation to gpt-fast. Check it out here: https://github.com/pytorch-labs/gpt-fast/tree/main/mixtral-moe

Featuring

- (!) no custom kernels
- int8 and tensor-parallelism support
- still very simple (<150 LOC to support)
- faster decoding than any (non-Groq) API endpoint, at up to 220 tok/s/user.
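To give a feel for the int8 support mentioned above: the general idea behind weight-only int8 quantization is to store each weight matrix as int8 with a per-output-channel scale, then dequantize on the fly at matmul time. This is a minimal NumPy sketch of that technique, not the repo's actual code (function names here are illustrative; gpt-fast does this with PyTorch tensors and `torch.compile`):

```python
import numpy as np

def quantize_int8(w):
    """Quantize a float weight matrix to int8, one scale per output row.

    This is symmetric per-channel quantization: each row is scaled so its
    largest-magnitude value maps to 127.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def int8_matmul(x, q, scales):
    """Compute x @ w.T from the int8 weights, dequantizing on the fly."""
    return x @ (q.astype(np.float32) * scales).T

# Illustrative check: quantized matmul stays close to the fp32 result.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal((2, 8)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(int8_matmul(x, q, s) - x @ w.T).max()
```

The payoff for decoding is memory bandwidth, not arithmetic: generation is bandwidth-bound, so halving the bytes read per weight roughly doubles achievable tokens/sec.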

I also wrote a longer-form explanation of the …

