Feb. 27, 2024, 2:27 a.m. | /u/programmerChilli

Machine Learning www.reddit.com

Hey folks, we [released gpt-fast](https://www.reddit.com/r/LocalLLaMA/comments/187rfax/gptfast_a_fast_and_hackable_implementation_of/) last December as a hackable "tutorial" implementation of sorts that achieves SOTA decoding performance for text generation.

Since then, we've also added a Mixtral implementation to gpt-fast. Check it out here: https://github.com/pytorch-labs/gpt-fast/tree/main/mixtral-moe

Featuring:

- (!) no custom kernels
- int8 and tensor-parallelism support
- still very simple (<150 LOC to support)
- faster decoding than any (non-Groq) API endpoint, at up to 220 tok/s/user.
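For readers unfamiliar with how a Mixtral-style mixture-of-experts layer works, here is a minimal NumPy sketch of the core idea: a router scores each token against every expert, and only the top-2 experts run per token, with their outputs mixed by softmax weights. This is an illustrative toy (plain matrices as stand-in "experts", no batching tricks), not the gpt-fast code itself:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy top-k MoE layer: route each token to its top-k experts,
    then mix their outputs by softmax weights over the selected logits.

    x:        (tokens, dim) token activations
    gate_w:   (dim, n_experts) router weights
    experts:  list of (dim, dim) matrices standing in for expert MLPs
    """
    logits = x @ gate_w                             # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of each token's top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                # softmax over the selected experts only
        for k, e in enumerate(top[t]):
            out[t] += w[k] * (x[t] @ experts[e])    # weighted sum of expert outputs
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 8))
experts = [rng.standard_normal((8, 8)) for _ in range(8)]
y = moe_forward(x, gate_w, experts)
```

Because only 2 of the 8 experts run per token, the per-token FLOPs stay close to a dense model a quarter the size, which is what makes fast Mixtral decoding feasible without custom kernels.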

I also wrote a longer-form explanation of the …

