all AI news
A practical guide to deploying Large Language Models Cheap, Good *and* Fast
Joel Kang's extremely comprehensive notes on what he learned trying to run Vicuna-13B-v1.5 on an affordable cloud GPU server (a T4 at $0.615/hour). The space is in so much flux right now - Joel ended up using MLC but the best option could change any minute.
Vicuna 13B quantized to 4-bit integers needed 7.5GB of the T4's 16GB of VRAM, and returned tokens at 20/second.
An open challenge …