Sept. 4, 2023, 1:43 p.m.

Simon Willison's Weblog (simonwillison.net)

A practical guide to deploying Large Language Models Cheap, Good *and* Fast


Joel Kang's extremely comprehensive notes on what he learned trying to run Vicuna-13B-v1.5 on an affordable cloud GPU server (a T4 at $0.615/hour). The space is in so much flux right now - Joel ended up using MLC, but the best option could change any minute.
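
For a sense of what the MLC route looks like in practice, here's a minimal sketch of querying a quantized model through MLC LLM's Python bindings. The mlc_chat package and ChatModule API come from MLC's own docs; the model string is an illustrative assumption, not Joel's exact build:

```python
# Minimal sketch: chatting with a 4-bit quantized Vicuna through MLC LLM.
# Assumes mlc_chat is installed and the model has been compiled locally;
# the model name below is illustrative, not Joel's exact artifact.
from mlc_chat import ChatModule

cm = ChatModule(model="vicuna-13b-v1.5-q4f16_1")  # q4 = 4-bit weights
print(cm.generate(prompt="Summarize the tradeoffs of 4-bit quantization."))
```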

Vicuna 13B quantized to 4-bit integers needed 7.5GB of the T4's 16GB of VRAM, and returned tokens at 20 per second.
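
Those two numbers pin down the economics. A quick back-of-the-envelope, assuming the GPU generates tokens continuously (it ignores prompt processing and idle time):

```python
# Back-of-the-envelope cost from the figures above.
tokens_per_hour = 20 * 3600              # 20 tokens/second, sustained
cost_per_hour = 0.615                    # USD/hour for the T4
cost_per_million = cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per million tokens")  # ≈ $8.54
```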

An open challenge …

ai, cloud, generativeai, gpu, guide, large language models, llama, llms, mlc, vicuna
