Quoting Andrej Karpathy
llama.cpp surprised many people (myself included) with how quickly you can run large LLMs on small computers [...] TLDR at batch_size=1 (i.e. just generating a single stream of prediction on your computer), the inference is super duper memory-bound. The on-chip compute units are twiddling their thumbs while sucking model weights through a straw from DRAM. [...] A100: 1935 GB/s memory bandwidth, 1248 TOPS. MacBook M2: 100 GB/s, 7 TFLOPS. The compute is ~200X but the memory bandwidth only ~20X. So …
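Karpathy's point implies a simple back-of-envelope bound: at batch_size=1, every generated token requires streaming all model weights from DRAM once, so decode speed is capped at roughly memory bandwidth divided by model size in bytes. A minimal sketch of that arithmetic, using the bandwidth figures from the quote and a hypothetical 7B model quantized to ~4 bits per weight (~3.5 GB; the model size is my assumption, not from the post):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on single-stream decode speed for a
    memory-bound LLM: one full pass over the weights per token."""
    return bandwidth_gb_s / model_gb

# Hypothetical example: a 7B model at ~4 bits/weight is ~3.5 GB.
model_gb = 3.5

# Bandwidth numbers are taken from the quote above.
for name, bw_gb_s in [("MacBook M2", 100.0), ("A100", 1935.0)]:
    print(f"{name}: <= {max_tokens_per_sec(bw_gb_s, model_gb):.0f} tokens/sec")
# MacBook M2: <= 29 tokens/sec
# A100: <= 553 tokens/sec
```

Under these assumptions the M2 lands near 29 tokens/sec, which is why aggressive quantization (shrinking the bytes streamed per token) speeds up laptop inference even though the laptop's compute units were never the bottleneck.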