Oct. 16, 2023, 6:58 a.m. | /u/jimmymvp

r/MachineLearning | www.reddit.com

I'm just puzzled about how one efficiently queries a huge transformer model such that so many users can be served at the same time. Is it queried on a per-user basis (modulo some caching)? If yes, how expensive is this? If no, what the hell is going on? :D Are there any good resources on this (how to build large-scale apps with big models, from scratch)? Somehow this doesn't really fit the standard data-intensive system design process, or …
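To make the question concrete, here's a minimal sketch of the micro-batching idea I imagine serving systems use: many users' requests land in a queue, a worker groups them into a batch, and one forward pass serves the whole batch. Everything here is a hypothetical stand-in, not any vendor's actual implementation — `dummy_model_batch` fakes the model, and `MAX_BATCH` / `MAX_WAIT_S` are illustrative knobs.

```python
import queue
import threading
import time
from dataclasses import dataclass, field

# Hypothetical stand-in for the real model: in production this would be a
# transformer forward pass on a GPU, where one pass over a batch of 32
# prompts costs roughly the same as a pass over 1 prompt.
def dummy_model_batch(prompts):
    time.sleep(0.05)  # simulate one fixed-cost forward pass
    return [f"response to: {p}" for p in prompts]

@dataclass
class Request:
    prompt: str
    done: threading.Event = field(default_factory=threading.Event)
    result: str = ""

request_queue: "queue.Queue[Request]" = queue.Queue()

MAX_BATCH = 32      # illustrative: cap batch size to fit device memory
MAX_WAIT_S = 0.01   # illustrative: how long to wait to fill a batch

def batching_worker():
    while True:
        batch = [request_queue.get()]  # block until at least one request
        deadline = time.monotonic() + MAX_WAIT_S
        # Collect more requests until the batch is full or the deadline hits.
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        outputs = dummy_model_batch([r.prompt for r in batch])
        for req, out in zip(batch, outputs):
            req.result = out
            req.done.set()  # wake the user thread waiting on this request

threading.Thread(target=batching_worker, daemon=True).start()

def serve(prompt: str) -> str:
    """Called once per user request; batching happens behind the scenes."""
    req = Request(prompt)
    request_queue.put(req)
    req.done.wait()
    return req.result

# Simulate several concurrent users hitting the server at once.
if __name__ == "__main__":
    threads = [threading.Thread(target=lambda i=i: print(serve(f"user {i}")))
               for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

My understanding is that real serving stacks go well beyond this — e.g. continuous batching that admits new requests mid-generation and KV-cache management, as in vLLM — which is exactly the kind of resource I'm hoping someone can point to.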
