Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
May 20, 2024, 4:42 a.m. | Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan
cs.LG updates on arXiv.org
Abstract: Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of state-of-the-art models has increased steadily, reaching billions of parameters. These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., prompt …
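The quadratic cost the abstract refers to comes from the pairwise score matrix in standard attention. The sketch below is not the paper's LeanAttention; it is a minimal plain scaled dot-product attention in NumPy, shown only to illustrate why time and memory grow as O(n²) in the context length n (all names here are illustrative).

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (n, d). The score matrix S is (n, n), so both the
    # matmul and the softmax cost O(n^2) time and memory in n.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # (n, n) pairwise scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (n, d) output

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Doubling the prompt length n quadruples the size of S, which is why long-context decoding is dominated by attention even on fast accelerators.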