FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. (arXiv:2205.14135v1 [cs.LG])
May 30, 2022, 1:11 a.m. | Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
cs.LG updates on arXiv.org
Transformers are slow and memory-hungry on long sequences, since the time and
memory complexity of self-attention are quadratic in sequence length.
Approximate attention methods have attempted to address this problem by trading
off model quality to reduce the compute complexity, but often do not achieve
wall-clock speedup. We argue that a missing principle is making attention
algorithms IO-aware -- accounting for reads and writes between levels of GPU
memory. We propose FlashAttention, an IO-aware exact attention algorithm that
uses tiling …
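
To make the tiling idea concrete, here is a minimal NumPy sketch of attention computed block-by-block with an online softmax, so the full n x n score matrix is never materialized. The function names, the block size of 64, and the pure-Python loop are illustrative assumptions for this sketch; FlashAttention itself is a fused GPU kernel that keeps these blocks in on-chip SRAM rather than looping in Python.

import numpy as np

def attention_reference(q, k, v):
    # Standard attention: materializes the full (n x n) score matrix.
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def attention_tiled(q, k, v, block=64):
    # Tiled attention with an online softmax: K/V are consumed one block
    # at a time, so no n x n matrix is ever formed. Running statistics
    # (row max m, denominator l) are rescaled as each block arrives.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)      # running row-wise max of the scores
    l = np.zeros(n)              # running softmax denominator
    acc = np.zeros((n, d))       # unnormalized output accumulator
    for j in range(0, k.shape[0], block):
        kj, vj = k[j:j + block], v[j:j + block]
        s = (q @ kj.T) * scale                 # scores against this block
        m_new = np.maximum(m, s.max(axis=-1))  # updated running max
        alpha = np.exp(m - m_new)              # factor to rescale old stats
        p = np.exp(s - m_new[:, None])         # this block's softmax numerator
        l = l * alpha + p.sum(axis=-1)
        acc = acc * alpha[:, None] + p @ vj
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(attention_reference(q, k, v), attention_tiled(q, k, v))

The final assertion checks the key property the abstract claims: tiling with rescaled running statistics reproduces exact attention, not an approximation; the IO savings come from never reading or writing an n x n intermediate.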