Aug. 11, 2022, 6:28 a.m. | /u/tororo-in

r/MachineLearning · www.reddit.com

There are so many attention mechanisms (guided attention, Apple's AFT, the Nyströmformer, etc.) that reduce the O(n²) cost of self-attention to something closer to O(n). Why don't recent LMs use these techniques to speed up training and the matrix multiplications in the self-attention layer?
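For context, the speedup these methods target comes from never materializing the n×n score matrix. Below is a minimal NumPy sketch contrasting standard quadratic attention with kernelized "linear" attention; the feature map phi(x) = elu(x) + 1 follows the linear-transformer formulation of Katharopoulos et al. and is only illustrative, not the implementation used by AFT, the Nyströmformer, or any particular LM.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2 * d): the full n x n score matrix is formed explicitly.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def linear_attention(Q, K, V, eps=1e-6):
    # O(n * d^2): associativity lets us compute phi(K)^T V once,
    # so the n x n matrix is never materialized.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                  # (d, d_v) summary of keys and values
    z = Qp @ Kp.sum(axis=0)        # (n,) per-query normalizer
    return (Qp @ kv) / (z[:, None] + eps)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(standard_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The outputs are not identical (the kernel replaces the softmax), which is one reason these variants are not drop-in replacements at scale.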

attention machinelearning sota
