March 11, 2024, 7:56 p.m. | /u/benthehuman_

Machine Learning www.reddit.com

I could have sworn I skimmed a paper around a year ago which demonstrated pretty solid performance in transformers where the Value and Key (or Query) projection weights were shared within each attention layer. I think Linformer does something similar, but I'm not looking for something that tries to solve the quadratic runtime of attention, just something that shows you can get reasonable results with shared value and key weights. It might've even been mentioned in this subreddit. Somehow I …
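For concreteness, here is a minimal sketch (in PyTorch) of the kind of weight tying I mean: a single projection matrix that serves as both the Key and Value projection inside a multi-head attention layer. The module name `SharedKVAttention` is just for illustration and is not from the paper I'm trying to find.

```python
# Minimal sketch of multi-head self-attention where the Key and Value
# projections share one weight matrix (the tying described above).
import math
import torch
import torch.nn as nn


class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # One projection used for both keys and values (the shared weights).
        self.kv_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x)
        kv = self.kv_proj(x)  # the same tensor is reused as K and as V

        def split(z):
            # (b, t, d) -> (b, n_heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(kv), split(kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)


if __name__ == "__main__":
    layer = SharedKVAttention(d_model=64, n_heads=4)
    x = torch.randn(2, 10, 64)
    print(layer(x).shape)  # torch.Size([2, 10, 64])
```

(Tying K to Q instead would just mean reusing `q_proj` for the keys; the question is about which variant the paper showed still gives reasonable results.)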

