Nov. 26, 2023, 9:59 p.m. | /u/lildaemon

r/MachineLearning · www.reddit.com

The only time the query and key matrices are used is to compute the attention scores, that is, $v_i^T W_q^T W_k v_j$. But what is actually used is the product $W_q^T W_k$. Why not replace $W_q^T W_k$ with a single matrix $W_{qk}$, and learn that product directly instead of the two factors themselves? How does it help to have two matrices instead of one? And if it helps, why is that not done …
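For concreteness, here is a minimal numpy sketch of the equivalence the question describes (all shapes and variable names are illustrative, not from the post): the two-matrix score and the folded single-matrix score are numerically identical, and the sketch makes the parameter-count and rank difference between the two parameterizations visible.

```python
import numpy as np

# Assumed setup (not from the post): single-head attention with
# model dimension d_model and head dimension d_head < d_model.
d_model, d_head = 8, 2
rng = np.random.default_rng(0)

W_q = rng.normal(size=(d_head, d_model))  # query projection
W_k = rng.normal(size=(d_head, d_model))  # key projection
v_i = rng.normal(size=d_model)            # embedding of token i
v_j = rng.normal(size=d_model)            # embedding of token j

# Two-matrix form: score = (W_q v_i)^T (W_k v_j) = v_i^T W_q^T W_k v_j
score_two = (W_q @ v_i) @ (W_k @ v_j)

# Single-matrix form: fold the product into one matrix W_qk = W_q^T W_k
W_qk = W_q.T @ W_k                        # (d_model, d_model), rank <= d_head
score_one = v_i @ W_qk @ v_j

assert np.allclose(score_two, score_one)  # identical attention scores

# Parameter count: 2 * d_head * d_model (factored) vs d_model ** 2 (folded).
print(score_two, score_one)
```

One thing the sketch surfaces: since $W_q$ and $W_k$ each map to the smaller head dimension, the factored form uses $2 \, d_{head} \, d_{model}$ parameters rather than $d_{model}^2$, and it constrains $W_q^T W_k$ to rank at most $d_{head}$, so the two parameterizations are not equivalent in capacity or cost even though any given factored score can be reproduced by a single matrix.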

