July 29, 2022, 2:58 p.m. | /u/jwngx


I'm trying to understand multi-head attention but don't quite get how the queries, keys, and values are projected into different subspaces. More specifically, is the same set of weight matrices shared across all heads, or does each head have its own set?

The [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) shows eight sets of weight matrices being used for eight heads. But other implementations I've seen (The [Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/#full-model) and Gordic Aleksa's [implementation](https://github.com/gordicaleksa/pytorch-original-transformer/blob/main/models/definitions/transformer_model.py), as well as his video on his popular channel The AI Epiphany) seem to …
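For concreteness, here is a minimal PyTorch sketch (my own, not code from any of the linked repos) of the fused-projection style those implementations appear to use: a single d_model × d_model projection each for Q, K, and V, whose output is then split into per-head chunks of size d_k. The first d_k output features act as head 0's projection, the next d_k as head 1's, and so on, which is numerically the same as giving each head its own d_model × d_k matrix, just stored in one tensor.

```python
# Minimal multi-head attention sketch (illustrative, not the exact code from the
# linked implementations). Shows how one fused projection per Q/K/V is split
# into heads, which is equivalent to separate per-head weight matrices.
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head subspace size, e.g. 512 / 8 = 64
        # One big projection each for Q, K, V: output features 0..d_k-1 play the
        # role of head 0's W_Q (or W_K, W_V), the next d_k features head 1's, etc.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch, seq_len, d_model = x.shape

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (batch, heads, seq, seq)
        attn = scores.softmax(dim=-1)
        out = attn @ v                                        # (batch, heads, seq, d_k)

        # Concatenate the heads back together and apply the output projection W_O.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)


x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 512])
```

So the eight sets of matrices in the Illustrated Transformer and the single fused matrices in the other implementations describe the same computation, just organized differently in memory.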

