Nov. 17, 2023, 4:02 p.m. | florian

Towards AI - Medium

Multi-Query Attention (MQA) is an attention mechanism that speeds up token generation in the decoder while keeping model quality close to that of standard attention.

It is widely used in the era of large language models: many LLMs adopt MQA, including Falcon, PaLM, and StarCoder.
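
To make the idea concrete, here is a minimal NumPy sketch of MQA, assuming the usual formulation in which every query head has its own projection while all heads share a single key projection and a single value projection. The function name, the weight names (`w_q`, `w_k`, `w_v`), and the toy dimensions are illustrative assumptions, not code from any of the models mentioned above, and the final output projection is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal multi-query self-attention (illustrative sketch).

    x:   (seq, d_model) input activations
    w_q: (d_model, n_heads * d_head) per-head query projections
    w_k: (d_model, d_head) single key projection shared by all heads
    w_v: (d_model, d_head) single value projection shared by all heads
    """
    seq, _ = x.shape
    d_head = w_k.shape[1]

    # Each head gets its own queries: (n_heads, seq, d_head).
    q = (x @ w_q).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Keys and values are computed once and reused by every head.
    k = x @ w_k  # (seq, d_head)
    v = x @ w_v  # (seq, d_head)

    scores = q @ k.T / np.sqrt(d_head)  # (n_heads, seq, seq)
    # Causal mask: a position may not attend to later positions.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    out = weights @ v  # (n_heads, seq, d_head)
    # Concatenate the heads back into (seq, n_heads * d_head).
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)


# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 4, 16, 4, 8
x = rng.normal(size=(seq, d_model))
w_q = rng.normal(size=(d_model, n_heads * d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

out = multi_query_attention(x, w_q, w_k, w_v, n_heads)
print(out.shape)  # (4, 32)
```

Because `k` and `v` have no head dimension, a decoder that caches them during generation stores `2 * seq * d_head` values per layer instead of the `2 * seq * n_heads * d_head` that per-head keys and values would need. That smaller cache is where the decoding speedup comes from.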

Multi-Head Attention (MHA)

Before introducing MQA, let’s first review the default attention mechanism of the transformer.

Multi-Head Attention is the default attention mechanism of the transformer model, as shown in Figure 1: …

