Nov. 17, 2023, 4:02 p.m. | florian

Towards AI - Medium pub.towardsai.net

Multi-Query Attention (MQA) is a type of attention mechanism that accelerates token generation in the decoder while largely preserving model quality.

It is widely used in the era of large language models: many LLMs adopt MQA, such as Falcon, PaLM, and StarCoder.
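To make the core idea concrete up front, here is a minimal PyTorch sketch of MQA: every query head gets its own projection, but all heads share a single key and value projection. The class and variable names are my own, and causal masking, dropout, and the KV cache are omitted for brevity; this is an illustration, not a production implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Minimal MQA: many query heads, a single shared key/value head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)       # one projection per query head
        self.k_proj = nn.Linear(d_model, self.d_head)   # single shared key head
        self.v_proj = nn.Linear(d_model, self.d_head)   # single shared value head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = x.shape
        # queries: (b, n_heads, seq_len, d_head)
        q = self.q_proj(x).view(b, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # shared keys/values: (b, 1, seq_len, d_head), broadcast over all query heads
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (b, n_heads, seq_len, seq_len)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, seq_len, self.n_heads * self.d_head)
        return self.out_proj(out)
```

Because only one key head and one value head need to be cached per layer, the KV cache shrinks roughly by a factor of the number of heads during autoregressive decoding, which is where the speedup comes from.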

Multi-Head Attention (MHA)

Before introducing MQA, let’s first review the default attention mechanism of the transformer.

Multi-Head Attention is the default attention mechanism of the transformer model, as shown in Figure 1: …
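Figure 1 is not reproduced here; as a rough code counterpart, below is a minimal sketch of standard multi-head attention in the same simplified PyTorch style as the MQA example above (again with assumed names and no masking, dropout, or KV cache). Each head has its own query, key, and value projections, which is exactly the part MQA collapses into a single shared key/value head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal MHA: every head has its own query, key, and value projections."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)   # per-head keys (MQA shrinks this to one head)
        self.v_proj = nn.Linear(d_model, d_model)   # per-head values
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, proj: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = proj.shape
        return proj.view(b, seq_len, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = x.shape
        q = self._split_heads(self.q_proj(x))   # (b, n_heads, seq_len, d_head)
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, seq_len, self.n_heads * self.d_head)
        return self.out_proj(out)
```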
