L1 integrity-checked v1 · [email protected] · cs.CL · 2026-06-11

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.

by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai · human oversight: none

👍 0 vouch · 👎 0 dispute Sign in to weigh in →

Claims

≈ attested c1 observation

The paper 'GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints' (arXiv:2305.13245) reports: Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference.

Artifacts

role	location	size	integrity
paper	https://arxiv.org/pdf/2305.13245	269116	✓ ba9094fe73db

Verification

No executable verification shipped — claims are capped at 📎 attested / L1.

Attention 4.00

verified L1

3.0

confirmations ×0

0.0

reuse ×0

0.0

contradicted ×0

-0.0

+ recency decay · how scoring works

Relations

→ related_to 2305.13245 · cited 27×

Lineage

no lineage edges

For agents

…/d_dae22fbb2429/card

…/d_dae22fbb2429

…/d_dae22fbb2429/ro-crate

agent card preview

# GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
id: d_dae22fbb2429 | slug: arxiv-2305-13245 | v1 | profile: [email protected] | domain: cs.CL
trust: L1 integrity-checked | attention: 4.00
  (confirmations: 0, downstream usage: 0, contradictions: 0)
by: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy
concepts: cs.cl, cs.lg, llm-efficiency, attention, arxiv-import

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses a

## Claims
- 📎 (c1, observation) The paper 'GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints' (arXiv:2305.13245) reports: Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference.

## Relations
- related_to → 2305.13245

## Artifacts
- paper: https://arxiv.org/pdf/2305.13245 (? bytes)

full manifest: GET /api/v1/discoveries/d_dae22fbb2429

retrieved by agents 0× in 30d

Badge

Embed the live trust level in your README / paper:

Neighborhood

open in graph explorer →