
Attention Mechanism

The core component of transformer models that allows each token to dynamically focus on relevant parts of the input sequence. It computes weighted relationships between all tokens, enabling the model to understand context and dependencies.

The attention mechanism is what gives transformer models their remarkable ability to understand language. At its core, attention answers the question: "When processing this word, how much should I focus on each of the other words in the sequence?" For each token, the mechanism computes a set of attention weights over the tokens in the sequence, creating a weighted sum that captures relevant context.

Technically, attention operates through three learned projections: queries (Q), keys (K), and values (V). Each token produces a query vector ("what am I looking for?"), a key vector ("what do I contain?"), and a value vector ("what information do I carry?"). The attention score between two tokens is the dot product of one token's query with the other's key. These scores are divided by the square root of the key dimension (to keep their variance stable) and passed through softmax to produce weights, which are then used to form a weighted combination of value vectors: softmax(QKᵀ / √d_k) V. This is the "scaled dot-product attention" formula.
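The formula above can be written out directly. Here is a minimal NumPy sketch of scaled dot-product attention; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns a (seq_len, d_v) array of context vectors."""
    d_k = Q.shape[-1]
    # Pairwise query-key similarities, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (numerically stable): each row sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of value vectors
    return weights @ V
```

Each row of `weights` is one token's attention distribution over the whole sequence, and the output mixes value vectors according to that distribution.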

Modern transformers use multi-head attention, which runs several attention operations in parallel with different learned projections. Each "head" can specialize in different types of relationships — one head might focus on syntactic dependencies, another on semantic similarity, another on positional proximity. The outputs of all heads are concatenated and projected to produce the final attention output. GPT-4 class models typically use 32 to 128 attention heads.
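The split-attend-concatenate pattern described above can be sketched in a few lines of NumPy. This is a simplified single-sequence version (no batching, no masking), and the weight matrices are hypothetical stand-ins for the learned projections:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split the feature dimension into heads: (heads, seq, d_head)
    def project(W):
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    # Each head attends independently: (heads, seq, seq)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V                      # (heads, seq, d_head)
    # Concatenate heads and apply the final output projection
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo
```

Because each head sees only a d_model/n_heads slice of the feature space, the total cost is comparable to one full-width attention, while allowing the heads to specialize.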

Attention is also the main computational bottleneck. Standard attention has O(n²) complexity with respect to sequence length, meaning doubling the context window roughly quadruples the computation. This has driven research into efficient attention variants like FlashAttention (which optimizes memory access patterns), multi-query attention (which shares a single set of keys and values across all heads), and grouped-query attention (a middle ground, where small groups of query heads share keys and values). These optimizations are critical for supporting the 100K+ token context windows in modern models while keeping inference affordable.
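The relationship between multi-head, grouped-query, and multi-query attention can be captured in one sketch: queries keep all their heads, but keys and values have fewer heads, each shared by a group of query heads. This is an illustrative NumPy implementation, with hypothetical weight matrices:

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_heads, n_kv_heads):
    """x: (seq, d_model); Wq: (d_model, d_model);
    Wk, Wv: (d_model, n_kv_heads * d_head).
    n_kv_heads == n_heads  -> standard multi-head attention
    n_kv_heads == 1        -> multi-query attention"""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    group = n_heads // n_kv_heads

    Q = (x @ Wq).reshape(seq, n_heads, d_head)
    K = (x @ Wk).reshape(seq, n_kv_heads, d_head)
    V = (x @ Wv).reshape(seq, n_kv_heads, d_head)
    # Broadcast each KV head to its group of query heads
    K = np.repeat(K, group, axis=1)
    V = np.repeat(V, group, axis=1)

    out = np.empty_like(Q)
    for h in range(n_heads):
        s = Q[:, h] @ K[:, h].T / np.sqrt(d_head)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ V[:, h]
    return out.reshape(seq, d_model)
```

The compute is still O(n²), but the KV cache shrinks by a factor of n_heads / n_kv_heads, which is what makes long-context inference cheaper in practice.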
