Architecture

Transformer

The neural network architecture that underpins virtually all modern large language models. Introduced in the 2017 paper "Attention Is All You Need," it processes text in parallel using self-attention mechanisms.

The transformer architecture is the foundation of modern AI language models. Introduced by Vaswani et al. at Google in 2017, it replaced earlier recurrent neural networks (RNNs) and LSTMs with a design based entirely on attention mechanisms. This seemingly simple change unlocked massive improvements in both performance and training efficiency, enabling the current generation of large language models.

The key innovation of transformers is the self-attention mechanism, which allows every token in a sequence to attend to every other token simultaneously. Unlike RNNs, which process text one token at a time sequentially, transformers process entire sequences in parallel. This parallelism makes them much faster to train on modern GPU hardware, enabling the scaling that produced models with hundreds of billions of parameters.
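The self-attention computation described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration (real models use multiple heads and learned projection matrices per head); the function and variable names are chosen for this sketch, not taken from any particular library.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
    Every token attends to every other token via one matrix multiply --
    there is no sequential loop over positions, unlike an RNN.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                              # each row: weighted mix of value vectors

# Toy usage with random weights
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The entire sequence is handled as one batch of matrix multiplications, which is exactly the shape of workload GPUs excel at; an RNN would instead need `seq_len` dependent steps.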

A transformer consists of stacked layers, each containing a multi-head self-attention block and a feed-forward neural network, with residual connections and layer normalization. The original architecture had both an encoder (for understanding input) and a decoder (for generating output). GPT models use only the decoder stack, BERT uses only the encoder, and T5 uses both. Modern LLMs like GPT-4, Claude, Llama, and Gemini are all decoder-only transformers, optimized for autoregressive text generation.
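A single decoder-style layer of the kind stacked in the models above can be sketched as follows. This is a simplified, single-head sketch using the pre-norm residual layout common in modern decoder-only LLMs; all names and the parameter layout are illustrative, and details like multi-head attention, positional encodings, and learned LayerNorm gains are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv, Wo):
    T = x.shape[0]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Decoder-only: mask out future positions so generation stays autoregressive
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ V @ Wo

def transformer_block(x, params):
    # Residual connection around attention, then around the feed-forward net
    x = x + causal_self_attention(layer_norm(x), *params["attn"])
    h = layer_norm(x) @ params["W1"]
    x = x + np.maximum(h, 0) @ params["W2"]   # two-layer ReLU feed-forward
    return x

# Toy usage: one block applied to a random 5-token sequence
rng = np.random.default_rng(1)
T, d, d_ff = 5, 16, 32
params = {
    "attn": [rng.standard_normal((d, d)) * 0.1 for _ in range(4)],
    "W1": rng.standard_normal((d, d_ff)) * 0.1,
    "W2": rng.standard_normal((d_ff, d)) * 0.1,
}
x = rng.standard_normal((T, d))
y = transformer_block(x, params)
print(y.shape)  # (5, 16)
```

The causal mask is what distinguishes a GPT-style decoder block from a BERT-style encoder block: with the mask, position *i* can only depend on positions up to *i*, so changing a later token never affects earlier outputs.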

The transformer architecture has proven remarkably durable. While researchers have proposed hundreds of modifications — different attention patterns, activation functions, positional encodings, and normalization strategies — the core architecture remains essentially unchanged since 2017. Scaling laws have shown that simply making transformers larger and training them on more data reliably improves performance, which has driven the "bigger is better" trend in AI development. Recent innovations like mixture-of-experts, FlashAttention, and rotary position embeddings are refinements that improve efficiency or capability without abandoning the fundamental transformer design.
