Every large language model you have used (GPT, Claude, Gemini, Llama, Mistral) runs on the same foundational blueprint: the transformer. The architecture, introduced in 2017, is less than a decade old. Before it existed, the state of the art was a different family of networks that could only inch through text one token at a time, losing track of what they had read hundreds of tokens earlier. Understanding why those older networks hit a wall, and exactly how transformers removed it, is the single most useful conceptual foundation for anyone working with AI in 2026.
This guide is technical but not math-heavy. You will leave knowing what self-attention actually computes, why the architecture trains so much faster on GPUs than its predecessors, and what it genuinely cannot do.
The pre-transformer world: RNNs and LSTMs
Before 2017, the dominant architecture for processing sequential text was the Recurrent Neural Network (RNN) and its more capable cousin, the Long Short-Term Memory network (LSTM).
The core idea of an RNN is simple: process one token at a time and carry a hidden state vector forward. At each step t, the network takes the current token embedding x_t and the previous hidden state h_{t-1} and produces a new hidden state h_t:
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)
Everything the model has "learned" about the past is compressed into this single vector. For short sequences, this works. For a 500-token paragraph, it is a disaster.
The vanishing gradient problem. Backpropagating errors through hundreds of sequential steps causes the gradient signal to shrink exponentially. Parameters responsible for encoding the subject of a sentence barely receive a training signal when the prediction error comes from a predicate 200 tokens later. LSTMs introduced gating mechanisms (input gate, forget gate, output gate) that allowed selected information to survive longer—but they did not eliminate the problem; they attenuated it.
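The shrinkage can be observed numerically. The sketch below is an illustration, not a training run: it multiplies together the per-step Jacobians of the tanh recurrence and tracks the norm of the result. The sizes, the random inputs, and the rescaling of W_h to a largest singular value of 0.9 are all assumptions made for the demo; with larger recurrent weights the same product can explode instead of vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 8, 100                                  # toy sizes, chosen for illustration

W_h = rng.normal(size=(d_h, d_h))
W_h *= 0.9 / np.linalg.norm(W_h, 2)              # assumption: largest singular value set to 0.9

h = np.zeros(d_h)
J = np.eye(d_h)                                  # accumulates d h_t / d h_0
grad_norms = []
for t in range(T):
    x_term = rng.normal(size=d_h)                # stands in for W_x · x_t + b
    h = np.tanh(W_h @ h + x_term)
    step_jac = (1.0 - h**2)[:, None] * W_h       # d h_t / d h_{t-1} = diag(1 - h_t^2) · W_h
    J = step_jac @ J
    grad_norms.append(np.linalg.norm(J, 2))

print(grad_norms[9], grad_norms[99])             # norm after 10 steps vs after 100 steps
```

Because every factor has norm at most 0.9 here, the product after 100 steps is bounded by 0.9^100, which is the exponential decay the paragraph above describes.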
No parallelism. Because h_t depends on h_{t-1}, you cannot compute step 50 until you have finished step 49. Training a deep LSTM on a long corpus meant waiting for thousands of sequential matrix multiplications. Modern GPUs are designed to do thousands of matrix multiplications simultaneously; RNNs wasted that capability.
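A minimal NumPy sketch of the recurrence makes the sequential dependency visible. The function name `rnn_forward`, the zero initial state, and the toy dimensions are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def rnn_forward(X, W_h, W_x, b):
    """Run h_t = tanh(W_h · h_{t-1} + W_x · x_t + b) over a sequence.

    X has shape (T, d_in); returns all hidden states, shape (T, d_h).
    """
    h = np.zeros(W_h.shape[0])          # initial hidden state (assumed zero)
    states = []
    for x_t in X:                        # strictly sequential: step t needs h from step t-1
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 50, 4, 8                  # toy sizes, chosen for illustration
X = rng.normal(size=(T, d_in))
W_h = rng.normal(size=(d_h, d_h)) * 0.5
W_x = rng.normal(size=(d_h, d_in)) * 0.5
b = np.zeros(d_h)

H = rnn_forward(X, W_h, W_x, b)
print(H.shape)  # (50, 8)
```

The `for` loop cannot be replaced by a single batched matrix product over time, because each iteration consumes the `h` produced by the one before it.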
Translation as a demonstration of the problem. Machine translation requires encoding a source sentence and decoding a target sentence. The encoder RNN compressed the full source into a single vector. The decoder had to unroll an entire translation conditioned on that one vector. The longer the source, the more meaning was discarded. Researchers added an attention mechanism on top of RNN encoders to let the decoder peek at intermediate hidden states—and this worked much better. The 2017 transformer paper asked the obvious follow-up question: what if attention was the whole model?
"Attention Is All You Need" (2017)
The paper by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin at Google Brain and Google Research appeared at NeurIPS 2017 and became one of the most-cited papers in the history of machine learning.
Its key claim: you do not need recurrence or convolutions. A model built entirely out of attention operations and position-wise feed-forward networks could outperform the best RNN-based translation systems—and train in a fraction of the time.
The insight was that recurrence was a bottleneck, not a necessity. The reason RNNs used sequential hidden states was to propagate context through time. But if you can let every token directly look at every other token simultaneously, context does not need to travel through a chain of hidden states. It arrives in a single step.
The attention mechanism: an intuitive explanation
Imagine a library with a very unusual catalogue system. You walk in with a query—say, you are researching "transformer architecture history." The librarian gives each shelf a small key card summarising what it contains: "2017 NLP papers," "GPU architectures," "RNN history," and so on.
Your query is compared against every key card. Some key cards are highly relevant to your query (high similarity); others are not. The similarity score determines how much attention you pay to each shelf. The actual contents of the relevant shelves are then blended together in proportion to those scores—that blend is your answer.
In transformer terminology:
- Query (Q): what the current token is looking for
- Key (K): what each token advertises about itself
- Value (V): what each token actually contributes if attended to
This analogy is not exact but captures the right intuition: the query-key comparison is a relevance search, and the value-weighted sum is the result of that search.
Self-attention mathematically
In practice, each token's embedding is linearly projected into three separate vectors using learned weight matrices W_Q, W_K, and W_V:
Q = X · W_Q
K = X · W_K
V = X · W_V
where X is the matrix of token embeddings (one row per token). The attention output is then:
Attention(Q, K, V) = softmax( Q · K^T / sqrt(d_k) ) · V
Let us unpack every piece of this formula.
Q · K^T computes the dot product between every query and every key. For a sequence of n tokens, this produces an n × n matrix where entry (i, j) captures how relevant token j is to token i. This is O(n²) in memory and compute—a fact that becomes important at very long contexts.
/ sqrt(d_k) is the scaling factor, where d_k is the dimensionality of the key vectors. Without it, when d_k is large (e.g. 64 or 128), the dot products can become very large in magnitude. Large dot products push the softmax function into a region where gradients are nearly zero—the softmax becomes almost binary (one near-1, all others near-0) and the model stops learning effectively. Dividing by sqrt(d_k) keeps the variance of the dot product roughly constant regardless of dimension.
softmax(...) normalises each row of the scaled attention matrix into a probability distribution. The i-th row sums to 1, and each entry represents how much token i should weight the corresponding token's value.
· V computes the weighted sum of value vectors. The output for token i is a blend of all value vectors, weighted by how much attention token i paid to each other token.
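Put together, the four pieces fit in a few lines of NumPy. This is a single-head sketch with random weights and toy dimensions (all assumed for illustration); the helper names `softmax` and `self_attention` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # project embeddings into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n): relevance of token j to token i
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights                  # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8                       # toy sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Note that nothing in `self_attention` loops over positions: every row of the n × n score matrix is computed in the same matrix product.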
A 3-token walkthrough
Consider the sequence "The cat sat". Suppose we have three tokens with embedding dimension 4 and key dimension 2 (simplified for illustration):
Token 0: "The"
Token 1: "cat"
Token 2: "sat"
Q[1] (query for "cat") = [0.8, 0.3]
K[0] (key for "The") = [0.1, 0.9]
K[1] (key for "cat") = [0.9, 0.1]
K[2] (key for "sat") = [0.5, 0.6]
Dot products for token 1:
with "The": 0.8×0.1 + 0.3×0.9 = 0.08 + 0.27 = 0.35
with "cat": 0.8×0.9 + 0.3×0.1 = 0.72 + 0.03 = 0.75
with "sat": 0.8×0.5 + 0.3×0.6 = 0.40 + 0.18 = 0.58
After dividing by sqrt(2) ≈ 1.41:
[0.25, 0.53, 0.41]
After softmax:
≈ [0.285, 0.379, 0.336]
Attention output for "cat" = 0.285 · V["The"] + 0.379 · V["cat"] + 0.336 · V["sat"]
The word "cat" pays the most attention to itself, some to "sat" (the subject and its action), and less to "The". After training on real data, these weights would reflect genuine linguistic relationships.
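The arithmetic in this walkthrough is easy to check by machine. The snippet below recomputes the attention weights for "cat" directly from the query and key vectors given above; nothing else is assumed.

```python
import numpy as np

q_cat = np.array([0.8, 0.3])                     # Q[1], query for "cat"
K = np.array([[0.1, 0.9],                        # K[0], key for "The"
              [0.9, 0.1],                        # K[1], key for "cat"
              [0.5, 0.6]])                       # K[2], key for "sat"

scores = K @ q_cat                               # raw dot products
scaled = scores / np.sqrt(K.shape[1])            # divide by sqrt(d_k) = sqrt(2)
weights = np.exp(scaled) / np.exp(scaled).sum()  # softmax over the three tokens

print(np.round(scores, 2))    # [0.35 0.75 0.58]
print(np.round(scaled, 2))    # [0.25 0.53 0.41]
print(np.round(weights, 3))   # [0.285 0.379 0.336]
```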
Multi-head attention
A single attention operation gives you one perspective on the relationships in a sequence. Multi-head attention runs h parallel attention operations simultaneously, each with its own learned projection matrices:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q_i, K · W_K_i, V · W_V_i)
Why does this matter? Each head can specialise in a different type of relationship. Researchers who ablate and visualise attention heads have found that different heads within the same layer often take on distinct roles:
- Syntactic heads track subject-verb agreement, dependency arcs, and phrase boundaries
- Coreference heads connect pronouns to their referents ("She said she was tired" — the two "she" tokens get linked)
- Positional heads focus on nearby tokens, functioning almost like a local smoothing window
- Semantic heads track topical similarity across long distances
By concatenating all heads and projecting back to the original dimension with W_O, the model combines these diverse relationship types into a single enriched representation. Typical models use 8 to 128 attention heads, depending on model size.
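A compact way to sketch multi-head attention is to project once with full-width matrices and then split the feature axis into h heads, which is equivalent to giving each head its own column slice of W_Q, W_K, and W_V. The function name, the toy sizes, and the random weights below are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h                                  # assumes d_model is divisible by h
    # Project once, then split the feature axis into h heads: (h, n, d_head)
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n): one score matrix per head
    heads = softmax(scores) @ V                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4                                   # toy sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)  # (6, 32)
```

Each head works in a d_model / h dimensional subspace, so the total cost is similar to one full-width attention operation.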
Positional encoding
Self-attention has a fundamental property that is both a strength and a weakness: it has no built-in notion of order (formally, it is permutation-equivariant). The attention formula treats the input as an unordered set of tokens. Shuffle "The cat sat" into "sat The cat" and, without any additional signal, the model sees the same set of vectors and computes the same attention weights and outputs for each token, merely reordered to match the shuffle.
But language is a sequence. Word order is meaning. Positional encodings inject order information into the token embeddings before they enter the transformer.
Sinusoidal positional encoding (the original paper's approach) adds a fixed vector to each token embedding based on its position pos and embedding dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Different dimensions encode position at different frequencies. Low-frequency components let the model distinguish widely-spaced positions; high-frequency components distinguish adjacent ones. The pattern is deterministic, meaning the model can generalise to sequences longer than it saw during training (to a degree).
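The two formulas translate directly into NumPy. The sketch below assumes an even d_model; the function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)      # (128, 64)
print(pe[0, :4])     # [0. 1. 0. 1.]
```

The resulting matrix is added to the token embeddings row by row, so position 0 always receives the same vector regardless of which token sits there.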
Learned positional embeddings (used in GPT-2 and BERT) simply train a lookup table of position vectors. This is expressive but does not extrapolate beyond the training context length.
RoPE (Rotary Position Embedding). Most modern LLMs with published architectures (Llama, Mistral, and variants) use RoPE, introduced by Su et al. (2021). Rather than adding a position vector, RoPE applies a rotation to the Query and Key vectors before computing attention. The rotation angle depends on position, so the dot product between Q at position m and K at position n depends only on the relative offset (m − n).
RoPE has two important advantages: it encodes relative positions instead of absolute ones (which tends to generalise better), and with scaling techniques like YaRN, usually combined with a small amount of fine-tuning, it allows a model to extend its effective context window beyond the length it was originally trained on. This is one reason some frontier models can advertise 1M-token contexts despite being trained mostly on shorter sequences.
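The relative-position property can be checked in a few lines. This sketch rotates consecutive pairs of dimensions, which is one common layout; real implementations differ in how they pair dimensions, and the function name `rope` and the base of 10000 are assumptions for illustration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of a 1-D vector x by position-dependent angles."""
    d = x.shape[-1]                                    # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)          # one frequency per 2-D pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin             # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (m - n = 3) at two different absolute positions
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))  # True
```

Shifting both positions by the same amount leaves the attention score unchanged, which is what it means for the score to depend only on the relative offset.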
For a deeper look at how context length interacts with positional encoding and costs, see the context window explainer.
The full transformer block
A single transformer layer consists of four components, applied in sequence, with residual connections and layer normalisation at each stage:
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))
(This is the "Pre-LN" variant used by most modern LLMs; the original paper used Post-LN.)
1. Multi-head self-attention. Described above—every token gathers information from the full context.
2. Add & Norm (residual + layer norm). The residual connection adds the sub-layer's input back to the attention output; in the Pre-LN form shown above, layer normalisation is applied to that input before attention, whereas the original Post-LN form normalises after the addition. This is not cosmetic: residual connections are the reason very deep networks can train at all. They give gradients a direct highway back through the network without passing through the attention transformation, preventing vanishing gradients in deep stacks. Layer normalisation stabilises the scale of activations across the model.
3. Feed-forward network (FFN). After attention has gathered context, the FFN processes each token's representation independently. The standard design is two linear layers with a nonlinearity between them:
FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2
Modern models typically use GELU or SwiGLU instead of ReLU for the nonlinearity, as these have been shown to improve downstream performance. The hidden dimension of the FFN is typically 4× the model dimension, meaning a model with 4096-dimensional embeddings has an FFN with 16,384 hidden units (SwiGLU variants often use a smaller multiple, around 8/3×, to keep the parameter count comparable).
This is where the model "thinks" after having collected information via attention. Roughly two-thirds of a transformer block's weights live in the FFN layers, not in the attention matrices: with model dimension d and a 4× hidden size, the FFN holds about 8·d² weights against 4·d² for attention. The attention mechanism routes information; the FFN transforms it.
4. Add & Norm again. Same residual + normalisation pattern after the FFN.
A full LLM stacks many of these blocks (the largest GPT-2 has 48 layers; some of the largest modern models have more than 100), followed by a final layer normalisation and a linear projection to vocabulary logits.
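The Pre-LN block and the stacking can be sketched end to end in NumPy. To keep it short, this version makes several simplifying assumptions: single-head attention, layer norm without learned scale and shift, ReLU in the FFN, random weights, and toy dimensions. All function and parameter names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)              # learned scale/shift omitted for brevity

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):                                  # single head, for brevity
    Q, K, V = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V @ p["W_O"]

def feed_forward(x, p):                               # FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
    return np.maximum(0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]

def transformer_block(x, p):
    x = x + attention(layer_norm(x), p)               # residual around attention
    x = x + feed_forward(layer_norm(x), p)            # residual around the FFN
    return x

def init_params(rng, d_model):
    s = 1.0 / np.sqrt(d_model)
    p = {name: rng.normal(size=(d_model, d_model)) * s for name in ("W_Q", "W_K", "W_V", "W_O")}
    p["W_1"] = rng.normal(size=(d_model, 4 * d_model)) * s
    p["W_2"] = rng.normal(size=(4 * d_model, d_model)) * s
    p["b_1"], p["b_2"] = np.zeros(4 * d_model), np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
n, d_model, n_layers = 5, 16, 3                       # toy sizes
x = rng.normal(size=(n, d_model))
layers = [init_params(rng, d_model) for _ in range(n_layers)]
for p in layers:                                      # a model is a stack of identical blocks
    x = transformer_block(x, p)
x = layer_norm(x)                                     # final layer norm before the vocabulary projection
print(x.shape)  # (5, 16)
```

In this sketch the two FFN matrices hold 8·d_model² weights against 4·d_model² for the four attention matrices, which is where the two-thirds figure comes from.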
Encoder vs decoder vs encoder-decoder
The transformer framework is flexible. Depending on how you mask the attention pattern and connect the encoder and decoder, you get three distinct model families that dominate different tasks.
| Architecture | Attention type | Representative models | Best for |
|---|---|---|---|
| Encoder-only | Bidirectional (every token sees all) | BERT, RoBERTa, DeBERTa | Classification, embeddings, NLU |
| Decoder-only | Causal / masked (only past tokens) | GPT series, Claude, Gemini, Llama | Text generation, chat, agents |
| Encoder-decoder | Bidirectional encoder + causal decoder | T5, BART, original transformer | Translation, summarisation |
Encoder-only models allow full bidirectional attention: token i can attend to both earlier and later tokens. BERT was trained with a masked language modelling objective (predict randomly masked tokens), which requires knowing both sides of a masked word. This makes encoder-only models excellent for producing rich contextual embeddings. The downside is they cannot autoregressively generate new text.
Decoder-only models impose causal masking in attention: the attention matrix's upper triangle is set to -∞ before softmax, preventing token i from attending to tokens at positions j > i. This is the right constraint for generation because at inference time, token i+1 has not yet been produced—it would be cheating to condition on it. During training, the full sequence is available, so causal masking is applied as a mask rather than a true absence of data. Every model behind a modern chat API—GPT-4o, Claude 4.7, Gemini 3.1—is decoder-only.
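The causal mask described above is a one-line change to the attention scores. The sketch below uses random queries and keys with toy sizes (assumptions for illustration) and shows that the resulting weight matrix is lower-triangular.

```python
import numpy as np

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal (j > i)
    scores = np.where(mask, -np.inf, scores)          # future positions get -inf before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                # exp(-inf) = 0, so masked entries vanish
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
W = causal_attention_weights(Q, K)
print(np.round(W, 2))  # lower-triangular; the first row is [1. 0. 0. 0.]
```

The first token can only attend to itself, so its row is always a one followed by zeros; each later row spreads its weight over itself and the tokens before it.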
Encoder-decoder models use a full bidirectional encoder to process the source (a document to summarise, a sentence to translate) and a causally-masked decoder that generates the output one token at a time, with each decoder layer attending to the full encoder output via cross-attention. T5 (Text-to-Text Transfer Transformer) treated every NLP task—classification, translation, question answering—as a text-to-text problem and remains the cleanest example of this design.
Understanding the relationship between tokens and the attention pattern is fundamental to understanding both the costs and capabilities of each family.
Why transformers scaled
The shift from RNNs to transformers did not just produce better models; it unlocked an entirely new scaling regime. Several factors combined:
Parallelism during training. The attention scores for all token pairs can be computed at once. Where an RNN required sequential matrix multiplications through the sequence, a transformer computes all of them in a batch. This maps well to GPU and TPU hardware, which excels at large batched matrix operations. The original paper reported state-of-the-art translation quality after 3.5 days of training on eight GPUs, a small fraction of the training cost of the best earlier systems.
Scaling laws. Kaplan et al. (2020) at OpenAI showed that LLM loss decreases as a smooth power law as you increase parameters, training compute, and dataset size, with relatively stable exponents. This meant researchers could predict in advance how much better a 10× larger model would be, and extrapolate with some confidence. The same study found that LSTMs also improve with scale but fall behind transformers on tokens late in the context. The predictability of scaling laws enabled the investment in very large training runs.
Hardware alignment. The core operation in a transformer forward pass is dense matrix multiplication (specifically, the Q·K^T and attention·V products, plus the FFN linear layers). Modern GPU tensor cores are built specifically to execute large matrix multiplications in mixed precision. A transformer can use a large share of a GPU's FLOP budget; an RNN uses only a fraction. This means that scaling transformers translates almost linearly into utilising more hardware.
Emergent capabilities. As model scale increased, new capabilities appeared without being explicitly trained for—code generation, multi-step arithmetic, few-shot learning from examples in the prompt. These emergent properties made each scaling step disproportionately valuable, reinforcing investment in larger models.
The result is models with parameter counts in the hundreds of billions or trillions, trained on trillions of tokens of text, code, and multimodal data.
What transformers still cannot do
Honest assessment matters more in 2026 than uncritical enthusiasm. Transformers have genuine structural limitations.
Quadratic attention cost. The O(n²) scaling of self-attention (in memory and compute) becomes expensive at very long contexts. Naively materialising the attention matrix for a sequence of 1 million tokens means a 1M × 1M matrix: about 4 TB at float32, for a single head in a single layer. In practice, frontier models avoid this in two ways. FlashAttention computes exact attention with a memory-efficient kernel that never materialises the full matrix, which removes the memory cost but not the quadratic compute. Approximate methods reduce the compute itself: sparse attention (attend only to a subset of positions), sliding window attention (each token attends to a fixed local window), and linear attention (approximate the full operation with kernel tricks). These approximations trade fidelity for feasibility.
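The 4 TB figure is simple arithmetic: n × n entries at 4 bytes each. The helper below (a hypothetical name, written for this illustration) shows how quickly the number grows with sequence length.

```python
def attention_matrix_bytes(n_tokens, bytes_per_entry=4):
    """Memory for one full n × n attention matrix (one head, one layer), float32 by default."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_matrix_bytes(n) / 1e12:.6f} TB")
# 1,000,000 tokens gives 4.000000 TB
```

Doubling the context length quadruples the matrix, which is why a 10× longer context costs 100× more here.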
No persistent memory across conversations. A transformer's "memory" is entirely contained in its context window. When a conversation ends and a new one begins, the model starts from zero—no accumulated knowledge of you, your preferences, or your past sessions. Solutions exist (retrieval systems, external databases, vector stores) but they are external scaffolding, not intrinsic to the architecture.
Pattern-matching, not symbolic reasoning. Transformers are extremely powerful interpolators over their training distribution. They have learned statistical regularities across enormous corpora. But they do not maintain a world model, do not perform explicit symbolic operations, and do not intrinsically verify logical consistency. Tasks that require multi-step deduction with precise tracking of state—long mathematical proofs, complex program execution—remain challenging and are addressed through external tools, chain-of-thought prompting, and scaffolding rather than architectural change.
Context window as working memory. The context window is the model's working memory for a session. While it has expanded dramatically (from 1,024 tokens in GPT-2 to millions in 2026), it is fundamentally bounded. Very long documents, codebases, or multi-session histories must be retrieved and fitted into the window rather than recalled from persistent internal state.
The 2026 landscape: MoE and what comes next
The transformer block described above remains the foundation of every frontier model—but the form it takes has evolved substantially. The dominant extension in 2026 is Mixture of Experts (MoE).
MoE in one paragraph. A standard transformer has a single FFN per layer. MoE replaces that FFN with E parallel expert FFNs and adds a lightweight router that, for each token, assigns it to the top-k experts (typically k = 1 or 2). Only those experts activate for that token. The layer's FFN parameter count is roughly E times that of a dense baseline, but the active parameters per token, and thus the FLOPs per forward pass, scale with k rather than E. You get expressivity at the scale of a huge model with the inference cost of a much smaller one.
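A toy version of top-k routing looks like this. The gating scheme (softmax over the scores of the chosen experts only), the ReLU experts without biases, and every name and size are assumptions made for illustration; production routers add load balancing and batched dispatch that this sketch leaves out.

```python
import numpy as np

def moe_layer(x, W_router, experts, k=2):
    """x: (n, d). experts: list of (W_1, W_2) pairs. Returns (output, chosen expert ids)."""
    logits = x @ W_router                              # (n, E) router scores
    top_k = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # per-token loop for clarity, not speed
        chosen = logits[t, top_k[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                           # softmax over the chosen experts only
        for g, e in zip(gates, top_k[t]):
            W_1, W_2 = experts[e]
            out[t] += g * (np.maximum(0, x[t] @ W_1) @ W_2)   # only k of E experts run
    return out, top_k

rng = np.random.default_rng(0)
n, d, E = 4, 8, 6                                      # toy sizes
x = rng.normal(size=(n, d))
W_router = rng.normal(size=(d, E))
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))) for _ in range(E)]
y, chosen = moe_layer(x, W_router, experts, k=2)
print(y.shape, chosen.shape)  # (4, 8) (4, 2)
```

All six experts' weights sit in memory, but each token touches only two of them, which is the gap between total and active parameters.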
Real deployments. DeepSeek V4-Pro is a 1.6 trillion parameter MoE with 49 billion active parameters per token, achieving scores on agent coding benchmarks that rival much more expensive dense models. Gemini Ultra 2.5 uses an MoE architecture, as does the reported GPT-5.5 family. The shift from dense to MoE is the single biggest architectural trend in production LLMs over the past two years.
Other active research directions. Sub-quadratic alternatives to full attention (linear attention variants, and state space models such as Mamba) aim to break the O(n²) wall without sacrificing too much expressivity. Retrieval-augmented generation (RAG) externalises long-term memory. Speculative decoding speeds inference. Continuous batching and paged attention improve serving throughput. Most of these are modifications to the transformer core or wrappers around it, and none has replaced it.
Putting it together: what happens when you send a message
When you type a message to any modern LLM, here is what happens inside the transformer stack:
- Tokenisation. Your text is split into tokens: subword units from a vocabulary of typically 32,000–150,000 entries. Each token maps to an integer ID.
- Embedding. Token IDs are looked up in an embedding table to produce a dense vector (e.g. 4,096 dimensions). Positional encodings (RoPE for most modern models) are applied.
- Forward pass through N transformer blocks. At each block, multi-head attention lets every token gather information from the context, and the FFN transforms each token's enriched representation. In a 100-layer model, this happens 100 times.
- Final layer norm + unembedding. The last layer's output is projected back to vocabulary size via the unembedding matrix. Softmax converts these logits to a probability distribution over the next token.
- Sampling. A next token is sampled (or selected via greedy decoding, beam search, or temperature-adjusted sampling). That token is appended to the context, and the process repeats.
This autoregressive loop runs once per token in the output. A 200-token response requires 200 forward passes, though KV caching keeps each one cheap: the model caches the Key and Value vectors for every token already processed and, at each generation step, computes attention only for the newest token.
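KV caching can be demonstrated on a single attention head. The sketch below processes tokens one at a time, appending each new Key and Value to a cache, and then checks that the result matches full causal attention recomputed from scratch. The random embeddings stand in for real tokens, and all names and sizes are assumptions; a real model keeps one such cache per layer and per head.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(q, K, V):
    """Single-query attention over all cached keys and values."""
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))                       # stand-in embeddings for 6 tokens

K_cache, V_cache, outputs = [], [], []
for x in tokens:                                       # one step per token
    K_cache.append(x @ W_K)                            # compute K and V for the new token only
    V_cache.append(x @ W_V)
    outputs.append(attend(x @ W_Q, np.stack(K_cache), np.stack(V_cache)))
cached = np.stack(outputs)

# Reference: recompute full causal attention from scratch
Q, K, V = tokens @ W_Q, tokens @ W_K, tokens @ W_V
scores = np.where(np.tri(6, dtype=bool), Q @ K.T / np.sqrt(d), -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
full = (weights / weights.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(cached, full))  # True
```

The cached path does one query's worth of work per step instead of re-running attention over the whole prefix, at the price of storing every past Key and Value.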
Reading the architecture through the right lens
The transformer is not magic; it is a specific set of engineering choices that turned out to compose well, parallelise well, and scale well. Its core innovation—replace sequential hidden state propagation with direct, parallel token-to-token attention—was the right abstraction at the right moment when GPUs were becoming commodity hardware and internet-scale text datasets were becoming available.
Understanding self-attention means understanding why frontier LLMs are good at what they are good at (synthesising relevant information across long contexts) and being honest about what they lack (persistent memory, exact symbolic reasoning, cheap long-context inference). Every architectural advance in 2026, from MoE to RoPE scaling to linear attention approximations, is best understood as a specific answer to one of those structural limitations.
For adjacent fundamentals, see what tokens are, what parameter counts mean, how the context window works, and how a completely different generative architecture, the diffusion model, generates images using transformers as one of its key components.