The Next Word
How a transformer turns a sentence into a prediction — every step, every number.
A transformer does exactly one thing: given some words, predict the next word. Everything it does is in service of that. So let’s give it a sentence and watch it predict.
By the end you’ll have computed, by hand, why the answer comes out “mat”. There are five stages:
- 1. Embeddings + position: turn each word, and where it sits, into numbers
- 2. Attention: let words gather context from each other ★ the heart
- 3. MLP: let each word digest what it gathered
- 4. Layers: stack stages 2 and 3 many times
- 5. Prediction: turn the result into the next word
Stage 1 · Embeddings: words become numbers
A computer can’t compute on the letters c-a-t. So every word is assigned a short list of numbers — a vector. Words used similarly get similar vectors, so meaning becomes geometry: nearby vectors = related words.
Picture that space directly. Words don’t scatter at random — they cluster by meaning: animals near animals, actions near actions, the little glue words off in their own corner.
Our five words live in a space just like this — we only need to write down their coordinates. With four numbers each, here they are:
| word | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| the | 1 | 0 | 0 | 1 |
| cat | 0 | 2 | 1 | 0 |
| sat | 1 | 0 | 2 | 0 |
| on | 0 | 1 | 0 | 1 |
| the | 1 | 0 | 0 | 1 |
Real models use ~4000 numbers per word; we use 4 so we can see them. Note the two "the"s have identical content here — in a moment, position will pull them apart.
Formally, the content embeddings live in a learned matrix $E$ with one row per vocabulary word. Looking up token $t_i$ selects its row (we treat every token vector as a row throughout — the convention papers and code use):
$$ e_i = E_{t_i,\,:}, \qquad E \in \mathbb{R}^{|V| \times d} \label{eq:embed} $$
where $d$ is the model dimension (4 in our toy example, ~4000 in real models) and $|V|$ is the vocabulary size. This $e_i$ captures what the word means, but not yet where it sits.
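The lookup in Eq. $\eqref{eq:embed}$ is nothing more than picking a row. Here is a minimal Python sketch using the toy numbers from the table above; the names `vocab`, `token_id`, and `embed` are invented for this illustration and do not come from any library.

```python
# Toy embedding matrix E: one row per vocabulary word, d = 4 numbers each.
# The rows are the toy coordinates from the table above.
vocab = ["the", "cat", "sat", "on"]
E = [
    [1, 0, 0, 1],  # the
    [0, 2, 1, 0],  # cat
    [1, 0, 2, 0],  # sat
    [0, 1, 0, 1],  # on
]

# Map each word to its row index (its token id).
token_id = {word: i for i, word in enumerate(vocab)}


def embed(words):
    """Return the row of E for each word: e_i = E[t_i, :]."""
    return [E[token_id[w]] for w in words]


sentence = ["the", "cat", "sat", "on", "the"]
e = embed(sentence)
for word, row in zip(sentence, e):
    print(word, row)
```

Both occurrences of "the" come back as the same row, which is exactly the problem position has to solve next.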
Where each word sits: positional encoding
Here’s a catch we must fix before attention. The attention step (next) treats the words as an unordered set: on its own it cannot tell "the cat sat" from "sat cat the." But order is meaning (“dog bites man” ≠ “man bites dog”). So we stamp each word with where it sits by adding a position vector $p_i$ to its content embedding:
$$ x_i = e_i + p_i \label{eq:pos} $$
Each slot gets its own distinct $p_i$, so the same word in two places ends up with two different input vectors $x_i$. (Real models build $p_i$ from smooth functions — sinusoids, or rotations like RoPE; we use simple stamps so you can add them by hand.)
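Position stamps $p_i$:

| word @ position | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| the @0 | 0 | 0 | 0 | 0 |
| cat @1 | 1 | 0 | 0 | 0 |
| sat @2 | 0 | 1 | 0 | 0 |
| on @3 | 0 | 0 | 1 | 0 |
| the @4 | 0 | 0 | 0 | 1 |

Input vectors $x_i$ (content plus position):

| word @ position | dim 1 | dim 2 | dim 3 | dim 4 |
|---|---|---|---|---|
| the @0 | 1 | 0 | 0 | 1 |
| cat @1 | 1 | 2 | 1 | 0 |
| sat @2 | 1 | 1 | 2 | 0 |
| on @3 | 0 | 1 | 1 | 1 |
| the @4 | 1 | 0 | 0 | 2 |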
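The same addition can be checked in a few lines of Python. This is only a sketch of the toy arithmetic: the position stamps are the hand-made ones from this walkthrough, not the sinusoids or rotations a real model would use.

```python
# Content embeddings e_i for "the cat sat on the" (toy values from above).
e = [
    [1, 0, 0, 1],  # the
    [0, 2, 1, 0],  # cat
    [1, 0, 2, 0],  # sat
    [0, 1, 0, 1],  # on
    [1, 0, 0, 1],  # the
]

# Hand-made position stamps p_i, one per slot 0..4.
p = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# x_i = e_i + p_i, added element by element.
x = [[ei + pi for ei, pi in zip(e_row, p_row)] for e_row, p_row in zip(e, p)]

for i, row in enumerate(x):
    print(i, row)

# The two "the"s share content but no longer share an input vector.
print(e[0] == e[4], x[0] == x[4])  # True False
```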
The two "the"s were identical as content, but now differ: [1,0,0,1] vs [1,0,0,2]. Position pulled them apart before attention even runs.
The two "the"s already differ. ✓
Stage 2 · Attention ★ the heart
Here’s the problem attention exists to solve. To predict what follows "the", the model must know what the sentence is about, but the final "the"’s vector knows nothing about “cat” or “sat” on its own. Attention lets it look at the other words and pull in their meaning.
We follow the final "the", since that’s the one predicting the next token.
Query, Key, Value: three roles per word
Every word produces three vectors by multiplying its input vector $x$ with three learned matrices. Think of searching YouTube: your query (what you type) is matched against each video’s key (its title), and you receive its value (the video itself).
$$ q = x\,W_Q \qquad k = x\,W_K \qquad v = x\,W_V \label{eq:qkv} $$
Each weight matrix has shape $d\times d_k$ (here $4\times 3$), turning a 4-dim word vector into a 3-dim query, key, or value.
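So one input vector fans out into three: a query, a key, and a value, each through its own learned matrix.
Let’s make that concrete with the query for our last word, "the" @4, whose input vector is $x = [1, 0, 0, 2]$, worked out by hand:
$$ q = x\,W_Q = \begin{bmatrix} 1 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix} $$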
Doing the same with $W_K$ and $W_V$ for every word gives a key and a value for each. (Same arithmetic, just repeated — we’ll show the results.)
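The hand computation of the query can be reproduced with a tiny row-vector-times-matrix helper. This is a sketch, not library code; `vec_mat` is a name invented here, and only $W_Q$ is shown because the walkthrough gives the keys and values as finished results rather than listing $W_K$ and $W_V$.

```python
# W_Q from the worked example: shape d x d_k = 4 x 3.
W_Q = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

# Input vector for the final "the" (content plus position stamp).
x_last = [1, 0, 0, 2]


def vec_mat(x, W):
    """Row vector times matrix: out[j] = sum over i of x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


q = vec_mat(x_last, W_Q)
print(q)  # [1, 2, 1]
```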
Step A — how relevant is each word? (dot products)
We score relevance by the dot product of our query with each word’s key — one number measuring alignment. Bigger = more relevant.
$$ \text{score}_i = q \cdot k_i = \sum_j q_j\,k_{i,j} \label{eq:score} $$

| word | key $k$ | $q \cdot k$ |
|---|---|---|
| the @0 | [1, 0, 1] | 2 |
| cat @1 | [3, 3, 1] | 10 |
| sat @2 | [2, 3, 2] | 10 |
| on @3 | [1, 2, 2] | 7 |
| the @4 | [1, 0, 2] | 3 |
e.g. cat: $1{\cdot}3 + 2{\cdot}3 + 1{\cdot}1 = 10$. “cat” and “sat” tie for highest — the subject and the verb.
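Here is Step A as a short Python sketch, using the query and keys from the table above. The helper name `dot` and the dictionary layout are choices made for this illustration only.

```python
# Query for the final "the", and each word's key (toy values from above).
q = [1, 2, 1]
keys = {
    "the @0": [1, 0, 1],
    "cat @1": [3, 3, 1],
    "sat @2": [2, 3, 2],
    "on @3": [1, 2, 2],
    "the @4": [1, 0, 2],
}


def dot(a, b):
    """Dot product: multiply matching entries, then add them up."""
    return sum(ai * bi for ai, bi in zip(a, b))


# One relevance score per word: score_i = q . k_i
scores = {word: dot(q, k) for word, k in keys.items()}
for word, s in scores.items():
    print(word, s)
```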
Step B — turn scores into percentages (softmax)
Raw scores aren’t percentages. Softmax turns any list of numbers into positive fractions that sum to 100% — bigger scores get exponentially bigger shares. (We first divide by $\sqrt{d_k}=\sqrt3$, a standard step that keeps scores from blowing up.)
$$ \text{weight}_i = \frac{e^{\,\text{score}_i/\sqrt{d_k}}}{\sum_j e^{\,\text{score}_j/\sqrt{d_k}}} \label{eq:softmax} $$
The model splits its attention equally between “cat” (the subject) and “sat” (the verb), about 45% each, with roughly 8% going to “on” and under 1% to each “the”. Those two are the most informative words for guessing what comes next, which is exactly the context you’d want.
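Step B can be checked numerically with the standard library alone. In this sketch the scores are the five dot products from Step A; subtracting the largest score before exponentiating is a common stability trick added here, not something the text above does, and it leaves the result unchanged.

```python
import math

# Raw scores from Step A, in word order: the, cat, sat, on, the.
scores = [2, 10, 10, 7, 3]
d_k = 3  # size of each query/key vector in the toy example


def softmax(zs):
    """Turn any list of numbers into positive fractions that sum to 1."""
    m = max(zs)  # shift by the max so exp() never sees a huge number
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [v / total for v in exps]


# Scale by sqrt(d_k) first, then softmax.
weights = softmax([s / math.sqrt(d_k) for s in scores])

for word, w in zip(["the @0", "cat @1", "sat @2", "on @3", "the @4"], weights):
    print(f"{word}: {w:.1%}")
```

With these numbers, "cat" and "sat" each receive about 45% of the attention and "on" about 8%.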
Step C — mix the values by those weights
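Now build the new vector for "the" as a weighted blend of every word’s value, using the percentages above. This blended vector is “the” with context baked in.
$$ \text{out} = \sum_i \text{weight}_i\, v_i \label{eq:blend} $$

| word | weight | value $v$ |
|---|---|---|
| the @0 | 0.4% | [3, 0, 3] |
| cat @1 | 45.4% | [2, 4, 1] |
| sat @2 | 45.4% | [2, 5, 1] |
| on @3 | 8.0% | [1, 3, 2] |
| the @4 | 0.8% | [4, 0, 5] |
| blend (output) | 100% | [1.94, 4.32, 1.12] |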
The big middle value (4.32) comes straight from cat & sat, whose values are large in that slot. “the” has absorbed the subject-and-verb meaning.
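Step C in code: a minimal sketch that recomputes the softmax weights from the Step A scores and then blends the value vectors from the table above. Variable names are illustrative.

```python
import math

# Scores from Step A and the value vectors, in word order.
scores = [2, 10, 10, 7, 3]
values = [
    [3, 0, 3],  # the @0
    [2, 4, 1],  # cat @1
    [2, 5, 1],  # sat @2
    [1, 3, 2],  # on @3
    [4, 0, 5],  # the @4
]

# Softmax of the scaled scores (d_k = 3).
exps = [math.exp(s / math.sqrt(3)) for s in scores]
weights = [v / sum(exps) for v in exps]

# Weighted blend: for each slot j, sum weight_i * value_i[j] over all words.
blend = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(3)]

print([round(b, 2) for b in blend])  # [1.94, 4.32, 1.12]
```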
The famous one-line formula just packs Steps A–C — Eqs. $\eqref{eq:score}$, $\eqref{eq:softmax}$, $\eqref{eq:blend}$ — into one expression, and now you can read it:
$$ \text{Attention}(Q,K,V)=\underbrace{\text{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d_k}}\Big)}_{\text{the weights (A, B)}}\;\underbrace{V}_{\text{blended values (C)}} \label{eq:attn} $$
And the three steps you just did are really one motion: ask, score, blend.
Multi-head attention. A word often needs several kinds of context at once (subject? tense? tone?). So this whole operation runs $H$ times in parallel — each “head” $h$ with its own $W_Q^h,W_K^h,W_V^h$ — and the results are concatenated and mixed by an output matrix $W_O$:
$$ \text{head}_h = \text{Attention}\!\big(XW_Q^h,\; XW_K^h,\; XW_V^h\big) \label{eq:head} $$
$$ \text{MultiHead}(X) = \big[\,\text{head}_1 \,\Vert\, \text{head}_2 \,\Vert\, \cdots \,\Vert\, \text{head}_H\,\big]\,W_O \label{eq:mha} $$
where $\Vert$ means “stack side by side.” Same machinery as above, just repeated and recombined.
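To see the concatenate-and-mix step of Eqs. $\eqref{eq:head}$ and $\eqref{eq:mha}$ in motion, here is a pure-Python sketch with $H = 2$ heads. The input `X` is the five toy input vectors from Stage 1, but every weight matrix below is a hypothetical value made up for the example (the walkthrough does not supply per-head weights), and `W_O` is set to the identity so the output is just the two heads side by side. It follows the equations as written, so it applies no causal mask.

```python
import math


def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def softmax(zs):
    m = max(zs)  # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [v / total for v in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one output row per query row."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head(X, heads, W_O):
    """Run every head, stack the outputs side by side, then mix with W_O."""
    outs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
            for W_Q, W_K, W_V in heads]
    concat = [[v for out in outs for v in out[i]] for i in range(len(X))]
    return matmul(concat, W_O)


# The five input vectors x_i from Stage 1 (5 tokens, d = 4).
X = [[1, 0, 0, 1], [1, 2, 1, 0], [1, 1, 2, 0], [0, 1, 1, 1], [1, 0, 0, 2]]

# Hypothetical per-head weights (W_Q, W_K, W_V), each 4 x 2.
head_1 = ([[1, 0], [0, 1], [1, 0], [0, 1]],
          [[0, 1], [1, 0], [0, 1], [1, 0]],
          [[1, 0], [0, 1], [0, 0], [0, 0]])
head_2 = ([[0, 1], [1, 0], [1, 1], [0, 0]],
          [[1, 1], [0, 0], [1, 0], [0, 1]],
          [[0, 0], [0, 0], [1, 0], [0, 1]])

# Hypothetical output matrix: the 4 x 4 identity, so nothing is remixed.
W_O = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

Y = multi_head(X, [head_1, head_2], W_O)
print(len(Y), len(Y[0]))  # 5 4
```

Each head produces a 5 × 2 result; stacking two of them side by side restores width 4, which is why real models typically pick the per-head size so that $H \cdot d_k = d$.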
"the" is now context-aware: mostly cat + sat. ✓Stage 3 · MLP: each word digests privately
Attention was social — words traded information. Now each word, on its own, processes what it gathered. The output vector is passed through a small two-step network: expand to a larger size (room to compute combinations), apply a nonlinearity, then squeeze back down.
In symbols, it’s two matrix multiplies with a nonlinearity $\phi$ in between — applied to each word’s vector $x$ independently:
$$ \text{MLP}(x) = \phi\!\big(x\,W_1 + b_1\big)\,W_2 + b_2 \label{eq:mlp} $$
$W_1$ projects up to a larger hidden size (the “room to think”), $\phi$ is a nonlinearity (e.g. ReLU or GELU) that lets the network compute non-linear combinations, and $W_2$ projects back down. Without $\phi$, the two matrices would collapse into one and the layer could only do linear maps.
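A small Python sketch of Eq. $\eqref{eq:mlp}$, applied to the blended vector from Stage 2. The weights and biases here are hypothetical numbers chosen so the arithmetic is easy to follow (the walkthrough does not give MLP weights), with ReLU standing in for $\phi$ and a hidden size of 4 as the "larger" size.

```python
def relu(h):
    """ReLU nonlinearity: negative entries become zero."""
    return [max(0.0, v) for v in h]


def vec_mat(x, W):
    """Row vector times matrix."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def mlp(x, W_1, b_1, W_2, b_2, phi=relu):
    """MLP(x) = phi(x W_1 + b_1) W_2 + b_2, applied to one word's vector."""
    hidden = phi([h + b for h, b in zip(vec_mat(x, W_1), b_1)])
    return [o + b for o, b in zip(vec_mat(hidden, W_2), b_2)]


# Context-aware vector for "the" from the attention step.
x = [1.94, 4.32, 1.12]

# Hypothetical weights: project 3 -> 4 (expand), then 4 -> 3 (squeeze back).
W_1 = [[1, -1, 0, 1],
       [0, 1, -1, 0],
       [-1, 0, 1, 1]]
b_1 = [0.0, 0.0, 1.0, 0.0]
W_2 = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 1, 1]]
b_2 = [0.1, 0.0, -0.1]

y = mlp(x, W_1, b_1, W_2, b_2)
print([round(v, 2) for v in y])  # [3.98, 5.44, 2.96]
```

The third hidden unit comes out negative before ReLU and is zeroed, which is the kind of on/off behavior a purely linear layer cannot produce.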
Stage 4 · Layers: stack it deep
One layer gives shallow understanding, so we repeat the same (attention + MLP) block many times: dozens, in real models. Each layer’s output is the next layer’s input, so understanding compounds. Roughly speaking, early layers tend to catch grammar and later layers assemble more abstract meaning.
One layer is two sub-steps, and each adds its result back to its input — a residual connection — so information is never thrown away:
$$ x' = x + \text{MultiHead}(x) \qquad\quad y = x' + \text{MLP}(x') \label{eq:resid} $$
Stacking $N$ such layers just feeds each one’s output into the next, starting from the embeddings $x^{(0)}$:
$$ x^{(\ell+1)} = \text{Layer}_\ell\big(x^{(\ell)}\big), \qquad \ell = 0, 1, \dots, N-1 \label{eq:stack} $$
Stage 5 · Prediction: numbers back into a word
After the last layer, we take the vector at the final position (it now carries the whole sentence’s context) and score every word in the vocabulary. Each candidate word has its own output vector; the dot product with our context vector is its logit (raw score). Softmax — the same one as Eq. $\eqref{eq:softmax}$ — turns logits into probabilities.
Concretely, an unembedding matrix $W_U$ (one column per vocabulary word) turns the final vector $x^{(N)}_{\text{last}}$ into a logit per word; softmax turns those into probabilities:
$$ z = x^{(N)}_{\text{last}}\, W_U, \qquad W_U \in \mathbb{R}^{d \times |V|},\quad z \in \mathbb{R}^{|V|} \label{eq:logits} $$
$$ P(\text{next}=w) = \frac{e^{\,z_w}}{\sum_{w'} e^{\,z_{w'}}}, \qquad \widehat{w} = \arg\max_w P(\text{next}=w) \label{eq:predict} $$
Each column of $W_U$ is a word’s “signature”; the dot product $z_w$ measures how well the context vector matches it. We then pick the highest-probability word (or sample from the distribution).
It’s the attention picture again, now aimed at the vocabulary: the context vector tries on each candidate word’s signature and keeps the best fit.
The arithmetic behind it is one dot product per candidate word:
| word | $W_U$ column | logit (dot) |
|---|---|---|
| mat | [2, -2, -1] | -5.88 |
| rug | [-2, -1, 1] | -7.08 |
| floor | [-1, -1, -1] | -7.38 |
| carpet | [-2, -1, 0] | -8.20 |
e.g. mat: $2(1.94) - 2(4.32) - 1(1.12) = -5.88$. Logits can be negative or positive; only their differences matter, because softmax gives the same probabilities when the same constant is added to every logit. After softmax over the vocabulary:
"the cat sat on the" → mat (62%). You just predicted a word, by hand.
Then the loop: append "mat" to the sentence and run the entire machine again to get the next word. And again. That predict→append→repeat loop is how an LLM writes whole paragraphs — one word at a time.
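The predict→append→repeat loop itself is only a few lines. In this sketch, `generate` is the loop and `toy_predict` is a hypothetical stand-in for the entire transformer: a fixed lookup table on the last word, invented here purely so the loop has something to call. The continuation words after "mat" are made up for the demo.

```python
def generate(predict_next, prompt, max_new_tokens):
    """Repeatedly predict one token and append it to the running sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(predict_next(tokens))  # the whole model runs here
    return tokens


# Hypothetical stand-in for the model: looks only at the last word.
TOY_NEXT = {"the": "mat", "mat": "and", "and": "purred"}


def toy_predict(tokens):
    return TOY_NEXT.get(tokens[-1], ".")


out = generate(toy_predict, ["the", "cat", "sat", "on", "the"], max_new_tokens=3)
print(" ".join(out))  # the cat sat on the mat and purred
```

A real model would replace `toy_predict` with the five stages above, and would usually sample from the probabilities rather than always taking the top word.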
The whole machine, in one view
Every stage you just computed by hand, assembled into a single picture, including the loop that turns one prediction into a whole paragraph.
The one-paragraph version
A transformer turns each word into a vector, then repeatedly lets the words look at each other and exchange context (attention) and individually digest it (the MLP). Stacking that gather-and-digest cycle builds a deep, context-aware understanding. It then turns the final vector into a probability over the next word, picks one, appends it, and repeats. The one idea that makes it all work is attention (Eq. $\eqref{eq:attn}$): score relevance with a dot product, softmax it into weights, and blend the values — which is exactly the arithmetic you just did for "the cat sat on the → mat."
Bridge to the modern transformer
Everything above is the timeless core. A modern model — like the one dissected in Aleksa Gordić’s “The Life of a Token” — keeps all five stages but swaps fancier machinery into some of them. Here’s the name-map (each name links to its source paper), so the jargon doesn’t trip you up when you read it:
| In this doc | Name you’ll see | What’s different |
|---|---|---|
| Positional encoding (add $p_i$, Eq. $\eqref{eq:pos}$) | RoPE · YaRN | Instead of adding a position vector, modern models rotate each query and key by an angle that grows with position; YaRN stretches this to very long contexts. |
| MLP (Eq. $\eqref{eq:mlp}$) | SwiGLU · GeGLU | A gated two-branch MLP: one branch goes through the nonlinearity and multiplies (“gates”) the other. Same role, a bit more expressive. |
| Normalization (the step we skipped) | RMSNorm · LayerNorm | Rescales each vector to keep numbers stable between layers. RMSNorm is the common modern choice. |
| Multi-head attention (Eq. $\eqref{eq:mha}$) | MHA · GQA · MQA | The same heads — but several heads share one set of keys/values to save memory (GQA = grouped-query, the modern default). |
| Embeddings, attention core, residuals, prediction | (unchanged) | These are identical. The dot-product attention you computed by hand is exactly what runs inside every modern model. |
That’s the whole trick: if you can follow the five stages here, you can read his blog — you’ll just be meeting new names for parts you already understand.
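As one example of how small these swapped-in parts can be, here is a sketch of the normalization step named in the table, RMSNorm: divide a vector by the root-mean-square of its entries, then multiply by a learned per-dimension gain. The function name `rms_norm`, the default gain of all ones, and the `eps` value are choices made for this illustration, not taken from a particular model.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Rescale x so the mean of its squared entries is about 1."""
    if gain is None:
        gain = [1.0] * len(x)  # assumed default: no learned rescaling
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


# The context-aware vector from Stage 2, and the same vector 10x larger.
y = rms_norm([1.94, 4.32, 1.12])
y_big = rms_norm([19.4, 43.2, 11.2])

print([round(v, 3) for v in y])
print([round(v, 3) for v in y_big])
```

Both calls print essentially the same vector, which is the point: whatever scale a layer hands over, the next layer sees numbers of a stable size.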
Where to go next
This document is the on-ramp. Here’s a path onward, roughly easy → deep:
- Attention in transformers, visually explained — 3Blue1Brown. The single best animation of the attention mechanism.
- But what is a GPT? — 3Blue1Brown. The big picture, beautifully animated.
- The Illustrated Transformer — Jay Alammar. The classic diagram-driven walkthrough (original 2017 architecture).
- The Life of a Token — Aleksa Gordić. The modern dense transformer in depth (RoPE/YaRN, RMSNorm, GQA) — the blog this doc bridges to.
- AI by Hand — Prof. Tom Yeh. Worksheets that make you compute the matrices on paper.
- Let’s build GPT from scratch — Andrej Karpathy. Code a working transformer line by line.
- The Annotated Transformer — Harvard NLP. The original paper as runnable PyTorch.
- Attention Is All You Need — Vaswani et al., 2017. The paper that started it.
- RoFormer (RoPE) · YaRN · GLU Variants (GeGLU/SwiGLU) · RMSNorm · GQA — the modern pieces named in the bridge above.
External links were accurate at the time of writing; if one moves, a title search will find it.