Feed-Forward Network (FFN)
The Transformer Block is the repeating unit that makes up the encoder and decoder stacks. Each block has two main parts: an attention layer and a Feed-Forward Network (FFN), held together by residual connections and layer normalization (the next two subtopics). This article covers the FFN.
The idea is simple but important: attention mixes information between tokens, and then the FFN processes each token on its own. First the tokens communicate, then each one is transformed individually.
💡 In one line: The FFN is a small two-layer network applied to each token separately, adding non-linearity and computing power after attention has shared information.
Where the FFN Sits in the Block
A Transformer block runs in this order:
- (Masked) Self-Attention → Add & Norm
- Feed-Forward Network → Add & Norm
So attention comes first (tokens exchange information), then the FFN (each token is transformed). A useful way to remember it: attention = communicate, FFN = compute.
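The ordering above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: `self_attention` is a toy single-head attention with no learned projections and no mask, `layer_norm` has no learned scale or shift, and all function names and sizes are hypothetical, chosen for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x):
    # Toy attention: every token becomes a weighted mix of all tokens.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def ffn(x, w1, b1, w2, b2):
    # Two linear layers with ReLU in between, applied to each row (token).
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_block(x, w1, b1, w2, b2):
    x = layer_norm(x + self_attention(x))       # Self-Attention -> Add & Norm
    x = layer_norm(x + ffn(x, w1, b1, w2, b2))  # Feed-Forward Network -> Add & Norm
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

out = transformer_block(x, w1, b1, w2, b2)
print(out.shape)  # (5, 8)
```

The first line of `transformer_block` is the "communicate" step and the second is the "compute" step.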
What is the FFN?
The FFN is a shallow fully connected network applied to each token's vector. It is just two linear layers with a non-linear activation in between:
FFN(x) = Linear₂( activation( Linear₁(x) ) )
The key detail is the expand-then-contract shape:
- Linear₁ expands the vector from d_model up to a larger d_ff (commonly 4× bigger).
- An activation (ReLU or GELU) adds non-linearity.
- Linear₂ contracts it back down to d_model.
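To make the expand-then-contract shape concrete, here is a small sketch that follows a single token vector through the three steps. The sizes (d_model = 4, d_ff = 16), the random weights, and the choice of ReLU are assumptions made for illustration only.

```python
import numpy as np

d_model, d_ff = 4, 16  # d_ff = 4 * d_model, the common ratio

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)     # Linear1 weights
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)  # Linear2 weights

x = rng.normal(size=d_model)  # one token vector
h = x @ w1 + b1               # expand:   d_model -> d_ff
a = np.maximum(0.0, h)        # activate: ReLU zeroes the negative entries
y = a @ w2 + b2               # contract: d_ff -> d_model

print(x.shape, h.shape, a.shape, y.shape)  # (4,) (16,) (16,) (4,)
```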
Position-wise: Applied to Each Token Independently
A crucial point: the same FFN (the same weights) is applied to every token's vector separately. It does not mix tokens together — attention already did that. Because it's applied identically at every position, it's often called a position-wise feed-forward network.
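The position-wise property can be checked directly. The sketch below, using hypothetical sizes and random weights, shows two things: running the FFN on the whole sequence gives the same result as running it one token at a time, and changing one token leaves the outputs for all other tokens unchanged.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Works on a single vector (d_model,) or a sequence (seq_len, d_model).
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
w1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
tokens = rng.normal(size=(seq_len, d_model))

# Same weights, applied to the whole sequence vs. token by token.
all_at_once = ffn(tokens, w1, b1, w2, b2)
one_by_one = np.stack([ffn(t, w1, b1, w2, b2) for t in tokens])
same = np.allclose(all_at_once, one_by_one)

# Perturb token 0 only: the outputs at positions 1..5 do not move.
modified = tokens.copy()
modified[0] += 1.0
out_modified = ffn(modified, w1, b1, w2, b2)
others_unchanged = np.allclose(out_modified[1:], all_at_once[1:])

print(same, others_unchanged)  # True True
```

Attention would fail the second check, because every output position there depends on every input token.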
Why the FFN Matters
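- Non-linearity: without an activation, the two linear layers would collapse into a single linear map. The activation is what lets the model learn non-linear patterns.
- Capacity: the wide hidden layer (d_ff units) gives the model room to compute richer features for each token.
- Depth: each FFN applies a learned transformation between successive attention layers, so token representations are refined block by block.
- Parameters: in a standard block with d_ff = 4 × d_model, the two FFN matrices hold about 8 × d_model² weights, versus about 4 × d_model² for the attention projections, so roughly two-thirds of each block's weights sit in the FFN.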
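Two of these points are easy to verify numerically. The sketch below first shows that two linear layers without an activation equal one linear layer, and that adding ReLU breaks the equivalence. It then counts parameters for one block, assuming standard multi-head attention with four d_model × d_model projections (query, key, value, output) plus biases, and ignoring layer normalization and embeddings. The helper names and the sizes 512 and 2048 are illustrative choices, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(5, d_model))

# No activation: the two layers are equivalent to the single matrix w1 @ w2.
two_layers = (x @ w1) @ w2
one_layer = x @ (w1 @ w2)
collapses = np.allclose(two_layers, one_layer)

# With ReLU in between, no single matrix reproduces the result.
with_relu = np.maximum(0.0, x @ w1) @ w2
relu_differs = not np.allclose(with_relu, one_layer)
print(collapses, relu_differs)  # True True

def ffn_params(d_model, d_ff):
    # Linear1 (weights + bias) plus Linear2 (weights + bias).
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def attention_params(d_model):
    # Assumed: query, key, value, and output projections, each with a bias.
    return 4 * (d_model * d_model + d_model)

n_ffn = ffn_params(512, 2048)
n_attn = attention_params(512)
share = n_ffn / (n_ffn + n_attn)
print(n_ffn, n_attn)  # 2099712 1050624
```

Under these assumptions `share` comes out at about two-thirds.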
Code Example
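A minimal NumPy sketch of the FFN applied to a batch of sequences. The class name `FeedForward`, the tanh approximation of GELU, the weight initialization, and the sizes are all assumptions made for this example. A deep learning framework would typically provide the linear layers and activation for you.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class FeedForward:
    def __init__(self, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Linear1: d_model -> d_ff (expand)
        self.w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
        self.b1 = np.zeros(d_ff)
        # Linear2: d_ff -> d_model (contract)
        self.w2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        # x has shape (..., d_model); matmul acts on the last axis only,
        # so every token in every sequence is transformed independently.
        hidden = gelu(x @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

ffn = FeedForward(d_model=512, d_ff=2048)
x = np.random.default_rng(1).normal(size=(2, 10, 512))  # (batch, seq_len, d_model)
out = ffn(x)

print(x.shape)    # (2, 10, 512)
print(out.shape)  # (2, 10, 512)
```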
Notice the output has the same shape as the input — the FFN transforms each token's vector but keeps the dimensions, so blocks can stack cleanly.
Key Points
| Term | Meaning |
|---|---|
| d_model | Size of each token vector (in and out) |
| d_ff | Size of the wide hidden layer (often 4× d_model) |
| Activation | ReLU or GELU — adds non-linearity |
| Position-wise | Same FFN applied to every token independently |
Summary
- The FFN is the second main part of a Transformer block, after attention.
- It's two linear layers with an activation in between: expand → activate → contract.
- It's position-wise — applied to each token independently, with shared weights.
- It adds non-linearity, capacity, and depth, and holds a large share of the model's parameters.
- Attention communicates between tokens; the FFN computes on each token — together they form the block.