Feed-Forward Network (FFN)

The Transformer Block is the repeating unit that makes up the encoder and decoder stacks. Each block has two main parts: an attention layer and a Feed-Forward Network (FFN), held together by residual connections and layer normalization (the next two subtopics). This article covers the FFN.

The idea is simple but important: attention mixes information between tokens, and then the FFN processes each token on its own. First the tokens communicate, then each one is transformed individually.

💡 In one line: The FFN is a small two-layer network applied to each token separately, adding non-linearity and computing power after attention has shared information.

Where the FFN Sits in the Block

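A Transformer block runs in this order (a decoder block in an encoder-decoder model also has a cross-attention sublayer between the two steps):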

  1. (Masked) Self-Attention → Add & Norm
  2. Feed-Forward Network → Add & Norm

So attention comes first (tokens exchange information), then the FFN (each token is transformed). A useful way to remember it: attention = communicate, FFN = compute.
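
The ordering above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the attention is a single head with identity projections, the layer norm has no learned scale or shift, and the helper names (layer_norm, toy_self_attention, toy_ffn, transformer_block) are hypothetical, chosen here for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned parameters).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_self_attention(x):
    # Single-head attention with identity projections: every token looks at every token.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def toy_ffn(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied to each row (token) on its own.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_block(x, w1, b1, w2, b2):
    x = layer_norm(x + toy_self_attention(x))        # 1. attention -> Add & Norm
    x = layer_norm(x + toy_ffn(x, w1, b1, w2, b2))   # 2. FFN -> Add & Norm
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

out = transformer_block(x, w1, b1, w2, b2)
print(out.shape)  # (5, 8)
```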

What is the FFN?

The FFN is a small fully-connected network applied to each token's vector. It consists of two linear layers with a non-linear activation in between:

FFN(x) = Linear₂( activation( Linear₁(x) ) )


The key detail is the expand-then-contract shape:

  1. Linear₁ expands the vector from d_model up to a larger d_ff (commonly 4 × d_model; the original Transformer uses 512 → 2048).
  2. An activation (ReLU or GELU) adds non-linearity.
  3. Linear₂ contracts it back down to d_model.
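
To make the three steps concrete, here is a small shape trace for a single token vector. The sizes (d_model = 4, d_ff = 16), the random weights, and the choice of ReLU are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16  # d_ff = 4 * d_model

# Random weights stand in for trained parameters.
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=d_model)   # one token vector
h = x @ w1 + b1                # 1. Linear1 expands d_model -> d_ff
a = np.maximum(0.0, h)         # 2. ReLU zeroes out the negative entries
y = a @ w2 + b2                # 3. Linear2 contracts d_ff -> d_model

print(x.shape, h.shape, a.shape, y.shape)  # (4,) (16,) (16,) (4,)
```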

Position-wise: Applied to Each Token Independently

A crucial point: the same FFN (the same weights) is applied to every token's vector separately. It does not mix tokens together — attention already did that. Because it's applied identically at every position, it's often called a position-wise feed-forward network.
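
One way to check the position-wise property is to compare two ways of running the same FFN: once on the whole sequence, and once token by token. The sketch below assumes a ReLU FFN with random weights; the function name ffn and all sizes are illustrative.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Works on a single vector (d_model,) or a whole sequence (seq_len, d_model).
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
w1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
tokens = rng.normal(size=(seq_len, d_model))

# Same weights, applied to the full sequence vs. one token at a time.
whole = ffn(tokens, w1, b1, w2, b2)
one_by_one = np.stack([ffn(t, w1, b1, w2, b2) for t in tokens])
same_result = np.allclose(whole, one_by_one)
print(same_result)  # True

# Changing token 0 leaves the outputs for all other tokens untouched.
changed = tokens.copy()
changed[0] += 1.0
others_unchanged = np.allclose(whole[1:], ffn(changed, w1, b1, w2, b2)[1:])
print(others_unchanged)  # True
```

The second check is what "does not mix tokens" means in practice: an output row depends only on the input row at the same position.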

Why the FFN Matters

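  • Non-linearity: without the activation, the two linear layers would collapse into a single linear map, because a composition of linear maps is itself linear. The activation is what lets the model learn non-linear patterns.
  • Capacity: the wide hidden layer (d_ff larger than d_model) gives the model room to compute richer features for each token.
  • Depth: it transforms each token's representation between attention layers, so every block adds another stage of per-token computation.
  • Parameters: in a standard block with d_ff = 4 × d_model, the FFN's two weight matrices hold about 8 × d_model² parameters, against about 4 × d_model² for the attention projections, so roughly two-thirds of each block's weights sit in the FFN.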
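
Two of these points can be checked numerically. The sketch below first shows that two linear layers without an activation equal one merged linear layer, then counts parameters for one block. The sizes 512 and 2048, and the assumption that attention consists of four d_model × d_model projections with biases, are illustrative choices rather than a fixed rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
x = rng.normal(size=(5, d_model))

# No activation: the two layers merge into a single weight matrix and bias.
two_layers = (x @ w1 + b1) @ w2 + b2
one_layer = x @ (w1 @ w2) + (b1 @ w2 + b2)
collapses = np.allclose(two_layers, one_layer)

# With ReLU in between, the result is no longer that single linear map.
with_relu = np.maximum(0.0, x @ w1 + b1) @ w2 + b2
relu_differs = not np.allclose(with_relu, one_layer)
print(collapses, relu_differs)  # True True

def ffn_params(d_model, d_ff):
    # Weights and biases of Linear1 and Linear2.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def attention_params(d_model):
    # Assumed layout: query, key, value and output projections, each with a bias.
    return 4 * (d_model * d_model + d_model)

ffn_n = ffn_params(512, 2048)      # 2099712
attn_n = attention_params(512)     # 1050624
share = ffn_n / (ffn_n + attn_n)
print(f"FFN share of block parameters: {share:.2f}")  # 0.67
```

The count ignores layer-norm and embedding parameters, so the share across a whole model will differ somewhat.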

Code Example
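
A minimal sketch of the FFN in NumPy, assuming a GELU activation (tanh approximation), random untrained weights, and the sizes d_model = 512 and d_ff = 2048. The class name FeedForward and the initialization scale are hypothetical choices for this example, not a reference implementation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class FeedForward:
    def __init__(self, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Linear1: d_model -> d_ff (expand)
        self.w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
        self.b1 = np.zeros(d_ff)
        # Linear2: d_ff -> d_model (contract)
        self.w2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        # x has shape (..., d_model); the matmul acts on the last axis only,
        # so every token is transformed independently with the same weights.
        hidden = gelu(x @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

ffn = FeedForward(d_model=512, d_ff=2048)
x = np.random.default_rng(1).normal(size=(2, 10, 512))  # (batch, seq_len, d_model)
out = ffn(x)

print("input :", x.shape)    # input : (2, 10, 512)
print("output:", out.shape)  # output: (2, 10, 512)
```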


Notice the output has the same shape as the input — the FFN transforms each token's vector but keeps the dimensions, so blocks can stack cleanly.

Key Points

Term            Meaning
-------------   ------------------------------------------------
d_model         Size of each token vector (in and out)
d_ff            Size of the wide hidden layer (often 4× d_model)
Activation      ReLU or GELU (adds non-linearity)
Position-wise   Same FFN applied to every token independently

Summary

  • The FFN is the second main part of a Transformer block, after attention.
  • It's two linear layers with an activation in between: expand → activate → contract.
  • It's position-wise — applied to each token independently, with shared weights.
  • It adds non-linearity, capacity, and depth, and holds a large share of the model's parameters.
  • Attention communicates between tokens; the FFN computes on each token — together they form the block.