Feed Forward Network (FFN)

Last updated: Jun 27, 2026

Author: Vinay Adari

The Transformer Block is the repeating unit that makes up the encoder and decoder stacks. Each block has two main parts: an attention layer and a Feed-Forward Network (FFN), held together by residual connections and layer normalization (the next two subtopics). This article covers the FFN.

The idea is simple but important: attention mixes information between tokens, and then the FFN processes each token on its own. First the tokens communicate, then each one is transformed individually.

💡 In one line: The FFN is a small two-layer network applied to each token separately, adding non-linearity and computing power after attention has shared information.

Where the FFN Sits in the Block

A Transformer block runs in this order:

(Masked) Self-Attention → Add & Norm
Feed-Forward Network → Add & Norm

So attention comes first (tokens exchange information), then the FFN (each token is transformed). A useful way to remember it: attention = communicate, FFN = compute.
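The ordering above can be sketched in a few lines of NumPy. This is only an illustration of the data flow: `layer_norm` here has no learned scale or shift, and `toy_attention` and `toy_ffn` are hypothetical stand-ins for the real sub-layers, not actual attention or a trained FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (simplified: no learned scale or shift parameters).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, ffn):
    # Sub-layer 1: tokens exchange information, then Add & Norm.
    x = layer_norm(x + attention(x))
    # Sub-layer 2: each token is transformed on its own, then Add & Norm.
    x = layer_norm(x + ffn(x))
    return x

def toy_attention(x):
    # Hypothetical stand-in: every token receives the average of all tokens.
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)

def toy_ffn(x):
    # Hypothetical stand-in: a per-token non-linearity with no weights.
    return np.maximum(x, 0.0)

x = np.random.default_rng(0).normal(size=(5, 8))  # 5 tokens, d_model = 8
y = transformer_block(x, toy_attention, toy_ffn)
print(y.shape)  # (5, 8)
```

Both sub-layers are wrapped the same way: the sub-layer's output is added to its input (the residual connection) and the sum is normalized.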

What is the FFN?

The FFN is a shallow fully connected network applied to each token's vector: two linear layers with a non-linear activation in between.

FFN(x) = Linear₂( activation( Linear₁(x) ) )

The key detail is the expand-then-contract shape:

Linear₁ expands the vector from d_model up to a larger d_ff (commonly 4 × d_model, for example 512 → 2048 in the original Transformer).
An activation (ReLU or GELU) adds non-linearity.
Linear₂ contracts it back down to d_model.
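Written out with explicit weights, the three steps look like this. The symbols W₁, b₁, W₂, b₂ (weights and biases of the two linear layers) and φ (the activation) are notation introduced here for illustration, with x treated as a row vector:

```latex
\[
\mathrm{FFN}(x) = \phi(x W_1 + b_1)\, W_2 + b_2,
\qquad
W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}},\quad
W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}
\]
\[
\underbrace{x}_{d_{\text{model}}}
\;\xrightarrow{\ \text{Linear}_1\ }\;
\underbrace{x W_1 + b_1}_{d_{\text{ff}}}
\;\xrightarrow{\ \phi\ }\;
\underbrace{\phi(x W_1 + b_1)}_{d_{\text{ff}}}
\;\xrightarrow{\ \text{Linear}_2\ }\;
\underbrace{\mathrm{FFN}(x)}_{d_{\text{model}}}
\]
```

With d_model = 512 and d_ff = 2048, a token's vector goes 512 → 2048 → 2048 → 512.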

Position-wise: Applied to Each Token Independently

A crucial point: the same FFN (the same weights) is applied to every token's vector separately. It does not mix tokens together — attention already did that. Because it's applied identically at every position, it's often called a position-wise feed-forward network.
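A quick way to see what position-wise means is to check it numerically. The sketch below uses small random weights (the sizes and the `ffn` helper are assumptions made for illustration) and confirms two things: running the FFN on the whole sequence matches running it token by token, and changing one token changes only that token's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5

# One set of weights, shared by every position.
W1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    # Works on a single token (d_model,) or a sequence (seq_len, d_model).
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))

# 1) Whole sequence at once vs. one token at a time.
whole = ffn(x)
one_by_one = np.stack([ffn(token) for token in x])
same = np.allclose(whole, one_by_one)
print(same)  # True

# 2) Perturb only token 2 and see which output rows change.
x_changed = x.copy()
x_changed[2] += 1.0
changed_rows = ~np.isclose(ffn(x_changed), whole).all(axis=-1)
print(changed_rows)  # only index 2 is True
```

Attention would fail the second check, since changing one token there affects the outputs at other positions.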

Why the FFN Matters

Non-linearity: without the activation, the two linear layers would collapse into a single linear map. The activation is what lets the FFN learn non-linear patterns.
Capacity: the wide hidden layer gives the model room to compute richer features for each token.
Depth: it adds a per-token transformation between successive attention layers, so each block refines the representations further.
Parameters: a large share of a Transformer's parameters live in the FFN layers. With d_ff = 4 × d_model, the FFN holds roughly two thirds of a standard block's weights.
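The non-linearity and parameter points above can be checked with a short sketch. It assumes a standard block in which attention has four d_model × d_model projections (query, key, value, output) with biases, and it ignores layer-norm and embedding parameters. The helper names `ffn_params` and `attention_params` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(5, d_model))

# Without an activation, two linear layers equal ONE linear layer
# whose weight matrix is W1 @ W2.
no_activation = (x @ W1) @ W2
single_layer = x @ (W1 @ W2)
collapses = np.allclose(no_activation, single_layer)
print(collapses)  # True

# With ReLU in between, the result is no longer that single linear map.
with_relu = np.maximum(x @ W1, 0.0) @ W2
relu_differs = not np.allclose(with_relu, single_layer)
print(relu_differs)  # True

def ffn_params(d_model, d_ff):
    # Linear1 weights + bias, Linear2 weights + bias.
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def attention_params(d_model):
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    return 4 * (d_model * d_model + d_model)

d = 512
ffn_n = ffn_params(d, 4 * d)
attn_n = attention_params(d)
share = ffn_n / (ffn_n + attn_n)
print(ffn_n, attn_n)    # 2099712 1050624
print(f"{share:.1%}")   # 66.7%
```

Under these assumptions the FFN holds about twice as many weights as the attention layer in the same block.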

Code Example
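A minimal NumPy sketch of the FFN described above. The class name `FeedForward`, the random initialization, and the choice of ReLU are assumptions made for illustration. This is not a trained model or any particular library's implementation.

```python
import numpy as np

class FeedForward:
    """Position-wise FFN: expand -> activate -> contract."""

    def __init__(self, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Linear1: d_model -> d_ff (random weights, illustration only)
        self.W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
        self.b1 = np.zeros(d_ff)
        # Linear2: d_ff -> d_model
        self.W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        hidden = x @ self.W1 + self.b1       # expand to d_ff
        hidden = np.maximum(hidden, 0.0)     # ReLU activation
        return hidden @ self.W2 + self.b2    # contract back to d_model

ffn = FeedForward(d_model=512, d_ff=2048)

# A batch of 2 sequences, 10 tokens each, every token a 512-dim vector.
x = np.random.default_rng(1).normal(size=(2, 10, 512))
out = ffn(x)

print(x.shape)    # (2, 10, 512)
print(out.shape)  # (2, 10, 512)
```

The matrix multiplications act on the last axis only, so the same weights are applied to every token in every sequence.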

Notice that the FFN's output has the same shape as its input: each token's vector is transformed, but its dimension stays d_model, so blocks can stack cleanly.

Key Points

Term	Meaning
d_model	Size of each token vector (in and out)
d_ff	Size of the wide hidden layer (often 4× d_model)
Activation	ReLU or GELU — adds non-linearity
Position-wise	Same FFN applied to every token independently

Summary

The FFN is the second main part of a Transformer block, after attention.
It's two linear layers with an activation in between: expand → activate → contract.
It's position-wise — applied to each token independently, with shared weights.
It adds non-linearity, capacity, and depth, and holds a large share of the model's parameters.
Attention communicates between tokens; the FFN computes on each token — together they form the block.