Self-Attention

Self-attention is the heart of the Transformer — the mechanism that replaced recurrence and made everything else possible. It lets each token in a sequence look at every other token and decide how much to pay attention to each one, building a context-aware representation. After self-attention, a word's vector is no longer just that word — it's that word understood in the context of the whole sentence.

💡 In one line: Self-attention lets each token gather information from all the other tokens, weighted by how relevant each one is.

The Intuition

Consider this sentence:

"The animal didn't cross the street because it was too tired."

What does "it" refer to — the animal or the street? To represent "it" correctly, the model must look back and focus on "animal." Self-attention does exactly this: for every token, it figures out which other tokens are relevant and blends in their information. Each token's new representation is a weighted mix of all the tokens, with more weight on the relevant ones.

What "Self" Means

It's called self-attention because each token attends to the tokens of its own sequence, itself included. (This differs from cross-attention, covered later, where one sequence attends to a different sequence.)

How Self-Attention Works

For each token, self-attention follows three steps:

  1. Score — compare the token with every token in the sequence, itself included, to get one relevance score per pair (how related are they?).
  2. Normalize — pass the scores through softmax so they become weights between 0 and 1 that sum to 1.
  3. Blend — take a weighted sum of all the tokens' information, using those weights.

So each token "absorbs" context from the others in proportion to relevance. The comparison itself is done using Query, Key, and Value vectors — the subject of the very next subtopic. For now, the key idea is: score → weight → weighted sum.
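
The normalize and blend steps can be made concrete with a small sketch. The scores and values below are made-up numbers chosen for illustration, not output from a trained model, and `softmax` is written out by hand so the example has no dependencies.

```python
import math

def softmax(scores):
    """Map raw relevance scores to positive weights that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of one token against three tokens.
scores = [2.0, 1.0, 0.1]
weights = softmax(scores)

# Blend: a weighted sum of (made-up) one-dimensional token values.
values = [10.0, 20.0, 30.0]
blended = sum(w * v for w, v in zip(weights, values))

print([round(w, 3) for w in weights])  # [0.659, 0.242, 0.099]
```

The highest score gets the largest weight, but the lower-scoring tokens still contribute a little to the blend.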

The standard formula (full details next) is scaled dot-product attention, where dₖ is the dimension of the key vectors:

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V
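
As a rough sketch, the formula can be translated term by term into plain Python. The function name `scaled_dot_product_attention` and the toy inputs are assumptions for illustration. Q, K, and V are passed in directly as lists of row vectors, and the learned projections that normally produce them are left out.

```python
import math

def softmax(row):
    """Row of scores -> positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    # Q K^T / sqrt(d_k): one row of scores per query.
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Row-wise softmax turns the scores into attention weights.
    weights = [softmax(row) for row in scores]
    # Multiplying by V: each output row is a weighted sum of the value rows.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

# Hypothetical toy inputs: 2 queries, 3 key/value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Practical implementations use an array library and batched matrix multiplication, but the arithmetic is the same.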


The Attention Matrix

Self-attention produces an n × n matrix of weights (for n tokens):

  • Rows = the token doing the looking (the "query").
  • Columns = the token being looked at.
  • Each value = how much attention the row token pays to the column token. Because of the softmax, the weights in each row sum to 1.

Reading a row tells you what that token focused on. In our example, the row for "it" would have a high weight on "animal."
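
Reading the matrix row by row might look like the sketch below. The three tokens and the weights are written by hand purely for illustration and do not come from a real model. `most_attended` is a hypothetical helper name.

```python
# Hypothetical tokens and hand-written attention weights, for illustration only.
tokens = ["animal", "street", "it"]
attention = [
    [0.80, 0.10, 0.10],  # row for "animal"
    [0.15, 0.75, 0.10],  # row for "street"
    [0.70, 0.10, 0.20],  # row for "it": most of its weight is on "animal"
]

def most_attended(attention, tokens):
    """For each row (the token doing the looking), return the column token with the largest weight."""
    result = {}
    for looker, row in zip(tokens, attention):
        best = max(range(len(row)), key=lambda j: row[j])
        result[looker] = tokens[best]
    return result

focus = most_attended(attention, tokens)
print(focus["it"])  # animal
```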

Code Example (Simplified)
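
A minimal sketch in plain Python, under simplifying assumptions: the token vectors themselves serve as queries, keys, and values, and the √dₖ scaling is skipped. The function name `simple_self_attention` and the toy vectors are made up for illustration.

```python
import math

def softmax(row):
    """Row of scores -> positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(X):
    """Simplified self-attention over a list of token vectors X."""
    # Score: dot product of every token with every token (an n x n table).
    scores = [[sum(a * b for a, b in zip(x_i, x_j)) for x_j in X] for x_i in X]
    # Normalize: softmax over each row.
    weights = [softmax(row) for row in scores]
    # Blend: each new vector is a weighted sum of all token vectors.
    dim = len(X[0])
    new_X = [[sum(w * x[d] for w, x in zip(w_row, X)) for d in range(dim)]
             for w_row in weights]
    return new_X, weights

# Hypothetical 2-dimensional vectors for three tokens.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_X, weights = simple_self_attention(X)
```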


This is the core idea in a few lines. (Real self-attention uses learned Q/K/V projections — explained next.)

Why It's So Powerful

  • Direct access — every token can reach every other token in one step. In an RNN, information has to pass through every intermediate step and tends to fade along the way.
  • Parallel — all tokens are processed at once.
  • Context-aware — the same word gets a different representation depending on context (e.g. "bank" in "river bank" vs. "bank account").
  • The foundation — it's the core operation of every Transformer.
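
The context-aware point can be shown with a toy sketch. The 2-dimensional "embeddings" for river, account, and bank are invented numbers with no real meaning, and `contextual_vector` is a hypothetical helper that applies unscaled, projection-free attention for a single token.

```python
import math

def softmax(row):
    """Row of scores -> positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vector(index, X):
    """New representation of the token at `index`, attending over all vectors in X."""
    scores = [sum(a * b for a, b in zip(X[index], x)) for x in X]
    weights = softmax(scores)
    return [sum(w * x[d] for w, x in zip(weights, X)) for d in range(len(X[index]))]

# Made-up embeddings: the same "bank" vector is used in both contexts.
river, account, bank = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

bank_near_river = contextual_vector(1, [river, bank])      # "river bank"
bank_near_account = contextual_vector(0, [bank, account])  # "bank account"
```

The input vector for "bank" is identical in both calls, yet the two outputs differ because each one blends in a different neighbor.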

Summary

  • Self-attention lets each token attend to all tokens in the same sequence, building context-aware representations.
  • It works in three steps: score relevance → softmax into weights → weighted sum.
  • Along the way it produces an n × n attention matrix showing which tokens focus on which; the final output is one new, context-aware vector per token.
  • The scoring uses Query, Key, Value vectors — covered next.
  • It's direct, parallel, and context-aware, which is why it replaced recurrence and powers all Transformers.