BERT (Encoder)

BERT, short for Bidirectional Encoder Representations from Transformers, is an encoder-only Transformer released by Google in 2018. It uses only the encoder stack and reads text bidirectionally, so each token's representation draws on the context to both its left and its right. BERT is built for understanding tasks such as classification, question answering, and search, rather than for generating text.

💡 In one line: BERT is an encoder-only Transformer that reads text in both directions to deeply understand it — great for understanding tasks, not generation.

Encoder-Only Architecture

BERT keeps just the encoder half of the Transformer. That gives it two defining traits:

  • Bidirectional self-attention — every token can attend to all other tokens, both before and after it.
  • No decoder — so BERT doesn't generate text token by token. Instead, it outputs rich contextual embeddings (one meaningful vector per token) that downstream tasks build on.
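
To make the two traits above concrete, here is a toy sketch of unmasked self-attention in plain Python. It is a simplification, not BERT's actual layer: the 2-dimensional vectors are made up, and queries, keys, and values are all the raw inputs (real BERT uses learned projections and many attention heads). What it does show is that every token gets a weight for every other token, and that the output is one contextual vector per token.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unmasked self-attention where queries = keys = values = the inputs."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for q in vectors:
        # Score this token against every token, before and after it.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        row = softmax(scores)
        weights.append(row)
        # Contextual vector = attention-weighted average of all token vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)])
    return weights, outputs

tokens = ["the", "river", "bank"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings

weights, contextual = self_attention(toy_vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row has a non-zero weight in every column, including the columns for later tokens. A left-to-right model would zero out those later columns.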

What Makes BERT Special: Bidirectionality

Many earlier language models read left to right only, so a word's representation could not depend on the words that follow it. BERT attends over the whole sentence at once, so each word's representation is shaped by the context on both sides. This helps resolve ambiguity:

  • "river bank" vs. "bank account" — BERT distinguishes the two meanings of "bank" because it sees the surrounding words on both sides.
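
The difference is easy to see with a small sketch. The two sentences and the helper `visible_context` below are invented for illustration. The helper lists the words a model can use when it builds the representation of the token at a given position.

```python
def visible_context(tokens, index, bidirectional):
    """Return the other tokens available when encoding tokens[index]."""
    left = tokens[:index]
    # A left-to-right model cannot look at anything after the current position.
    right = tokens[index + 1:] if bidirectional else []
    return left + right

finance = "the bank account was empty".split()
river = "the bank of the river flooded".split()
pos = 1  # "bank" sits at index 1 in both sentences

# Left to right: both occurrences of "bank" see exactly the same context.
print(visible_context(finance, pos, bidirectional=False))  # ['the']
print(visible_context(river, pos, bidirectional=False))    # ['the']

# Bidirectional: the disambiguating words on the right become visible.
print(visible_context(finance, pos, bidirectional=True))
print(visible_context(river, pos, bidirectional=True))
```

With only the left context, the two uses of "bank" are indistinguishable. With both sides visible, "account" and "river" are available to tell them apart.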

How BERT is Pre-trained

BERT learns from large amounts of unlabeled text (the original model used BooksCorpus and English Wikipedia) using two self-supervised objectives:

  1. Masked Language Modeling (MLM) — randomly hide about 15% of the tokens and train BERT to predict them from the surrounding context. This forces true bidirectional understanding.
    • "The [MASK] sat on the mat." → predict "cat".
  2. Next Sentence Prediction (NSP) — given two sentences, predict whether the second naturally follows the first. (Helpful for sentence-pair tasks; later variants like RoBERTa dropped it.)
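
Both objectives only need raw text, because the training examples are built automatically. The sketch below shows one way to build them. It works on whitespace-split words rather than real WordPiece tokens, and the function names, the tiny vocabulary, and the seeds are assumptions made for illustration. One detail beyond the summary above: in the original BERT recipe, a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one MLM example: (corrupted inputs, labels). Label None = not scored."""
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_select)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]          # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = MASK             # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

def make_nsp_pairs(sentences, seed=0):
    """Build NSP examples (first, second, is_next). Needs at least 3 sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence that is not the true successor.
            others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(others), False))
    return pairs

words = "the cat sat on the mat and looked at the dog".split()
inputs, labels = mask_tokens(words, vocab=["tree", "car", "blue"])
print(inputs)
print(labels)
```

The loss is computed only at positions where the label is not None, which is why the model has to rely on the surrounding words on both sides.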

Fine-tuning BERT for Tasks

BERT's strength is "pre-train once, adapt many times." You take the pre-trained model and fine-tune it for a specific task by adding a small task-specific "head" (typically one extra layer) on top:

  • Text classification (e.g. sentiment analysis)
  • Named Entity Recognition (tagging people, places)
  • Question Answering (finding answers in a passage)
  • Sentence similarity / embeddings
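
For classification, the head can be as small as one linear layer followed by a softmax. The sketch below assumes a 3-dimensional stand-in for the vector BERT outputs at the first position (BERT-base actually uses 768 dimensions), and the weights are made up. In real fine-tuning the head's weights start random and are trained together with BERT.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(sentence_vector, weights, bias):
    """Linear layer + softmax: one row of weights and one bias per class."""
    logits = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Stand-in for the encoder's sentence-level output vector (made-up numbers).
sentence_vector = [0.5, -1.0, 2.0]

# Hypothetical head for two classes: index 0 = negative, index 1 = positive.
weights = [[0.1, 0.2, -0.3],
           [-0.1, 0.4, 0.5]]
bias = [0.0, 0.0]

probs = classification_head(sentence_vector, weights, bias)
print({"negative": round(probs[0], 3), "positive": round(probs[1], 3)})
```

Token-level tasks such as Named Entity Recognition follow the same pattern, except that the head is applied to every token's vector instead of a single sentence-level vector.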

Special Tokens

BERT uses a few special tokens:

  • [CLS] — added at the start; its output vector is used for classification.
  • [SEP] — separates two sentences.
  • [MASK] — marks a hidden token during MLM.
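
Here is a small sketch of how these tokens frame the input. The helper name `build_bert_input` is hypothetical and the words are not real WordPiece tokens. It also shows one extra detail: BERT closes the input with a final [SEP] and tags each position with a segment id (0 for the first sentence, 1 for the second).

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) laid out as [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        # The second sentence and its closing [SEP] belong to segment 1.
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_bert_input(["how", "are", "you"], ["i", "am", "fine"])
print(tokens)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']
print(segment_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In practice a tokenizer from a library builds this layout for you; the sketch only makes the structure visible.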

Code Example
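
A minimal sketch of a masked-word demo, assuming the Hugging Face `transformers` library with its `fill-mask` pipeline and the `bert-base-uncased` checkpoint. The helper name `fill_mask` and the choice of checkpoint are assumptions, and any BERT-style masked language model should work the same way.

```python
def fill_mask(sentence, model_name="bert-base-uncased", top_k=3):
    """Return the top_k (word, score) guesses for the [MASK] in `sentence`."""
    # Imported here so that defining the function does not require the library.
    from transformers import pipeline

    # Loads the model and tokenizer (downloaded on first use).
    unmasker = pipeline("fill-mask", model=model_name)

    # Each result is a dict that includes the predicted token and its probability.
    results = unmasker(sentence, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `fill_mask("The [MASK] sat on the mat.")` returns BERT's highest-scoring candidates for the hidden word together with their probabilities. The sentence must contain exactly one [MASK] for the result to have this shape.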


This runs BERT directly (with pip install transformers) and shows it filling in the masked word from context.

BERT vs. GPT

Aspect          | BERT                     | GPT
Architecture    | Encoder-only             | Decoder-only
Reads           | Bidirectionally          | Left to right
Trained on      | Masked language modeling | Next-token prediction
Best at         | Understanding tasks      | Generating text
Generates text? | No                       | Yes

BERT Variants

The original BERT inspired many improvements: RoBERTa (better training), DistilBERT (smaller and faster), and ALBERT (more parameter-efficient), among others.

Summary

  • BERT is an encoder-only Transformer that reads text bidirectionally.
  • It's pre-trained with Masked Language Modeling (and originally Next Sentence Prediction).
  • It produces rich contextual embeddings and is fine-tuned for understanding tasks like classification, QA, and NER.
  • It uses special tokens [CLS], [SEP], and [MASK].
  • BERT is built for understanding, while GPT (decoder-only, covered next) is built for generation.