Limitations of RNN/LSTM

Before Transformers took over, RNNs and LSTMs were the standard tools for handling sequential data like text and speech. They were a genuine breakthrough, giving neural networks memory, but they came with fundamental limitations that capped their performance on long sequences and made them painfully slow to train. These very limitations are what motivated the invention of the Transformer. To understand why Transformers are designed the way they are, you first need to understand what was wrong with what came before.

💡 In one line: RNNs and LSTMs process text one step at a time and struggle to remember distant context, which makes them slow to train and weak on long sequences.

(Quick recap: an RNN reads a sequence token by token, carrying a hidden state forward as its "memory." An LSTM is an improved RNN with gates that help it remember longer. See the RNN and LSTM articles for the basics.)

1. Sequential Processing: No Parallelism

This is the biggest practical limitation. An RNN must process a sequence one token at a time, because each step depends on the hidden state produced by the previous step. You cannot compute step 5 until step 4 is finished.

This means:

  • Training cannot be parallelised across the sequence.
  • Modern GPUs, which are built for doing thousands of calculations at once, are left badly underused, because the time steps must run one after another.
  • Training on huge datasets becomes extremely slow.
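The dependency between steps can be made concrete with a minimal sketch. It assumes a toy scalar RNN with hand-picked weights (`w_h`, `w_x`) and a `tanh` activation; these names and values are illustrative, not taken from any particular library.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """One step of a toy scalar RNN: h_t = tanh(w_h * h_prev + w_x * x)."""
    return math.tanh(w_h * h_prev + w_x * x)

def run_rnn(inputs, h0=0.0):
    """Process the inputs strictly in order, returning every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # Step t needs the hidden state from step t-1, so the iterations
        # of this loop cannot be computed at the same time.
        h = rnn_step(h, x)
        states.append(h)
    return states

states = run_rnn([0.1, -0.3, 0.7, 0.2, -0.5])
```

However much hardware is available, the loop body for step 5 has nothing to work on until step 4 has returned its hidden state.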

2. Long-Range Dependencies Fade

In language, words can depend on other words that are far apart. Consider:

"I grew up in France, so I speak fluent ____."

To fill in "French," the model must connect the last word to "France" near the start of the sentence. In an RNN, that information has to travel through every intermediate step, and a little is lost at each one. Over long distances, the early context fades; even LSTMs, with their gates, weaken on very long sequences.
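The fading can be observed numerically. The sketch below assumes a toy scalar RNN with a hypothetical recurrent weight of 0.5 (not a trained model): it nudges only the first input and measures how much the final hidden state moves as more steps are placed in between.

```python
import math

def final_state(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN and return only its last hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_token_influence(gap, eps=1e-3):
    """Change in the final state when only the first input is nudged by eps.

    `gap` is the number of (zero-valued) inputs that follow the first one.
    """
    base = [1.0] + [0.0] * gap
    nudged = [1.0 + eps] + [0.0] * gap
    return abs(final_state(nudged) - final_state(base))

influences = {gap: first_token_influence(gap) for gap in (0, 5, 10, 20)}

for gap, change in influences.items():
    print(f"gap={gap:2d}  change in final state={change:.3e}")
```

In this toy setup each extra step shrinks the first token's influence by at least half, so after twenty steps almost nothing of it is left in the final state.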

3. The Fixed-Size Memory Bottleneck

An RNN squeezes everything it has read so far into a single, fixed-size hidden state vector. For a short sentence that's fine, but for a long paragraph it's like trying to summarise an entire book in one sentence: detail is inevitably lost.

This is especially damaging in sequence-to-sequence tasks (like translation), where the encoder must compress the whole input into one vector before the decoder even starts. That single vector becomes a bottleneck.
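A rough sketch of the bottleneck, assuming a toy RNN encoder with random, untrained weights and hypothetical sizes (`HIDDEN_SIZE = 4`, `EMBED_SIZE = 3`): whether 5 or 500 token vectors go in, the encoder hands over the same small number of values.

```python
import math
import random

HIDDEN_SIZE = 4  # hypothetical hidden-state size
EMBED_SIZE = 3   # hypothetical size of each input token vector

def make_weights(rows, cols, rng):
    """Random weight matrix as a list of rows (illustrative, not trained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode(tokens, w_h, w_x):
    """Toy RNN encoder: reads every token but returns only the final hidden state."""
    h = [0.0] * HIDDEN_SIZE
    for x in tokens:
        h = [
            math.tanh(
                sum(w_h[i][j] * h[j] for j in range(HIDDEN_SIZE))
                + sum(w_x[i][k] * x[k] for k in range(EMBED_SIZE))
            )
            for i in range(HIDDEN_SIZE)
        ]
    return h

rng = random.Random(0)
w_h = make_weights(HIDDEN_SIZE, HIDDEN_SIZE, rng)
w_x = make_weights(HIDDEN_SIZE, EMBED_SIZE, rng)

short_input = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(5)]
long_input = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(500)]

short_summary = encode(short_input, w_h, w_x)
long_summary = encode(long_input, w_h, w_x)

# Both summaries hold HIDDEN_SIZE numbers, regardless of input length.
print(len(short_summary), len(long_summary))
```

The long input carries a hundred times more tokens, yet a decoder reading `long_summary` gets no more room to work with than one reading `short_summary`.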

4. Vanishing and Exploding Gradients

RNNs are trained with backpropagation through time, sending gradients backwards across every step. Over many steps, those gradients tend to shrink toward zero (vanishing) or blow up (exploding). LSTMs were designed to ease this, but they don't fully solve it for very long sequences, so long-range learning remains unstable and slow.
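The arithmetic behind this is repeated multiplication. As a deliberately simplified sketch, assume the gradient is scaled by the same constant factor at every step; in a real RNN the gradient is multiplied by a different Jacobian matrix at each step, but the compounding effect is the same.

```python
def gradient_scale(per_step_factor, steps):
    """Overall scaling of a gradient multiplied by the same factor at every step."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale

vanishing = gradient_scale(0.9, 100)  # roughly 2.7e-5
exploding = gradient_scale(1.1, 100)  # roughly 1.4e4

print(f"factor 0.9 over 100 steps: {vanishing:.2e}")
print(f"factor 1.1 over 100 steps: {exploding:.2e}")
```

A per-step factor only slightly below 1 leaves almost no learning signal after 100 steps, while one slightly above 1 produces updates large enough to destabilise training.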

5. No Direct Access to Any Position

An RNN can only "see" earlier tokens indirectly, through the compressed hidden state. There's no way for a token to directly look back at a specific earlier word based on its content. If word 50 needs information from word 2, it can only get a faded echo of it, never a direct connection.

Summary of the Limitations

Limitation                      | Why it's a problem
--------------------------------|--------------------------------------------
Sequential processing           | No parallelism → slow training, idle GPUs
Long-range dependencies         | Distant context fades step by step
Fixed-size memory               | One vector can't hold long sequences
Vanishing/exploding gradients   | Long-range learning is unstable
No direct access                | A token can't directly attend to any other

How Transformers Fix All of This

The Transformer was designed to remove these exact bottlenecks:

  • Parallel processing: during training it processes the whole sequence at once, not token by token, so training is dramatically faster and GPU-friendly.
  • Attention: it lets any token directly look at any other token, no matter how far apart, with no fading and no single-vector bottleneck.
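To illustrate what "directly look at" means, here is a minimal sketch of scaled dot-product attention for a single query. The key, value, and query vectors are hand-picked for the example; in a real Transformer they come from learned projections, and multiple heads and positional information are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# 50 positions. Index 1 (the second token) holds the content the query needs.
keys = [[0.0, 0.0] for _ in range(50)]
keys[1] = [4.0, 4.0]
values = [[float(i)] for i in range(50)]  # each value simply records its position
query = [4.0, 4.0]                        # stands in for the 50th token

weights, output = attend(query, keys, values)
print(weights.index(max(weights)), output)
```

The query scores every position in a single step, so the match at index 1 receives almost all of the weight even though it sits 48 positions away; nothing has to pass through the tokens in between.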

This is why the paper that introduced the Transformer was titled "Attention Is All You Need": attention replaced recurrence entirely, addressing the problems above in one stroke. That mechanism is what the rest of this topic explores.

Summary

  • RNNs/LSTMs gave networks memory but had deep limitations on long sequences.
  • They process sequentially (no parallelism → slow) and let long-range context fade.
  • They compress everything into a fixed-size hidden state (a bottleneck) and suffer vanishing/exploding gradients.
  • A token can't directly access any other token, only a faded echo through the hidden state.
  • Transformers fix all of this with parallel processing and attention, the foundation of modern Generative AI.