Activation Functions
Inside every neuron of a neural network, two things happen: the inputs are combined into a weighted sum, and that sum is passed through an activation function. The activation function decides the neuron's final output and, more importantly, gives the network the power to learn complex, non-linear patterns. Without activation functions, even a hundred-layer network could represent nothing more than a linear function of its inputs, like a single straight line.
💡 In one line: An activation function decides a neuron's output and adds the non-linearity that lets neural networks learn complex patterns.
Why Do We Need Activation Functions?
Recall how a neuron works:
output = f( w₁x₁ + w₂x₂ + … + b )
The part inside the brackets is just a linear combination of inputs. If f did nothing (or was itself linear), then stacking many layers would still only produce a linear result: the whole deep network would collapse into a single straight-line model, unable to handle images, language, or any real-world complexity.
The activation function f breaks this limitation by adding non-linearity. This is what lets networks bend, curve, and model the rich patterns found in real data.
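The collapse described above can be checked with a small sketch. The weights and biases here are hypothetical values picked for illustration: two stacked layers without an activation behave exactly like one linear layer, while putting a ReLU between them does not.

```python
# Two "layers" with no activation function.
# All weights and biases are hypothetical values chosen for illustration.
def layer1(x):
    return 2.0 * x + 1.0      # w1 = 2, b1 = 1

def layer2(h):
    return -3.0 * h + 0.5     # w2 = -3, b2 = 0.5

def stacked(x):
    # Layer 2 applied directly to the output of layer 1.
    return layer2(layer1(x))

def collapsed(x):
    # The same function written as ONE linear layer:
    # w = w2 * w1 = -6, b = w2 * b1 + b2 = -2.5
    return -6.0 * x - 2.5

def relu(z):
    return max(0.0, z)

def stacked_with_relu(x):
    # A non-linearity between the layers prevents the collapse.
    return layer2(relu(layer1(x)))

for x in [-2.0, -1.0, 0.0, 0.5, 3.0]:
    print(x, stacked(x), collapsed(x), stacked_with_relu(x))
```

The first two output columns agree for every input, whereas the ReLU version departs from them wherever the first layer's output is negative.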
Common Activation Functions
1. Sigmoid
f(x) = 1 / (1 + e⁻ˣ)
Squashes any input into a value between 0 and 1. This makes it useful for outputs that represent a probability or a yes/no decision.
- Pros: smooth, easy to interpret as a probability.
- Cons: suffers from the vanishing gradient problem, and its output isn't zero-centred, which can slow learning.
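As a minimal sketch, sigmoid and its slope can be written with only the standard library. The two-branch form is an implementation choice made here, not something the formula requires: it avoids overflow in `exp()` for large negative inputs.

```python
import math

def sigmoid(x):
    # Two algebraically equivalent forms; picking by sign keeps the
    # argument of exp() non-positive so it cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), largest at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))    # midpoint and steepest slope
print(sigmoid(10.0), sigmoid_grad(10.0))  # output near 1, slope near 0
```

The slope never exceeds 0.25 and shrinks quickly away from zero, which is the root of the vanishing gradient weakness listed above.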
2. Tanh (Hyperbolic Tangent)
f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
Similar S-shape to sigmoid, but outputs range from −1 to 1. Because it's zero-centred, it usually works better than sigmoid in hidden layers.
- Pros: zero-centred, stronger gradients than sigmoid.
- Cons: still suffers from vanishing gradients at the extremes.
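A short sketch of Tanh, assuming nothing beyond Python's standard library. `tanh_formula` is a hypothetical helper that follows the formula above literally; in practice the built-in `math.tanh` is the safer choice because it does not overflow for large inputs.

```python
import math

def tanh_formula(x):
    # Direct translation of (e^x - e^-x) / (e^x + e^-x).
    # Fine for moderate x; math.tanh is preferable in real code.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, largest at x = 0.
    t = math.tanh(x)
    return 1.0 - t * t

print(tanh_formula(0.7), math.tanh(0.7))  # the two agree
print(tanh_grad(0.0))                     # steepest slope
print(tanh_grad(5.0))                     # slope near 0 at the extreme
```

The peak slope of 1 (against 0.25 for sigmoid) is what "stronger gradients" refers to, while the tiny slope at large inputs shows why the vanishing gradient issue remains.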
3. ReLU (Rectified Linear Unit)
f(x) = max(0, x)
Outputs the input directly if it's positive, and 0 otherwise. ReLU is the most widely used activation in hidden layers because it's simple and fast.
- Pros: very fast to compute, avoids vanishing gradients for positive values, encourages sparse (efficient) activations.
- Cons: the "dying ReLU" problem, where a neuron whose weighted sum is negative for every input always outputs 0, receives a zero gradient, and so stops learning.
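The following sketch illustrates both ReLU and the dying-ReLU situation. The weight, bias, and inputs are hypothetical values chosen so that the neuron's weighted sum is negative for every input.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise.
    # The value at exactly 0 is a convention; 0 is used here.
    return 1.0 if x > 0 else 0.0

# Hypothetical "dead" neuron: the large negative bias keeps the
# weighted sum below zero for every input in this small dataset.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 5.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
# Gradient of the output with respect to w is relu_grad(z) * x.
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print(outputs)   # every output is 0
print(grads_w)   # every gradient is 0, so w never gets updated
```

Because every gradient is zero, gradient descent has no signal to move `w` or `b` back into a useful range.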
4. Leaky ReLU
f(x) = x if x > 0, else 0.01x
A small fix for ReLU: instead of a flat 0 for negative inputs, it allows a small slope (0.01 here; the exact value is a setting you can tune). Because the gradient is never exactly zero, neurons stay "alive" and keep learning.
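A minimal sketch of Leaky ReLU follows. The parameter name `alpha` is an assumption made for this example; its default of 0.01 matches the slope in the formula above.

```python
def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by alpha.
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The slope is never exactly zero, so weight updates never fully stop.
    return 1.0 if x > 0 else alpha

print(leaky_relu(3.0), leaky_relu_grad(3.0))
print(leaky_relu(-2.0), leaky_relu_grad(-2.0))
```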
5. Softmax
Used in the output layer for multi-class classification. It converts a set of raw scores into probabilities that sum to 1, so the network can say "70% cat, 20% dog, 10% rabbit."
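One common way to compute softmax is sketched below. Subtracting the largest score before exponentiating is a standard stability trick that does not change the result. The raw scores are hypothetical values picked so that the output lands near the cat/dog/rabbit split mentioned above.

```python
import math

def softmax(scores):
    # Shift by the max score so exp() never sees a large positive number.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for (cat, dog, rabbit).
probs = softmax([2.0, 0.75, 0.05])
print([round(p, 2) for p in probs])   # roughly 0.7, 0.2, 0.1
print(sum(probs))                     # sums to 1 (up to rounding)
```

Only the differences between scores matter: adding the same constant to every score leaves the probabilities unchanged.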
Comparison Table
| Function | Range | Best used in | Key weakness |
|---|---|---|---|
| Sigmoid | 0 to 1 | Binary output layer | Vanishing gradient |
| Tanh | −1 to 1 | Hidden layers, RNNs | Vanishing gradient |
| ReLU | 0 to ∞ | Hidden layers (default) | Dying ReLU |
| Leaky ReLU | −∞ to ∞ | Hidden layers (fix for ReLU) | Extra setting to tune |
| Softmax | 0 to 1 (sums to 1) | Multi-class output layer | Output layer only |
Which One Should You Use?
A simple practical guide:
- Hidden layers → start with ReLU. If you hit dying-ReLU issues, switch to Leaky ReLU.
- Binary (yes/no) output → Sigmoid.
- Multi-class output → Softmax.
- Sequence models (RNNs) → Tanh is still common.
📌 Rule of thumb: ReLU for hidden layers, Sigmoid or Softmax for the output layer depending on the task.
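To make the rule of thumb concrete, here is a rough sketch of a forward pass through a tiny network with ReLU in the hidden layer and Sigmoid at the output. The `dense` helper, the layer sizes, and all weights are hypothetical and chosen only for illustration; a real network would learn its weights during training.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    # One fully connected layer: each row of `weights` feeds one neuron,
    # which computes activation(weighted sum + bias).
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
hidden_w = [[0.5, -1.0], [1.5, 0.5], [-1.0, 1.0]]
hidden_b = [0.0, -0.5, 0.25]
out_w = [[1.0, -0.5, 0.75]]
out_b = [0.1]

x = [1.0, 2.0]
hidden = dense(x, hidden_w, hidden_b, relu)      # ReLU in the hidden layer
prob = dense(hidden, out_w, out_b, sigmoid)[0]   # Sigmoid for a yes/no output

print(hidden)   # the first hidden unit is switched off by ReLU
print(prob)     # a value between 0 and 1
```

For a multi-class task, the same structure would apply with more output neurons and softmax applied across them instead of sigmoid on a single one.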
The Vanishing Gradient Problem
Sigmoid and Tanh flatten out at their extremes: for large positive or large negative inputs, their slope becomes almost zero. During backpropagation, the gradient reaching an early layer is a product of the slopes of all the layers after it, so many small slopes multiply into a number that is nearly zero. The weight updates shrink to almost nothing, and the early layers of a deep network learn extremely slowly or stop learning entirely. This is called the vanishing gradient problem.
ReLU largely avoids this for positive inputs, which is a major reason it became the default choice for deep networks.
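The effect can be illustrated with a deliberately simplified sketch. It assumes every layer sees the same input to its activation and ignores the weights entirely, so it isolates only the activation function's contribution to the gradient. `gradient_scale` is a hypothetical helper, not a real backpropagation routine.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(local_grad, z, depth):
    # Product of `depth` identical activation slopes.
    # Weights are ignored (a simplifying assumption for illustration).
    return local_grad(z) ** depth

depth = 10
sigmoid_best_case = gradient_scale(sigmoid_grad, 0.0, depth)  # 0.25 ** 10
sigmoid_saturated = gradient_scale(sigmoid_grad, 5.0, depth)
relu_positive = gradient_scale(relu_grad, 5.0, depth)         # 1.0 ** 10

print(sigmoid_best_case)   # already below one in a million
print(sigmoid_saturated)   # far smaller still
print(relu_positive)       # stays at 1.0
```

Even in sigmoid's best case, where every slope is at its maximum of 0.25, ten layers scale the gradient by less than one millionth, while ReLU on positive inputs passes it through unchanged.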
Summary
- An activation function decides a neuron's output and adds the non-linearity that lets networks learn complex patterns.
- Without it, a deep network would collapse into a simple linear model.
- Sigmoid (0→1) and Tanh (−1→1) are smooth S-curves but suffer from vanishing gradients.
- ReLU is the fast, default choice for hidden layers, with Leaky ReLU fixing its "dying" problem.
- Softmax turns outputs into probabilities for multi-class classification.
- In practice: ReLU in hidden layers, Sigmoid or Softmax at the output.