Activation Functions

Inside every neuron of a neural network, two things happen: the inputs are combined into a weighted sum, and that sum is passed through an activation function. The activation function determines the neuron's final output and, more importantly, gives the network the ability to learn complex, non-linear patterns. Without activation functions, even a hundred-layer network could represent nothing more than a single linear function of its inputs.

💡 In one line: An activation function decides a neuron's output and adds the non-linearity that lets neural networks learn complex patterns.

Why Do We Need Activation Functions?

Recall how a neuron works:

output = f( w₁x₁ + w₂x₂ + … + b )

The part inside the brackets is just a linear combination of the inputs. If f did nothing (or was itself linear), stacking many layers would still only produce a linear result, because a composition of linear functions is itself linear. The whole deep network would collapse into a single linear model, unable to handle images, language, or any real-world complexity.

The activation function f breaks this limitation by adding non-linearity. This is what lets networks bend, curve, and model the rich patterns found in real data.
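The collapse described above can be checked with a toy example. This is a minimal sketch, assuming one-dimensional layers with hand-picked weights; the names `linear`, `stacked`, `collapsed`, and `stacked_with_relu` are made up for illustration.

```python
# Hypothetical 1-D example: each "layer" is y = w * x + b with no activation.

def linear(w, b):
    """Return a layer that computes w * x + b."""
    return lambda x: w * x + b

layer1 = linear(2.0, 1.0)    # y = 2x + 1
layer2 = linear(-3.0, 4.0)   # y = -3x + 4

def stacked(x):
    """Two linear layers applied one after the other, no activation."""
    return layer2(layer1(x))

# Algebraically: -3 * (2x + 1) + 4 = -6x + 1, which is a single linear layer.
collapsed = linear(-6.0, 1.0)

def relu(x):
    return max(0.0, x)

def stacked_with_relu(x):
    """The same two layers, with a ReLU between them."""
    return layer2(relu(layer1(x)))

for x in (-2.0, 0.0, 3.0):
    print(x, stacked(x), collapsed(x), stacked_with_relu(x))
```

For every x, `stacked` and `collapsed` agree, so the second layer added nothing. With the ReLU in between, the outputs differ for inputs where the first layer's sum is negative, which is the non-linearity at work.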

Common Activation Functions

1. Sigmoid

f(x) = 1 / (1 + e⁻ˣ)

Squashes any input into a value between 0 and 1. This makes it useful for outputs that represent a probability or a yes/no decision.

  • Pros: smooth, easy to interpret as a probability.
  • Cons: suffers from the vanishing gradient problem, and its output isn't zero-centred, which can slow learning.
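As a rough sketch, sigmoid and its slope can be written in plain Python as follows. The function names are illustrative, and the branch on the sign of x is one common way to avoid overflow in `math.exp`, not the only one.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow in exp() for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Slope of sigmoid: sigmoid(x) * (1 - sigmoid(x)), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25
```

The slope never exceeds 0.25 and shrinks quickly as x moves away from 0, which is where the vanishing gradient weakness comes from.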

2. Tanh (Hyperbolic Tangent)

f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Similar S-shape to sigmoid, but outputs range from −1 to 1. Because it's zero-centred, it usually works better than sigmoid in hidden layers.

  • Pros: zero-centred, stronger gradients than sigmoid.
  • Cons: still suffers from vanishing gradients at the extremes.
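A small sketch of tanh, assuming plain Python: `tanh_from_formula` is a direct translation of the formula above (illustrative only, since it overflows for large |x|), while `math.tanh` from the standard library is the safer choice in practice.

```python
import math

def tanh_from_formula(x):
    """Direct translation of the formula; fine for moderate x only."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    """Slope of tanh: 1 - tanh(x)^2, at most 1.0 (at x = 0)."""
    return 1.0 - math.tanh(x) ** 2

print(tanh_from_formula(0.5), math.tanh(0.5))  # the two values should agree closely
print(tanh_grad(0.0), tanh_grad(5.0))          # slope is 1.0 at the centre, tiny at the extreme
```

The maximum slope of 1.0, compared with 0.25 for sigmoid, is what "stronger gradients" refers to.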

3. ReLU (Rectified Linear Unit)

f(x) = max(0, x)

Outputs the input directly if it's positive, and 0 otherwise. ReLU is the most widely used activation in hidden layers because it's simple and fast.

  • Pros: very fast to compute, avoids vanishing gradients for positive values, encourages sparse (efficient) activations.
  • Cons: the "dying ReLU" problem. A neuron whose weighted sum is negative for every input always outputs 0 and receives a zero gradient, so its weights stop updating and it stops learning.
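The behaviour above can be sketched in a few lines. The list `weighted_sums` is a made-up example of a neuron whose weighted sum is negative for every sample it sees.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp everything else to 0."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (the value at x = 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron that only ever receives negative weighted sums.
weighted_sums = [-2.5, -0.3, -1.1]
outputs = [relu(z) for z in weighted_sums]
grads = [relu_grad(z) for z in weighted_sums]

print(outputs)  # [0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0]
```

Because every gradient is 0, no weight update reaches this neuron, which is the dying ReLU situation.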

4. Leaky ReLU

f(x) = x if x > 0, else 0.01x

A small fix for ReLU: instead of a flat 0 for negative inputs, it allows a small slope. This keeps neurons "alive" and learning.
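A minimal sketch of Leaky ReLU, assuming the slope 0.01 from the formula above is exposed as a parameter (called `alpha` here, a name chosen for illustration):

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(5.0), leaky_relu(-2.0))
print(leaky_relu_grad(5.0), leaky_relu_grad(-2.0))
```

The slope for negative inputs is small but non-zero, so a neuron stuck on the negative side still receives some gradient.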

5. Softmax

Used in the output layer for multi-class classification. It converts a set of raw scores into probabilities that sum to 1, so the network can say "70% cat, 20% dog, 10% rabbit."
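Softmax is usually written as f(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ). The sketch below is one way to compute it in plain Python. Subtracting the largest score first is a common numerical-stability trick that does not change the result, and the three scores are invented for illustration.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                             # subtract the max to keep exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the classes cat, dog, rabbit.
probs = softmax([2.0, 0.75, 0.05])
print(probs, sum(probs))
```

With these made-up scores the probabilities come out close to 0.70, 0.20, and 0.10.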

Comparison Table

Function    | Range              | Best used in                  | Key weakness
------------+--------------------+-------------------------------+----------------------
Sigmoid     | 0 to 1             | Binary output layer           | Vanishing gradient
Tanh        | −1 to 1            | Hidden layers, RNNs           | Vanishing gradient
ReLU        | 0 to ∞             | Hidden layers (default)       | Dying ReLU
Leaky ReLU  | −∞ to ∞            | Hidden layers (fix for ReLU)  | Extra setting to tune
Softmax     | 0 to 1 (sums to 1) | Multi-class output layer      | Output layer only

Which One Should You Use?

A simple practical guide:

  • Hidden layers → start with ReLU. If you hit dying-ReLU issues, switch to Leaky ReLU.
  • Binary (yes/no) output → Sigmoid.
  • Multi-class output → Softmax.
  • Sequence models (RNNs) → Tanh is still common.

📌 Rule of thumb: ReLU for hidden layers, Sigmoid or Softmax for the output layer depending on the task.
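To make the rule of thumb concrete, here is a rough sketch of a tiny network for a yes/no task: ReLU in the hidden layer, Sigmoid at the output. The helper `dense`, the function `predict`, and all weights are hypothetical and hand-picked, not trained values.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum + bias) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical hand-picked parameters: 2 inputs -> 2 hidden neurons -> 1 output.
hidden_w = [[0.5, -1.0], [1.5, 0.25]]
hidden_b = [0.0, -0.5]
out_w = [[1.0, -2.0]]
out_b = [0.1]

def predict(x):
    """Forward pass: ReLU hidden layer, then a single sigmoid output."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]

print(predict([1.0, 2.0]))  # a value between 0 and 1, readable as a probability
```

For a multi-class task, the last layer would have one neuron per class and softmax would be applied to the whole list of output scores instead of sigmoid to each one.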

The Vanishing Gradient Problem

Sigmoid and Tanh flatten out at their extremes: for large positive or large negative inputs, their slope becomes almost zero. Backpropagation multiplies these slopes together layer by layer, so the gradient that reaches the early layers of a deep network shrinks towards zero. The weight updates there become tiny, and those layers learn extremely slowly or stop learning altogether. This is called the vanishing gradient problem.

ReLU largely avoids this for positive inputs, which is a major reason it became the default choice for deep networks.
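The shrinking can be illustrated with a deliberately simplified calculation. This sketch assumes 10 layers, ignores the weights entirely, and gives every layer the same local slope; `chained_slope` is a made-up helper name.

```python
import math

def sigmoid_grad(x):
    """Slope of sigmoid at x; its largest possible value is 0.25, at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def chained_slope(local_slope, depth):
    """Product of `depth` identical local slopes, as the chain rule would multiply them."""
    return local_slope ** depth

# Simplifying assumption: weights are ignored and every layer has the same slope.
best_case_sigmoid = chained_slope(sigmoid_grad(0.0), 10)  # 0.25 ** 10
relu_positive = chained_slope(1.0, 10)                    # 1.0 ** 10

print(best_case_sigmoid)  # below one millionth, even in sigmoid's best case
print(relu_positive)      # 1.0
```

Even with every sigmoid at its steepest point, ten layers cut the signal to less than a millionth of its size, while ReLU's slope of 1 for positive inputs leaves it unchanged.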

Summary

  • An activation function decides a neuron's output and adds the non-linearity that lets networks learn complex patterns.
  • Without it, a deep network would collapse into a simple linear model.
  • Sigmoid (0→1) and Tanh (−1→1) are smooth S-curves but suffer from vanishing gradients.
  • ReLU is the fast, default choice for hidden layers, with Leaky ReLU fixing its "dying" problem.
  • Softmax turns raw output scores into probabilities for multi-class classification.
  • In practice: ReLU in hidden layers, Sigmoid or Softmax at the output.