Beginner
What is an activation function?
A neural network is a stack of linear transformations: each layer multiplies inputs by weights and adds a bias. Without something extra, stacking linear layers is equivalent to one single linear layer, no matter how deep the network is. An activation function sits after each linear step and introduces non-linearity, allowing the network to learn curved, complex boundaries instead of only straight lines.
Input -> [Linear: z = W·x + b] -> [Activation: a = f(z)] -> Next layer

Without activation: deep_network(x) = W3·(W2·(W1·x)) = W_combined·x  (still linear)
With activation:    deep_network(x) = f(W3·f(W2·f(W1·x)))            (non-linear)
The choice of activation function affects how fast the network learns, whether gradients vanish during training, and what range of values flows between layers.
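A quick NumPy sketch (illustrative code, not part of any library) makes the "stacked linear layers collapse" claim concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Three stacked linear layers with no activations...
deep = W3 @ (W2 @ (W1 @ x))
# ...are exactly one linear layer with a pre-multiplied weight matrix.
W_combined = W3 @ W2 @ W1
print(np.allclose(deep, W_combined @ x))  # True
```

No matter how many matrices you stack, the composition is still a single matrix; only a non-linearity between the layers breaks this equivalence.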
The three most used activation functions
1. ReLU (Rectified Linear Unit)
ReLU is the default choice for hidden layers in almost all modern networks.
f(z) = max(0, z)
z < 0 -> output is 0 (neuron is "off")
z >= 0 -> output equals z (neuron passes the signal through)
Shape:
      |    /
      |   /
      |  /
______|_/______
      0
Why it works: computationally cheap (just a comparison), does not saturate for positive inputs so gradients flow easily during training, and produces sparse activations (many neurons output exactly zero) which helps generalization.
Main weakness: the "dying ReLU" problem. If a neuron's input is always negative, its output is always zero, its gradient is always zero, and its weights never update. The neuron is permanently dead.
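A minimal sketch of ReLU and its gradient (helper names are assumptions, not from a library) shows the dying-ReLU mechanism directly:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))       # negative inputs clamp to 0; positive pass through
print(relu_grad(z))  # gradient is 0 on the negative side, 1 on the positive side

# Dying ReLU in miniature: a neuron whose pre-activation is always negative
# has zero output AND zero gradient, so gradient descent never revives it.
always_negative = np.array([-3.0, -1.2, -0.4])
print(relu_grad(always_negative).sum())  # 0.0 -> no weight update, ever
```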
2. Sigmoid
Sigmoid squashes any real number into the range (0, 1), making it natural for binary classification output layers and for gates in LSTMs.
f(z) = 1 / (1 + e^(-z))
z = -6 -> ~0.002 (near zero)
z = 0 -> 0.5 (midpoint)
z = +6 -> ~0.998 (near one)
Shape:
1 |          _______
  |        /
0 |______/__________
         0
Why it works: the output can be interpreted as a probability, which gives a clear intuition for binary decisions.
Main weakness: saturation. For very large or very small inputs, the gradient of sigmoid is nearly zero. During backpropagation, multiplying many near-zero gradients together causes the vanishing gradient problem: earlier layers receive almost no learning signal and stop updating. This is why sigmoid is rarely used in hidden layers of deep networks.
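The saturation effect is easy to see numerically. This sketch (with an assumed `sigmoid_grad` helper) multiplies the per-layer gradient factors the way backprop would:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# Sigmoid's gradient peaks at 0.25 (at z = 0) and collapses at the extremes.
print(sigmoid_grad(0.0))  # 0.25
print(sigmoid_grad(6.0))  # ~0.0025 (saturated)

# Even in the best case, chaining 10 sigmoid layers multiplies ten factors
# of at most 0.25 -- the signal reaching early layers is vanishingly small.
print(0.25 ** 10)  # ~9.5e-07
```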
3. Softmax
Softmax is used in the output layer of multi-class classifiers. It converts a vector of raw scores (logits) into a probability distribution that sums to 1.
For a vector z = [z1, z2, ..., zK]:

    softmax(zi) = e^zi / sum(e^zj for all j)

Example — 3-class classification:

    Logits:        [2.0, 1.0, 0.1]
    Exponents:     [7.39, 2.72, 1.11]    sum = 11.22
    Probabilities: [0.659, 0.242, 0.099] (sum = 1.0)

The class with the highest logit gets the highest probability.

Temperature scaling divides all logits by T before softmax:

    T < 1 -> sharper distribution (more confident)
    T > 1 -> flatter distribution (more uncertain)
Why it works: each output is in (0, 1), they sum to 1, and the largest logit always gets the highest probability. Pairs naturally with cross-entropy loss.
Main weakness: softmax is sensitive to large logit differences (one class dominates). Not used in hidden layers; it is a final output transformation.
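The worked example and temperature scaling above can be reproduced in a few lines (a sketch; `softmax` here is a standalone helper, not a library call):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)                  # ~[0.659, 0.242, 0.099], matching the example
print(p.sum())            # 1.0

# Temperature scaling: divide logits by T before softmax
print(softmax(logits / 0.5))  # T < 1: sharper, more confident
print(softmax(logits / 2.0))  # T > 1: flatter, more uncertain
```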
Quick rule of thumb: use ReLU (or a variant) in all hidden layers, sigmoid for binary output, and softmax for multi-class output.
Backpropagation
The big picture
After a forward pass produces a prediction, the network computes a loss — a number that measures how wrong the prediction was. Backpropagation is the algorithm that answers: how much did each weight contribute to that error? It does this by applying the chain rule of calculus backwards through the network, computing the gradient of the loss with respect to every weight.
Training loop:
1. Forward pass: input -> layers -> prediction -> loss
2. Backward pass: loss -> compute gradient of loss w.r.t. every weight
3. Update: weight = weight - learning_rate * gradient
4. Repeat
The chain rule — the engine of backprop
The chain rule says that if output depends on intermediate values that depend on the input, you multiply the local gradients together.
Network (simplified, one hidden layer):

    z1 = W1·x + b1     (linear step)
    a1 = f(z1)         (activation)
    z2 = W2·a1 + b2    (linear step)
    Loss = L(z2, y)    (loss vs. true label y)

Chain rule for W1:

    dL/dW1 = dL/dz2 · dz2/da1 · da1/dz1 · dz1/dW1

Each factor is a local gradient that the layer computes from its own inputs and outputs. No layer needs to know about layers outside its immediate neighbors.
Step-by-step walkthrough
Consider a tiny network: one input, one hidden neuron with ReLU, one output neuron, MSE loss.
x = 2.0, y_true = 1.0
W1 = 0.5, W2 = 0.3
--- Forward pass ---
z1 = W1 * x = 0.5 * 2.0 = 1.0
a1 = ReLU(z1) = max(0, 1.0) = 1.0
z2 = W2 * a1 = 0.3 * 1.0 = 0.3
Loss = (z2 - y)^2 = (0.3 - 1.0)^2 = 0.49
--- Backward pass (chain rule) ---
dL/dz2 = 2*(z2 - y) = 2*(0.3 - 1.0) = -1.4
dz2/dW2 = a1 = 1.0
dz2/da1 = W2 = 0.3
da1/dz1 = 1 if z1>0 else 0 = 1 (ReLU gradient)
dz1/dW1 = x = 2.0
dL/dW2 = dL/dz2 * dz2/dW2 = -1.4 * 1.0 = -1.4
dL/dW1 = dL/dz2 * dz2/da1 * da1/dz1 * dz1/dW1
= -1.4 * 0.3 * 1 * 2.0 = -0.84
--- Update (learning rate = 0.1) ---
W2 = 0.3 - 0.1 * (-1.4) = 0.44
W1 = 0.5 - 0.1 * (-0.84) = 0.584
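The same walkthrough in plain Python, so each number above can be checked by running it:

```python
# Tiny network from the walkthrough: x -> ReLU(W1*x) -> W2*a1 -> squared error
x, y = 2.0, 1.0
W1, W2, lr = 0.5, 0.3, 0.1

# Forward pass
z1 = W1 * x           # 1.0
a1 = max(0.0, z1)     # ReLU -> 1.0
z2 = W2 * a1          # 0.3
loss = (z2 - y) ** 2  # 0.49

# Backward pass (chain rule)
dL_dz2 = 2 * (z2 - y)                                 # -1.4
dL_dW2 = dL_dz2 * a1                                  # -1.4
dL_dW1 = dL_dz2 * W2 * (1.0 if z1 > 0 else 0.0) * x   # -0.84

# Update
W2 -= lr * dL_dW2  # 0.44
W1 -= lr * dL_dW1  # 0.584
print(W1, W2)
```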
Vanishing and exploding gradients
The chain rule multiplies many gradients together. This creates two common failure modes in deep networks:
Vanishing gradients:
- Each local gradient < 1 -> product shrinks exponentially with depth
- Earlier layers get near-zero updates -> they stop learning
- Caused by: sigmoid/tanh in hidden layers, very deep networks

Exploding gradients:
- Each local gradient > 1 -> product grows exponentially with depth
- Weights blow up to infinity -> training collapses (NaN loss)
- Caused by: large weight initialization, deep RNNs

Solutions:
- Vanishing: use ReLU activations, residual connections, batch normalization
- Exploding: gradient clipping, careful weight initialization (Xavier, He)
Why ReLU helps: its gradient is exactly 1 for all positive inputs (no shrinkage), which lets gradients flow back through many layers without vanishing.
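A one-line experiment shows how fast the product of local gradients moves with depth (the factors 0.9 and 1.1 are illustrative, not from a real network):

```python
import numpy as np

depth = 50  # number of layers the gradient must flow back through
print(np.prod(np.full(depth, 0.9)))  # ~0.005 -> shrinks (vanishing)
print(np.prod(np.full(depth, 1.1)))  # ~117   -> grows (exploding)
print(np.prod(np.full(depth, 1.0)))  # 1.0    -> ReLU's gradient of 1 preserves scale
```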
Advanced
Activation function variants
Leaky ReLU: f(z) = z if z >= 0 else 0.01*z
Fixes dying ReLU: negative inputs still produce a small gradient.

ELU (Exponential Linear Unit): f(z) = z if z >= 0 else alpha*(e^z - 1)
Smoother than ReLU; negative values push mean activations toward zero.

GELU (Gaussian Error Linear Unit): f(z) = z * Phi(z)  (Phi = standard normal CDF)
Used in BERT, GPT, and most modern transformers. A smooth approximation to ReLU with better empirical performance on NLP tasks.

Swish: f(z) = z * sigmoid(z)
Self-gated and smooth; used in EfficientNet.
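Sketch implementations of the variants (note: the GELU below uses the common tanh approximation rather than the exact normal CDF):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z >= 0, z, slope * z)

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # Tanh approximation of z * Phi(z), widely used in transformer code
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1 + np.exp(-z))  # z * sigmoid(z)

z = np.linspace(-3, 3, 7)
print(leaky_relu(z))  # small negative slope instead of a hard zero
print(gelu(z))        # smooth, slightly negative dip below zero
```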
Code: forward and backward pass from scratch
import numpy as np
# Activation functions and their gradients
def relu(z): return np.maximum(0, z)
def relu_grad(z): return (z > 0).astype(float)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def sigmoid_grad(z): s = sigmoid(z); return s * (1 - s)
def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()
# Tiny 2-layer network: input(2) -> hidden(3, ReLU) -> output(1, sigmoid)
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1 # (hidden, input)
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1 # (output, hidden)
b2 = np.zeros(1)
x = np.array([1.0, 0.5])
y_true = np.array([1.0])
# --- Forward pass ---
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = ((a2 - y_true) ** 2).mean()
print(f"Loss: {loss:.4f}")
# --- Backward pass ---
lr = 0.1
dL_da2 = 2 * (a2 - y_true) / y_true.size # MSE gradient
dL_dz2 = dL_da2 * sigmoid_grad(z2) # through sigmoid
dL_dW2 = np.outer(dL_dz2, a1) # weight gradient
dL_db2 = dL_dz2
dL_da1 = W2.T @ dL_dz2 # back through W2
dL_dz1 = dL_da1 * relu_grad(z1) # through ReLU
dL_dW1 = np.outer(dL_dz1, x) # weight gradient
dL_db1 = dL_dz1
# --- Update ---
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1
print("Weights updated.")
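A standard way to validate a hand-derived backward pass is a finite-difference gradient check. The sketch below (assumed names, simplified to two scalar weights) does this for a scalar version of the walkthrough network:

```python
import numpy as np

# Loss of a scalar network: x -> ReLU(W1*x) -> sigmoid(W2*a1) -> squared error
def loss_fn(W1, W2, x=1.0, y=1.0):
    a1 = max(0.0, W1 * x)
    a2 = 1 / (1 + np.exp(-(W2 * a1)))
    return (a2 - y) ** 2

W1, W2, eps = 0.4, 0.3, 1e-6

# Numeric gradient: central difference around W1
numeric = (loss_fn(W1 + eps, W2) - loss_fn(W1 - eps, W2)) / (2 * eps)

# Analytic gradient: the same chain-rule factors as the walkthrough
z1 = W1 * 1.0
a1 = max(0.0, z1)
z2 = W2 * a1
a2 = 1 / (1 + np.exp(-z2))
analytic = 2 * (a2 - 1.0) * a2 * (1 - a2) * W2 * (1.0 if z1 > 0 else 0.0) * 1.0

print(abs(numeric - analytic) < 1e-8)  # True: the derivation checks out
```

If the two numbers disagree, the analytic chain-rule expression (not the network) is usually what is wrong.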
Code: same network with PyTorch autograd
import torch
import torch.nn as nn
# PyTorch computes backprop automatically via autograd
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)
x = torch.tensor([[1.0, 0.5]])
y_true = torch.tensor([[1.0]])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
# Forward
y_pred = model(x)
loss = loss_fn(y_pred, y_true)
print(f"Loss: {loss.item():.4f}")
# Backward (chain rule computed automatically)
optimizer.zero_grad()
loss.backward()
# Inspect a gradient
print("Gradient for first Linear weight:")
print(model[0].weight.grad)
# Update weights
optimizer.step()
Comparison of activation functions
| Function | Output range | Gradient at saturation | Typical use |
|---|---|---|---|
| ReLU | [0, +inf) | 0 for z<0, 1 for z>0 | Hidden layers (default) |
| Leaky ReLU | (-inf, +inf) | 0.01 for z<0, 1 for z>0 | Hidden layers when dying ReLU is a problem |
| GELU | (-inf, +inf) | Smooth, non-zero everywhere | Transformer hidden layers (BERT, GPT) |
| Sigmoid | (0, 1) | ~0 at extremes (vanishing) | Binary classification output, LSTM gates |
| Softmax | (0, 1), sums to 1 | N/A (output layer only) | Multi-class classification output |
To-do list
Learn
- Explain in one sentence why non-linearity is required for neural networks to be useful.
- Write the formula for ReLU, sigmoid, and softmax from memory.
- Describe the dying ReLU problem and one way to fix it.
- Explain the vanishing gradient problem: what causes it and which activation function makes it worse.
- State the chain rule and explain how it connects each weight to the final loss.
Practice
- By hand, compute one forward pass and one backward pass through a two-layer network with specific numbers (like the walkthrough above).
- Plot ReLU, sigmoid, and softmax (for a 3-element vector) in Python to build visual intuition.
- Train a small network on a toy dataset with sigmoid hidden layers, then switch to ReLU and compare the training curves.
- Implement gradient clipping and observe its effect when gradients are large.
Build
- Implement a two-layer neural network in pure NumPy with forward pass, backprop, and gradient descent — no frameworks.
- Reproduce the same network in PyTorch using autograd and confirm the gradients match your manual computation.
- Train a classifier on a non-linearly separable dataset (e.g., XOR or moons) and verify that the network learns the boundary correctly.
- Experiment with He initialization vs. random initialization and observe the effect on early training loss.