Gentle introduction to Neural Networks — Part 2 (Backward Propagation)

Shubham Kanwal
3 min read · Jun 2, 2021

In Part 1 we saw what happens in forward propagation (making a prediction). In this part we will continue with what happens after the loss/cost/error is calculated.

Backpropagation is a method for efficiently calculating the gradients (derivatives) of the cost function with respect to its parameters. Simply put, it finds how much each weight and bias should be tweaked in order to reduce the error.

So if we combine both forward and backward propagation:

The network makes a prediction (forward propagation) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection using the chain rule, until it reaches the input layer. Finally, it performs a Gradient Descent step to tweak all the weights using the gradients (derivatives) it just calculated.

For this explanation we will be using a simple NN:

O1 and O2 are the outputs of the hidden layer; O3 is the predicted value, i.e. ŷ (y-hat).
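To make this network concrete, here is a minimal sketch of it in Python. It is only an illustration of one possible reading of the diagram: I am assuming a single input x, sigmoid activations on the hidden units, a linear output unit, a squared-error loss, and an extra input-to-hidden weight w4 feeding O2 (the article only names w1, w2 and w3); biases are left out to keep the math short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w4, w2, w3):
    """One forward pass through the toy network."""
    O1 = sigmoid(w1 * x)      # hidden unit 1 (input -> O1 via w1)
    O2 = sigmoid(w4 * x)      # hidden unit 2 (input -> O2 via the assumed w4)
    O3 = w2 * O1 + w3 * O2    # output unit, i.e. y_hat (O1 -> O3 via w2, O2 -> O3 via w3)
    return O1, O2, O3

def error(O3, y):
    """Squared error between the prediction and the target."""
    return 0.5 * (O3 - y) ** 2
```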

Let’s break it down into bits:

We start by calculating the gradients of the loss with respect to the weights using the chain rule.

Chain Rule: it’s a way to compute the derivative of a function whose variables are themselves functions of other variables. If C is a scalar-valued function of a scalar z, and z is itself a scalar-valued function of another scalar variable w, then the chain rule states that

dC/dw = (dC/dz) · (dz/dw)
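As a quick sanity check (my own toy example, not from the article), the chain rule can be verified numerically: take C(z) = z² and z(w) = 3w + 1, so dC/dw should equal (dC/dz) · (dz/dw) = 2z · 3.

```python
def C(z):
    return z ** 2        # outer function

def z_of(w):
    return 3 * w + 1     # inner function

w = 2.0
z = z_of(w)
analytic = (2 * z) * 3   # chain rule: dC/dz * dz/dw

eps = 1e-6
numeric = (C(z_of(w + eps)) - C(z_of(w - eps))) / (2 * eps)  # finite difference

print(analytic, numeric)  # both come out to ~42.0
```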

By applying the chain rule, we propagate from the last layer to the first layer and calculate the gradients:

1. Rate of change of error with respect to w2: ∂E/∂w2 = ∂E/∂O3 · ∂O3/∂w2

Similarly, rate of change of error with respect to w3: ∂E/∂w3 = ∂E/∂O3 · ∂O3/∂w3

2. Rate of change of error with respect to w1: ∂E/∂w1 = ∂E/∂O3 · ∂O3/∂O1 · ∂O1/∂w1

This is because E is impacted by O3, O3 in turn is impacted by O1, and O1 in turn is impacted by w1.

Similarly, gradients will be calculated with respect to all the weights and biases.
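Here is how those chain-rule products look in code for the toy network sketched earlier, under the same assumptions (sigmoid hidden units, linear output O3, squared error E = 0.5 · (O3 − y)², and the hypothetical weight w4 feeding O2). Each gradient is just the corresponding chain of local derivatives multiplied together.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backward(x, y, w1, w4, w2, w3):
    """Backward pass: gradients of the error with respect to every weight."""
    # Re-run the forward pass; the intermediate values are needed for the gradients.
    O1 = sigmoid(w1 * x)
    O2 = sigmoid(w4 * x)
    O3 = w2 * O1 + w3 * O2

    dE_dO3 = O3 - y                      # dE/dO3 for E = 0.5*(O3 - y)**2

    # Last layer: dE/dw2 = dE/dO3 * dO3/dw2, and the same pattern for w3.
    dE_dw2 = dE_dO3 * O1
    dE_dw3 = dE_dO3 * O2

    # First layer: dE/dw1 = dE/dO3 * dO3/dO1 * dO1/dw1.
    dO3_dO1 = w2
    dO1_dw1 = O1 * (1 - O1) * x          # sigmoid derivative times the input
    dE_dw1 = dE_dO3 * dO3_dO1 * dO1_dw1

    # Same chain for the assumed second hidden weight w4.
    dO3_dO2 = w3
    dO2_dw4 = O2 * (1 - O2) * x
    dE_dw4 = dE_dO3 * dO3_dO2 * dO2_dw4

    return dE_dw1, dE_dw4, dE_dw2, dE_dw3
```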

Now let’s suppose all the gradients have been calculated with respect to all the weights and biases.

Each weight is then updated as follows (where the learning rate controls the step size):

w_new = w_old − learning_rate · ∂E/∂w_old

The weights and biases are initialized randomly at the start of training.

The forward + backward propagation cycle continues until the error is close to 0 or the loss has reached the global minimum. In each epoch (forward + backward pass), the weights and biases are updated.

To summarize: after the loss is calculated in the forward pass, it’s passed to the optimizer (Gradient Descent, Stochastic GD, etc.), which tries to minimize the loss through backward propagation (calculating the gradients of the error with respect to all weights and biases using the chain rule) and finally updates the weights and biases.
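Putting it all together, here is a minimal training-loop sketch that reuses the forward(), backward() and error() helpers from the sketches above: random initialization, forward propagation, backward propagation, and a vanilla Gradient Descent update, repeated for a number of epochs. The data points and the learning rate are invented purely for illustration.

```python
import random

random.seed(0)
w1, w4, w2, w3 = (random.uniform(-1, 1) for _ in range(4))   # random initialization
lr = 0.1                                                     # learning rate
data = [(0.5, 0.2), (1.0, 0.9), (-0.5, 0.1)]                 # made-up (x, y) pairs

for epoch in range(1000):
    for x, y in data:
        grads = backward(x, y, w1, w4, w2, w3)               # backward propagation
        # Gradient Descent step: w_new = w_old - lr * dE/dw
        w1, w4, w2, w3 = (w - lr * g for w, g in zip((w1, w4, w2, w3), grads))
    if epoch % 200 == 0:
        total = sum(error(forward(x, w1, w4, w2, w3)[2], y) for x, y in data)
        print(f"epoch {epoch}: total error {total:.4f}")
```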
