Hi, Code with Africa.

We do **backprop** to propagate the total loss backward through the NN (Neural Network) in order to identify **how much of the loss** each node is responsible for. It also computes **the derivatives of the cost** w.r.t. the parameters that were used in the forward pass. The main tool here is the ‘**chain rule of calculus**’.
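For intuition, here is a minimal, self-contained sketch of the chain rule on a single sigmoid neuron with a squared loss. The values and variable names are made up for illustration, not taken from the notebook:

```python
import numpy as np

# Toy setup: z = w*x + b, a = sigmoid(z), loss L = (a - y)^2.
# The chain rule composes the local derivatives:
#   dL/dw = dL/da * da/dz * dz/dw

x, y = 2.0, 1.0           # one input and its target (illustrative values)
w, b = 0.5, 0.1           # current parameters

z = w * x + b             # forward: linear step
a = 1 / (1 + np.exp(-z))  # forward: sigmoid activation

dL_da = 2 * (a - y)       # derivative of the squared loss w.r.t. a
da_dz = a * (1 - a)       # derivative of the sigmoid w.r.t. z
dz_dw = x                 # derivative of the linear step w.r.t. w

dL_dw = dL_da * da_dz * dz_dw  # chain rule: multiply the local derivatives
print(dL_dw)
```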

In the instructions given in the notebook, it is clearly described how you build a neural network:

**Reminder**: The general methodology to build a Neural Network is to:

- Define the neural network structure (# of input units, # of hidden units, etc.).
- Initialize the model’s parameters
- Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)
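Put together, those steps might look like the following self-contained sketch for a one-hidden-layer network with a tanh hidden layer and a sigmoid output. It assumes `X` has shape `(n_x, m)` and `Y` has shape `(1, m)`; the hyperparameters and the tanh choice are assumptions for illustration, not necessarily what your notebook uses:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, n_h=4, num_iterations=10000, learning_rate=1.2):
    # Step 1: define the structure from the data shapes.
    n_x, n_y = X.shape[0], Y.shape[0]
    m = X.shape[1]
    # Step 2: initialize parameters (small random weights, zero biases).
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * 0.01
    b2 = np.zeros((n_y, 1))
    # Step 3: the loop.
    for i in range(num_iterations):
        # Forward propagation.
        Z1 = W1 @ X + b1
        A1 = np.tanh(Z1)
        Z2 = W2 @ A1 + b2
        A2 = sigmoid(Z2)
        # Compute the loss (cross-entropy cost J).
        cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
        # Backward propagation to get the gradients.
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1**2)   # tanh'(Z1) = 1 - A1^2
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # Update parameters (gradient descent).
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return W1, b1, W2, b2
```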

Now, to your query on **L**: it is the **per-example loss**, i.e., the value of the loss function for each individual training example (you can think of it as a vector of values across the examples). **J** is the average of those values of L across the m samples used to train the model. That is where the factor of 1/m gets its significance in the given formula:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\log\!\left(a^{[2](i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - a^{[2](i)}\right) \right)$$
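In code, computing J as the 1/m average of the per-example losses L might look like this sketch, assuming `A2` is the output-layer activation of shape `(1, m)` and `Y` holds the labels:

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost J: the average of the per-example losses L."""
    m = Y.shape[1]                                    # number of examples
    # Per-example loss L, a (1, m) row of values:
    L = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    J = np.sum(L) / m                                 # the 1/m average
    return float(J)
```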

To your second query on why we start with the hidden layers: backprop traverses the **same layers as the forward pass, but in reverse order**, starting from the output and working back through the hidden layers. During the forward pass, each layer stores a cache as (linear_cache, activation_cache), and the backward pass reuses those caches for its computations. In that sense, backprop stands in opposition to forward prop.
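To make the cache idea concrete, here is a hedged sketch of how a forward step could store (linear_cache, activation_cache) for later reuse; the exact signatures in your assignment may differ:

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation):
    """One forward step that saves everything backprop will need later."""
    Z = W @ A_prev + b
    linear_cache = (A_prev, W, b)         # needed later for dW, db, dA_prev
    if activation == "sigmoid":
        A = 1 / (1 + np.exp(-Z))
    elif activation == "relu":
        A = np.maximum(0, Z)
    activation_cache = Z                  # needed later to differentiate g(Z)
    cache = (linear_cache, activation_cache)
    return A, cache
```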

Backprop is the essence of neural network training: it fine-tunes the weights of the network based on the error, which is also termed the **loss**.

Check this thread, which discusses the functioning of backprop in detail. It gives a very clear picture of how it is implemented for *linear_activation_backward*, which then calls *linear_backward*, *relu_backward*, and *sigmoid_backward*.
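As a rough illustration of that call chain, here is a simplified sketch of those four functions, consuming the (linear_cache, activation_cache) pairs saved during the forward pass; again, the exact signatures in your assignment may differ:

```python
import numpy as np

def relu_backward(dA, activation_cache):
    Z = activation_cache
    return dA * (Z > 0)                   # ReLU derivative: 1 where Z > 0

def sigmoid_backward(dA, activation_cache):
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)               # sigmoid derivative: s * (1 - s)

def linear_backward(dZ, linear_cache):
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m                # gradient w.r.t. the weights
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                    # gradient passed to the previous layer
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    else:                                 # "sigmoid"
        dZ = sigmoid_backward(dA, activation_cache)
    return linear_backward(dZ, linear_cache)
```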