Back Prop question

How is back prop used to determine the optimal values for weight and bias parameters?

What is the algorithm?

Hi @ai_is_cool,
Backpropagation computes the gradients of a loss function with respect to each parameter (weights and biases), which are then used to update those parameters in a way that minimizes the loss.
Using the chain rule, the partial derivatives of the loss with respect to each parameter are computed. These derivatives (gradients) tell us how to change each parameter to reduce the loss. Each parameter is updated by subtracting a small fraction (learning rate, \alpha) of its gradient.
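For concreteness, here is a minimal sketch of that update step in NumPy, assuming the gradients dw and db have already been produced by backprop (all values here are made up purely for illustration):

```python
import numpy as np

# Minimal gradient-descent update, assuming the gradients dw, db have
# already been computed by backpropagation (values are hypothetical).
alpha = 0.01                   # learning rate
w = np.array([0.5, -1.2])      # current weights (made up)
b = 0.3                        # current bias (made up)
dw = np.array([0.1, -0.05])    # dJ/dw from backprop (made up)
db = 0.02                      # dJ/db from backprop (made up)

w = w - alpha * dw             # step against the gradient
b = b - alpha * db
```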

Don’t you mean the gradients of the cost function J(w, b)?

So starting with values for w^{[n]}, b^{[n]}, X^{(i)} and y^{(i)}; 1 \le n \le N

Step 1. J(w, b) is computed by forward prop, and its gradients with respect to w_j and b are computed by back prop - correct?

Step 2. The gradient descent algorithm can then be used to update w_j and b using estimates of \frac{\partial J(w, b)}{\partial w_j} and \frac{\partial J(w, b)}{\partial b}.

Then the first step above can be repeated, and gradient descent applied again.

Optimal values for the weight and bias parameters can then be determined by repeating the above two steps for a given number of epochs - 1. Correct?

Or am I wrong?

How are the parameters updated to optimal values for each layer in a 3 layer network like the MNIST handwritten digit example?

First, let’s agree on the notation. Let’s say we have a neural network with L layers, each layer has a weight matrix W^{[l]} and a bias vector b^{[l]}, 1 \le l \le L. We denote matrices with capital letters and vectors with lowercase letters. Each hidden layer l < L uses a non-linearity g (it can be ReLU, sigmoid, tanh, etc.). The inputs of the network are a vector x^{(i)} and a scalar y^{(i)}; X is a full / mini batch of training examples. The cost function J depends on W^{[l]}, b^{[l]} \forall l. We use {\mathcal L^{(i)}} to denote the loss for the i-th training example.
You are right, in the first step we compute J by forward prop.
Next we compute {\partial J \over \partial W^{[l]}} and {\partial J \over \partial b^{[l]}} for l = L, L-1, \dots, 1. The parameter update is part of the optimization algorithm, such as SGD, Adam, etc. All of the above steps are repeated for a fixed number of epochs, or until we reach a target value of the loss / accuracy.
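To make the loop concrete, here is a runnable toy sketch using the simplest possible case, logistic regression on synthetic data, rather than a deep network; the data, learning rate and epoch count are arbitrary choices of mine:

```python
import numpy as np

# Toy end-to-end loop: logistic regression (a 1-layer "network")
# trained with the forward -> backward -> update cycle described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                             # 100 examples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # synthetic labels

w, b, alpha = np.zeros(3), 0.0, 0.1
for epoch in range(200):
    z = X @ w + b
    a = 1.0 / (1.0 + np.exp(-z))           # forward prop: predictions
    J = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
    dz = (a - y) / len(y)                  # back prop for sigmoid + cross-entropy
    dw, db = X.T @ dz, dz.sum()            # dJ/dw, dJ/db
    w -= alpha * dw                        # gradient-descent update
    b -= alpha * db
```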

Ok, notation agreed.

So we compute the gradients for layer l = L first, then the previous layer l = L - 1?

How are the parameters updated to optimal values for each layer in a 3 layer network like the MNIST handwritten digit example?

Excellent question! I am going to address it tomorrow because it is late in my timezone.

Ok, no problem.

Thanks.

So we compute the gradients for layer l = L first, then the previous layer l = L − 1?

Yes, we begin at the last layer and work back to the first one. This is why it’s called backpropagation.

Ok, but when computing the gradients for the last layer, we need the output a^{[L - 1]} from the previous layer - yes? And that output a^{[L - 1]} is from forward prop - yes?

Let a^{[0]} = x^{(i)}, x^{(i)} \in {\mathbb R^{1 \times n}}, and {\bar y}^{(i)} \in \{0, 1\}^{1 \times N} is the one-hot target vector for N classes.
The multi-layer neural network forward pass can be written as
z^{[l]} = a^{[l - 1]}W^{[l]} + b^{[l]}, a^{[l]} = g(z^{[l]}), for l = 1, \dots, L - 1
z^{[L]} = a^{[L - 1]}W^{[L]} + b^{[L]}
\displaystyle a^{[L]} = {e^{z^{[L]}} \over \sum_{j = 1}^ N e^{z^{[L]}}_j} = {\hat y}^{(i)}, where e^{z^{[L]}} \in \mathbb R^{1 \times N} is the elementwise exponential
\displaystyle {\mathcal L}^{(i)} = - {\bar y}^{(i)} \cdot \log {\hat y}^{(i)}
\displaystyle J = {1 \over m} \sum_{i = 1}^m {\mathcal L}^{(i)}.
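As an illustration only, here is how this forward pass might look in NumPy for a single example, assuming ReLU hidden activations and made-up MNIST-style layer sizes (784 -> 128 -> 64 -> 10); all names and shapes here are my own choices:

```python
import numpy as np

# Forward pass matching the equations above (row-vector convention:
# z = a_prev @ W + b). Layer sizes are an illustrative choice for a
# 3-layer MNIST-style network: 784 -> 128 -> 64 -> 10.
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
Ws = [rng.normal(0, 0.01, (sizes[l], sizes[l + 1])) for l in range(3)]
bs = [np.zeros((1, sizes[l + 1])) for l in range(3)]

def forward(x):
    a = x                                  # a^[0] = x, shape (1, 784)
    for l in range(2):                     # hidden layers l = 1, 2
        a = np.maximum(a @ Ws[l] + bs[l], 0)   # a^[l] = ReLU(z^[l])
    z = a @ Ws[2] + bs[2]                  # output logits z^[L]
    e = np.exp(z - z.max())                # numerically stable softmax
    return e / e.sum()                     # a^[L] = y_hat

x = rng.normal(size=(1, 784))              # a fake "image"
y_hat = forward(x)
loss = -np.log(y_hat[0, 3])                # cross-entropy if the true class were 3
```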
The derivative of the loss with respect to z^{[L]} is:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)} \quad \in \mathbb{R}^{1 \times N}.

For l \le L - 1 we have \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l + 1]}} {W^{[l+1]}}^\top \right) \circ g'(z^{[l]}), where \circ denotes elementwise (Hadamard) product, and g' is the derivative of the activation function at layer l.

Then the gradients are:

\frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} \quad \in \mathbb{R}^{n_{l-1} \times n_l},

\frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} =\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} \quad \in \mathbb{R}^{1 \times n_l}.

For the cost J we have \frac{\partial J}{\partial W^{[l]}} = {1 \over m} \sum_{i = 1}^m \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}}, and \frac{\partial J}{\partial b^{[l]}} = {1 \over m} \sum_{i = 1}^m \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}}.
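Here is a minimal per-example sketch of these backward-pass formulas in NumPy, with made-up layer sizes (4 -> 5 -> 3) and a ReLU hidden layer; averaging the per-example gradients over the m examples would give the gradients of J:

```python
import numpy as np

# Backward pass implementing the gradient formulas above for a single
# example; sizes (4 -> 5 -> 3 classes) are illustrative only.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 0.1, (4, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(0, 0.1, (5, 3)), np.zeros((1, 3))
x = rng.normal(size=(1, 4))
y_bar = np.array([[0.0, 1.0, 0.0]])        # one-hot target

# forward
z1 = x @ W1 + b1
a1 = np.maximum(z1, 0)                     # ReLU
z2 = a1 @ W2 + b2
e = np.exp(z2 - z2.max())
y_hat = e / e.sum()                        # softmax

# backward, from layer L = 2 down to layer 1
dz2 = y_hat - y_bar                        # dL/dz^[L] = y_hat - y_bar
dW2 = a1.T @ dz2                           # dL/dW^[2] = a^[1]^T dz^[2]
db2 = dz2                                  # dL/db^[2] = dz^[2]
dz1 = (dz2 @ W2.T) * (z1 > 0)              # (dz^[2] W^[2]^T) ∘ ReLU'(z^[1])
dW1 = x.T @ dz1                            # dL/dW^[1]
db1 = dz1                                  # dL/db^[1]
```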

Does this math assume one unit per layer?

Can you take me through your working for how you got this?

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)} \quad \in \mathbb{R}^{1 \times N}

Here is a thread which covers that in the context of the way things are presented in DLS.

Note that MLS does not cover the details of back prop, but DLS does. One strategy here would be to “hold that thought” and then take DLS when you finish MLS.

@paulinpaloalto This thread that you cited applies to single-valued variables and not vectors. It also uses the sigmoid activation function and not the softmax activation function so it does not answer my question.

Here we assume that layer l has n_l units.

How does that tell me how many units, n_1, are in layer 1, for example?

Actually I suppose the value held in this variable is the number of units in layer 1.

It is up to us to decide how many units are in each layer.


Can you take me through your working for how you got this?

For now you can check this material, which discusses your question in great detail. I’ll walk through the derivation later in this thread as well.

The thread I linked applies to vectors as well: all the operations from z in the last layer to the cost are “elementwise”. That’s a good point about sigmoid versus softmax. It turns out the math works out the same as with sigmoid, but the derivation is a bit more complicated. @conscell has provided a link to the CS229 material. Here’s another derivation from StackExchange.


Taking into account the property of the one-hot vector (it selects the true-class entry), we expand the log of the softmax using the identity \log(a / b) = \log a - \log b:

{\mathcal L^{(i)}} = - \log {\hat y}^{(i)}_{y^{(i)}} = -\log {e^{z^{[L]}_{y^{(i)}}} \over \sum_{j = 1}^N e^{z^{[L]}}_j} = \log \sum_{j = 1}^N e^{z^{[L]}}_j - z^{[L]}_{y^{(i)}}.

Take the derivative w.r.t. each logit z^{[L]}_k:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}_k} = {\sum_{r = 1}^N {\partial \over \partial z^{[L]}_k}e^{z^{[L]}}_r \over \sum_{j = 1}^N e^{z^{[L]}}_j} - 𝟙\{k=y^{(i)}\} ={e^{z^{[L]}}_k \over \sum_{j = 1}^N e^{z^{[L]}}_j} - {\bar y}^{(i)}_k = {\hat y}^{(i)}_k - {\bar y}^{(i)}_k.

Thus, in vector notation:

\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L]}} = \hat y^{(i)} - \bar y^{(i)}.
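If it helps, this result is easy to verify numerically with a central finite-difference check; the logits and class index below are arbitrary:

```python
import numpy as np

# Numerical check of dL/dz = y_hat - y_bar for softmax + cross-entropy.
z = np.array([0.5, -1.0, 2.0])             # arbitrary logits, N = 3 classes
y_idx = 2                                  # arbitrary true class index

def loss(z):
    # log-sum-exp of the logits minus the true-class logit (as derived above)
    return np.log(np.exp(z).sum()) - z[y_idx]

y_hat = np.exp(z) / np.exp(z).sum()
y_bar = np.eye(3)[y_idx]
analytic = y_hat - y_bar                   # the closed-form gradient

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k]) - loss(z - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric))      # True
```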