You are taking the derivative of the loss function w.r.t. a vector quantity e^{z^{[L]}}, not a vector component e^{z^{[L]}}_{j} of that vector - correct?
Of course, if you're familiar with matrix calculus, you can proceed straightforwardly. However, there's an important subtlety: in this context, we're dealing with row vectors, which adds extra complexity. You need to be mindful of both the layout and the order in which you apply the chain rule, since most rules are designed for column vectors. To avoid mistakes, a safer approach is to perform the derivations elementwise first, and then verify that they generalize correctly to the vector form. I believe this is the best strategy, especially if you'd rather not deal with 3D tensors in the next steps.
No, we're talking about the derivative of the loss with respect to the logits z^{[L]}, but in order to make the process more tractable we're doing it elementwise.
Not with respect to the logits but with respect to a vector of logits.
What is the derivative of the vector e^{z^{[L]}} with respect to the vector z^{[L]}?
Earlier you defined the following:
\displaystyle {\mathcal L}^{(i)} = - {\bar y}^{(i)} \cdot \log {\hat y}^{(i)}
However, here you now redefine {\mathcal L}^{(i)} as:
\displaystyle {\mathcal L^{(i)}} = - \log {\hat y}^{(i)}_{y^{(i)}}
So which is correct?
Is it because in the first definition you use one-hot vector values {\bar y}^{(i)} instead of logits z_{k}^{[L]}?
Also, please define y^{(i)}.
Just so I'm clear on the mathematics that you have presented to me, can you list all the variables as scalars and vectors?
What variable is g differentiated with respect to in order to get g'?
What is the derivative of the vector e^{z^{[L]}} with respect to the vector z^{[L]}?
Let z^{[L]} \in \mathbb{R}^{1 \times N} be a row vector of logits, and e^{z^{[L]}} \in \mathbb{R}^{1 \times N} be the elementwise exponential (e^{z^{[L]}})_k = e^{z^{[L]}_k}.
The derivative is the Jacobian matrix
\displaystyle \frac{\partial\, e^{z^{[L]}}}{\partial z^{[L]}} = \mathrm{diag}\left(e^{z^{[L]}}\right),
an N \times N diagonal matrix whose k-th diagonal entry is e^{z^{[L]}_k}, because each component of e^{z^{[L]}} depends only on the corresponding component of z^{[L]}.
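If it helps, here is a small numerical check of that Jacobian (my own illustration, not part of the course material), comparing the diagonal form against finite differences:

```python
import numpy as np

# Numerical check (illustration only): the Jacobian of the elementwise
# exponential e^{z} with respect to z is diag(e^{z}).
z = np.array([0.5, -1.0, 2.0])               # a sample logit vector z^{[L]}
analytic = np.diag(np.exp(z))                # diag(e^{z})

eps = 1e-6
numeric = np.zeros((z.size, z.size))
for k in range(z.size):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[k] += eps
    z_minus[k] -= eps
    # column k holds the central finite difference of e^{z} w.r.t. z_k
    numeric[:, k] = (np.exp(z_plus) - np.exp(z_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```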
You observed two forms of the loss:
\mathcal{L}^{(i)} = -\bar{y}^{(i)} \cdot \log \hat{y}^{(i)}
\mathcal{L}^{(i)} = -\log \hat{y}^{(i)}_{y^{(i)}}
These are equivalent, because \bar{y}^{(i)} \in \{0, 1\}^{1 \times N} is one-hot, so the dot product keeps only the term for the true class:
\displaystyle \bar{y}^{(i)} \cdot \log \hat{y}^{(i)} = \sum_{k=1}^{N} \bar{y}^{(i)}_k \log \hat{y}^{(i)}_k = \log \hat{y}^{(i)}_{y^{(i)}}
The scalar y^{(i)} \in \{1, \dots, N\} is the true class label for the i-th example. It corresponds to the index of the 1 in the one-hot vector \bar{y}^{(i)}, i.e. \bar{y}^{(i)}_k = 1 if k = y^{(i)} and 0 otherwise.
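As a quick numerical illustration of the equivalence (my own, not from the course; note that Python indexing is 0-based while the math above uses 1-based class labels):

```python
import numpy as np

# Illustration: the one-hot dot product picks out the log-probability of the
# true class, so the two loss forms give the same number.
y_hat = np.array([0.1, 0.7, 0.2])          # predicted probabilities \hat{y}^{(i)}
y = 1                                      # true class label y^{(i)} (0-indexed here)
y_bar = np.eye(3)[y]                       # one-hot target \bar{y}^{(i)} = [0, 1, 0]

loss_dot = -np.dot(y_bar, np.log(y_hat))   # -\bar{y}^{(i)} . log \hat{y}^{(i)}
loss_idx = -np.log(y_hat[y])               # -log \hat{y}^{(i)}_{y^{(i)}}
print(np.isclose(loss_dot, loss_idx))      # True
```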
Here is the notation summary (a small shape-check sketch follows the list):
x^{(i)} - Training example (Row vector 1 \times n)
y^{(i)} - True class label (Scalar \in \{1, \dots, N\})
\bar{y}^{(i)} - One-hot target label (Row vector 1 \times N)
a^{[l]} - Activations at layer l (Row vector 1 \times n_l)
z^{[l]} - Pre-activations (Row vector 1 \times n_l)
W^{[l]} - Weight matrix (Matrix n_{l-1} \times n_l)
b^{[l]} - Bias vector (Row vector 1 \times n_l)
g(z) - Activation function (Elementwise function)
g'(z) - Derivative of activation (Elementwise function)
\hat{y}^{(i)} - Predicted probabilities (Row vector 1 \times N)
\mathcal{L}^{(i)} - Loss for example i (Scalar)
J - Cost (Scalar)
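To make the shapes above concrete, here is a minimal forward-pass sketch (my own illustration with made-up layer sizes, assuming a ReLU hidden layer and a softmax output; not code from the course):

```python
import numpy as np

# Shape-check sketch (illustration only; layer sizes are made up):
# a 2-layer network in the row-vector convention above.
n, n_1, N = 4, 5, 3                          # input features, hidden units, classes

x = np.random.randn(1, n)                    # x^{(i)}: 1 x n
W1, b1 = np.random.randn(n, n_1), np.zeros((1, n_1))   # W^{[1]}, b^{[1]}
W2, b2 = np.random.randn(n_1, N), np.zeros((1, N))     # W^{[2]}, b^{[2]}

z1 = x @ W1 + b1                             # z^{[1]}: 1 x n_1
a1 = np.maximum(z1, 0)                       # a^{[1]} = g(z^{[1]}), ReLU here
z2 = a1 @ W2 + b2                            # z^{[2]} = z^{[L]}: 1 x N (logits)
y_hat = np.exp(z2) / np.exp(z2).sum()        # softmax -> \hat{y}^{(i)}: 1 x N

print(z1.shape, a1.shape, z2.shape, y_hat.shape)   # (1, 5) (1, 5) (1, 3) (1, 3)
```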
The derivative g' is the derivative of the activation function g with respect to its input, which is z^{[l]}.
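As a concrete example (my addition, taking the sigmoid as the activation): if g(z) = \frac{1}{1 + e^{-z}} applied elementwise, then
\displaystyle g'(z^{[l]}) = g(z^{[l]}) \circ \left(1 - g(z^{[l]})\right),
again applied elementwise to z^{[l]}.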
Thanks.
Please define N, n_l and n.
N is the number of classes, n is the number of input features, and n_l is the number of units in layer l.
Ok, in Andrew's course he explains back prop by first computing the increase in the cost when the loss is increased by a tiny amount, 0.001. From this, he estimates the gradient of the cost with respect to the loss. He then estimates the gradient of the cost in the previous layer using the output layer gradient estimate and this previous layer's activation function and parameters. This is repeated until he reaches the input layer, where he computes the gradient of the cost with respect to that layer's weight and bias parameters.
However, I see from the mathematics you have presented that you are actually computing the gradient of the loss exactly and then computing the gradient of the cost from it at the end, arriving at a different result from Andrew.
Andrew doesn't present the mathematics that you do.
Andrew is giving the classic intuitive explanation for the concept of partial derivatives. He doesn't use calculus in his explanation, because high-level math is not a prerequisite for the course.
In practice, the code for the gradients comes directly from applying the partial derivatives.
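As a rough illustration of that point (my own sketch, assuming the softmax + cross-entropy setup discussed earlier in the thread, where the well-known analytic result is \partial \mathcal{L}/\partial z^{[L]} = \hat{y} - \bar{y}), you can compare the exact gradient with a perturbation estimate in the spirit of Andrew's 0.001 bump:

```python
import numpy as np

# Sketch (assuming softmax + cross-entropy as discussed above): compare the
# analytic gradient dL/dz^{[L]} = y_hat - y_bar with a perturbation estimate
# in the spirit of Andrew's 0.001 bump.
def softmax(z):
    e = np.exp(z - z.max())                  # stable elementwise exponential
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])            # -log \hat{y}_{y}

z = np.array([1.0, -0.5, 2.0])               # logits z^{[L]}
y = 2                                        # true class (0-indexed)

analytic = softmax(z) - np.eye(3)[y]         # \hat{y} - \bar{y}

eps = 1e-3                                   # the "tiny amount" used for intuition
numeric = np.array([(loss(z + eps * np.eye(3)[k], y) - loss(z, y)) / eps
                    for k in range(3)])

print(analytic)
print(numeric)                               # agrees to a few decimal places
```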
So in practice Andrew's method isn't used? The mathematics presented by @conscell is used instead?
As Tom points out, Andrew has not shown any actual mathematics for back prop here. He doesn't do that until you get to the DLS specialization, although even there he just shows the resulting formulas and does not cover the mathematical derivation of it. All he's doing here is just giving you some high-level intuition for how it works.
The formulas he will show in DLS are equivalent to what Pavel has carefully shown above, although Prof Ngâs notation is a bit different.
Yes.
This teaching method is very common in Andrew's lectures.
I see. It's a little bit misleading, as I have studied back prop according to Andrew's explanation, and it seems the way he explains it isn't actually how it's done in practice, such as in commercial projects.
This is true.
Are you an ML practitioner in industry?
Is there a Coursera Specialization that teaches ML industry-standard algorithms to get someone ready for working in industry?
Can you take me through how you arrived at this expression, as Andrew doesn't produce this in his course so far:
For l \le L - 1 we have
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l + 1]}} {W^{[l+1]}}^\top \right) \circ g'(z^{[l]})
where \circ denotes the elementwise (Hadamard) product, and g' is the derivative of the activation function at layer l.
Can you also take me through a more algorithmic explanation of how back prop works according to your mathematics and confirm if this is how it is done in industry?
Thanks.
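While you wait for a full walkthrough, here is one way the quoted recursion is typically turned into code (my own sketch, assuming ReLU hidden layers, a softmax + cross-entropy output layer so that \partial \mathcal{L}^{(i)}/\partial z^{[L]} = \hat{y}^{(i)} - \bar{y}^{(i)}, and the row-vector shapes from the notation summary; it is not an excerpt from the course):

```python
import numpy as np

# Backward-pass sketch (illustration only), using the row-vector convention:
#   dz^{[l]} = (dz^{[l+1]} @ W^{[l+1]}.T) * g'(z^{[l]})   for l <= L - 1,
# starting from dz^{[L]} = y_hat - y_bar for a softmax + cross-entropy output.
def relu_grad(z):
    return (z > 0).astype(z.dtype)           # g'(z) for ReLU, elementwise

def backprop(W, a, z, y_bar, L):
    """W, a, z are dicts keyed by layer number 1..L, cached from the forward
    pass: W[l] is n_{l-1} x n_l, a[l] and z[l] are 1 x n_l, a[0] is the input
    x, and a[L] is the softmax output y_hat.  y_bar is the one-hot target."""
    grads = {}
    dz = a[L] - y_bar                        # dL/dz^{[L]} (1 x N row vector)
    for l in range(L, 0, -1):
        grads[f"W{l}"] = a[l - 1].T @ dz     # dL/dW^{[l]}: n_{l-1} x n_l
        grads[f"b{l}"] = dz                  # dL/db^{[l]}: 1 x n_l
        if l > 1:
            # the quoted recursion: step back from layer l to layer l - 1
            dz = (dz @ W[l].T) * relu_grad(z[l - 1])
    return grads
```

The parameter gradients \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} and \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} follow from z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]} under the same convention.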