You are taking the derivative of the loss function w.r.t. a vector quantity e^{z^{[L]}}, not a vector component e^{z^{[L]}}_{j} of that vector - correct?
Of course, if you're familiar with matrix calculus, you can proceed straightforwardly. However, there's an important subtlety: in this context, we're dealing with row vectors, which adds extra complexity. You need to be mindful of both the layout and the order in which you apply the chain rule, since most rules are designed for column vectors. To avoid mistakes, a safer approach is to perform the derivations elementwise first, and then verify that they generalize correctly to the vector form. I believe this is the best strategy, especially if you'd rather not deal with 3D tensors in the next steps.
No, we're talking about the derivative of the loss with respect to the logits z^{[L]}, but in order to make the process more tractable we're doing it elementwise.
Not with respect to the logits but with respect to a vector of logits.
What is the derivative of the vector e^{z^{[L]}} with respect to the vector z^{[L]}?
Earlier you defined the following:
\displaystyle {\mathcal L}^{(i)} = - {\bar y}^{(i)} \cdot \log {\hat y}^{(i)}
However, here you now redefine {\mathcal L}^{(i)} as:
\displaystyle {\mathcal L^{(i)}} = - \log {\hat y}^{(i)}_{y^{(i)}}
So which is correct?
Is it because in the first definition you use one-hot vector values {\bar y}^{(i)} instead of logits z_{k}^{[L]}?
Also, please define y^{(i)}.
Just so I'm clear on the mathematics that you have presented to me, can you list all the variables as scalars and vectors?
What variable is g differentiated with respect to in order to get g'?
What is the derivative of the vector e^{z^{[L]}} with respect to the vector z^{[L]}?
Let z^{[L]} \in \mathbb{R}^{1 \times N} be a row vector of logits, and e^{z^{[L]}} \in \mathbb{R}^{1 \times N} be the elementwise exponential (e^{z^{[L]}})_k = e^{z^{[L]}_k}.
The derivative is the Jacobian matrix
\displaystyle \frac{\partial\, e^{z^{[L]}}}{\partial z^{[L]}} = \mathrm{diag}\left(e^{z^{[L]}}\right),
an N \times N diagonal matrix whose k-th diagonal entry is e^{z^{[L]}_k}, because each component of e^{z^{[L]}} depends only on the corresponding component of z^{[L]}.
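If it helps, here is a small numerical check of that Jacobian (my own illustration, not part of the course material), comparing the diagonal form against finite differences:

```python
import numpy as np

# Numerical check (illustration only): the Jacobian of the elementwise
# exponential e^{z} with respect to z is diag(e^{z}).
z = np.array([0.5, -1.0, 2.0])               # a sample logit vector z^{[L]}
analytic = np.diag(np.exp(z))                # diag(e^{z})

eps = 1e-6
numeric = np.zeros((z.size, z.size))
for k in range(z.size):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[k] += eps
    z_minus[k] -= eps
    # column k holds the central finite difference of e^{z} w.r.t. z_k
    numeric[:, k] = (np.exp(z_plus) - np.exp(z_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```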
You observed two forms of the loss:
\mathcal{L}^{(i)} = -\bar{y}^{(i)} \cdot \log \hat{y}^{(i)}
\mathcal{L}^{(i)} = -\log \hat{y}^{(i)}_{y^{(i)}}
These are equivalent, because \bar{y}^{(i)} \in \{0, 1\}^{1 \times N} is one-hot, so the dot product keeps only the term for the true class:
\displaystyle \bar{y}^{(i)} \cdot \log \hat{y}^{(i)} = \sum_{k=1}^{N} \bar{y}^{(i)}_k \log \hat{y}^{(i)}_k = \log \hat{y}^{(i)}_{y^{(i)}}
The scalar y^{(i)} \in \{1, \dots, N\} is the true class label for the i-th example. It corresponds to the index of the 1 in the one-hot vector \bar{y}^{(i)}, i.e. \bar{y}^{(i)}_k = 1 if k = y^{(i)} and 0 otherwise.
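As a quick numerical illustration of the equivalence (my own, not from the course; note that Python indexing is 0-based while the math above uses 1-based class labels):

```python
import numpy as np

# Illustration: the one-hot dot product picks out the log-probability of the
# true class, so the two loss forms give the same number.
y_hat = np.array([0.1, 0.7, 0.2])          # predicted probabilities \hat{y}^{(i)}
y = 1                                      # true class label y^{(i)} (0-indexed here)
y_bar = np.eye(3)[y]                       # one-hot target \bar{y}^{(i)} = [0, 1, 0]

loss_dot = -np.dot(y_bar, np.log(y_hat))   # -\bar{y}^{(i)} . log \hat{y}^{(i)}
loss_idx = -np.log(y_hat[y])               # -log \hat{y}^{(i)}_{y^{(i)}}
print(np.isclose(loss_dot, loss_idx))      # True
```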
Here is the notation summary (a small shape-check sketch follows the list):
x^{(i)} - Training example (Row vector 1 \times n)
y^{(i)} - True class label (Scalar \in \{1, \dots, N\})
\bar{y}^{(i)} - One-hot target label (Row vector 1 \times N)
a^{[l]} - Activations at layer l (Row vector 1 \times n_l)
z^{[l]} - Pre-activations (Row vector 1 \times n_l)
W^{[l]} - Weight matrix (Matrix n_{l-1} \times n_l)
b^{[l]} - Bias vector (Row vector 1 \times n_l)
g(z) - Activation function (Elementwise function)
g'(z) - Derivative of activation (Elementwise function)
\hat{y}^{(i)} - Predicted probabilities (Row vector 1 \times N)
\mathcal{L}^{(i)} - Loss for example i (Scalar)
J - Cost (Scalar)
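To make the shapes above concrete, here is a minimal forward-pass sketch (my own illustration with made-up layer sizes, assuming a ReLU hidden layer and a softmax output; not code from the course):

```python
import numpy as np

# Shape-check sketch (illustration only; layer sizes are made up):
# a 2-layer network in the row-vector convention above.
n, n_1, N = 4, 5, 3                          # input features, hidden units, classes

x = np.random.randn(1, n)                    # x^{(i)}: 1 x n
W1, b1 = np.random.randn(n, n_1), np.zeros((1, n_1))   # W^{[1]}, b^{[1]}
W2, b2 = np.random.randn(n_1, N), np.zeros((1, N))     # W^{[2]}, b^{[2]}

z1 = x @ W1 + b1                             # z^{[1]}: 1 x n_1
a1 = np.maximum(z1, 0)                       # a^{[1]} = g(z^{[1]}), ReLU here
z2 = a1 @ W2 + b2                            # z^{[2]} = z^{[L]}: 1 x N (logits)
y_hat = np.exp(z2) / np.exp(z2).sum()        # softmax -> \hat{y}^{(i)}: 1 x N

print(z1.shape, a1.shape, z2.shape, y_hat.shape)   # (1, 5) (1, 5) (1, 3) (1, 3)
```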
The derivative g' is the derivative of the activation function g with respect to its input, which is z^{[l]}.
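As a concrete example (my addition, taking the sigmoid as the activation): if g(z) = \frac{1}{1 + e^{-z}} applied elementwise, then
\displaystyle g'(z^{[l]}) = g(z^{[l]}) \circ \left(1 - g(z^{[l]})\right),
again applied elementwise to z^{[l]}.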
Thanks.
Please define N, n_l and n.
N is the number of classes, n is the number of input features, and n_l is the number of units in layer l.
Ok, in Andrew's course he explains back prop by first computing the increase in the cost when the loss is increased by a tiny amount, 0.001. From this, he estimates the gradient of the cost with respect to the loss. He then estimates the gradient of the cost in the previous layer using the output layer gradient estimate and this previous layer's activation function and parameters. This is repeated until he reaches the input layer, where he computes the gradient of the cost with respect to that layer's weight and bias parameters.
However, I see from the mathematics you have presented that you are actually computing the gradient of the loss exactly and then computing the gradient of the cost from it at the end, arriving at a different result from Andrew.
Andrew doesn't present the mathematics that you do.
Andrew is giving the classic intuitive explanation for the concept of partial derivatives. He doesn't use calculus in his explanation, because high-level math is not a prerequisite for the course.
In practice, the code for the gradients comes directly from applying the partial derivatives.
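As a rough illustration of that point (my own sketch, assuming the softmax + cross-entropy setup discussed earlier in the thread, where the well-known analytic result is \partial \mathcal{L}/\partial z^{[L]} = \hat{y} - \bar{y}), you can compare the exact gradient with a perturbation estimate in the spirit of Andrew's 0.001 bump:

```python
import numpy as np

# Sketch (assuming softmax + cross-entropy as discussed above): compare the
# analytic gradient dL/dz^{[L]} = y_hat - y_bar with a perturbation estimate
# in the spirit of Andrew's 0.001 bump.
def softmax(z):
    e = np.exp(z - z.max())                  # stable elementwise exponential
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])            # -log \hat{y}_{y}

z = np.array([1.0, -0.5, 2.0])               # logits z^{[L]}
y = 2                                        # true class (0-indexed)

analytic = softmax(z) - np.eye(3)[y]         # \hat{y} - \bar{y}

eps = 1e-3                                   # the "tiny amount" used for intuition
numeric = np.array([(loss(z + eps * np.eye(3)[k], y) - loss(z, y)) / eps
                    for k in range(3)])

print(analytic)
print(numeric)                               # agrees to a few decimal places
```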
So in practice Andrew's method isn't used? The mathematics presented by @conscell is used instead?
As Tom points out, Andrew has not shown any actual mathematics for back prop here. He doesn't do that until you get to the DLS specialization, although even there he just shows the resulting formulas and does not cover the mathematical derivation of it. All he's doing here is just giving you some high-level intuition for how it works.
The formulas he will show in DLS are equivalent to what Pavel has carefully shown above, although Prof Ngâs notation is a bit different.
Yes.
This teaching method is very common in Andrew's lectures.
I see. It's a little bit misleading, as I have studied back prop according to Andrew's explanation, and it seems the way he explains it isn't actually how it's done in practice, such as in commercial projects.
This is true.
Are you an ML practitioner in industry?
Is there a Coursera Specialization that teaches ML industry-standard algorithms to get someone ready for working in industry?
Can you take me through how you arrived at this expression, as Andrew doesn't produce this in his course so far:
For l \le L - 1 we have
\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} = \left( \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l + 1]}} {W^{[l+1]}}^\top \right) \circ g'(z^{[l]})
where \circ denotes the elementwise (Hadamard) product, and g' is the derivative of the activation function at layer l.
Can you also take me through a more algorithmic explanation of how back prop works according to your mathematics and confirm if this is how it is done in industry?
Thanks.
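While you wait for a full walkthrough, here is one way the quoted recursion is typically turned into code (my own sketch, assuming ReLU hidden layers, a softmax + cross-entropy output layer so that \partial \mathcal{L}^{(i)}/\partial z^{[L]} = \hat{y}^{(i)} - \bar{y}^{(i)}, and the row-vector shapes from the notation summary; it is not an excerpt from the course):

```python
import numpy as np

# Backward-pass sketch (illustration only), using the row-vector convention:
#   dz^{[l]} = (dz^{[l+1]} @ W^{[l+1]}.T) * g'(z^{[l]})   for l <= L - 1,
# starting from dz^{[L]} = y_hat - y_bar for a softmax + cross-entropy output.
def relu_grad(z):
    return (z > 0).astype(z.dtype)           # g'(z) for ReLU, elementwise

def backprop(W, a, z, y_bar, L):
    """W, a, z are dicts keyed by layer number 1..L, cached from the forward
    pass: W[l] is n_{l-1} x n_l, a[l] and z[l] are 1 x n_l, a[0] is the input
    x, and a[L] is the softmax output y_hat.  y_bar is the one-hot target."""
    grads = {}
    dz = a[L] - y_bar                        # dL/dz^{[L]} (1 x N row vector)
    for l in range(L, 0, -1):
        grads[f"W{l}"] = a[l - 1].T @ dz     # dL/dW^{[l]}: n_{l-1} x n_l
        grads[f"b{l}"] = dz                  # dL/db^{[l]}: 1 x n_l
        if l > 1:
            # the quoted recursion: step back from layer l to layer l - 1
            dz = (dz @ W[l].T) * relu_grad(z[l - 1])
    return grads
```

The parameter gradients \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}} = {a^{[l-1]}}^\top \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} and \frac{\partial \mathcal{L}^{(i)}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[l]}} follow from z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]} under the same convention.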