Summation means summing all the terms. You highlighted the addition sign (+) in the second figure. All terms are added, starting from i = 1 up to i = m.
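For concreteness (a sketch based on the standard multivariable chain rule, since I can't see the exact figure here): with intermediate variables u_1, ..., u_m, the sum being highlighted is

\frac{\partial y_k}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial f_k}{\partial u_i} \frac{\partial u_i}{\partial x_j}

Each term accounts for one path through which x_j influences y_k, and all m paths have to be added together.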
Dear Mr Saif,
May I know why we should add all the terms together in this case?
Thank you.
First, give us the background. What is the relationship between u_{i} and g_{i}(x_{1}, ..., x_{j}, ..., x_{n})? Addition? And between y_{k} and f_{k}(u_{1}, ..., u_{i}, ..., u_{m})?
Thank you for sharing. Your notes were very organized and useful. As an MSc student, I understood most of it.
However, for calculating the backward propagation equations, I prefer writing the chain rule out in full, all the way from the cost function to the target. For example, instead of writing 4 chain rules of length 1, I find it simpler to write a single chain rule of length 4. It’s basically the same thing, but I find the second formulation easier to understand.
Just trying to throw a suggestion out there. Thank you again for the amazing notes.
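To make the suggestion concrete (a schematic sketch with generic layer symbols, ignoring matrix-shape bookkeeping; not necessarily the exact symbols used in the notes): instead of several separate one-step rules, the whole path can be written in one line, e.g.

\frac{\partial J}{\partial W^{[l]}} = \frac{\partial J}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}} \cdots \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}

and then each factor can be substituted at the end.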
You must calculate dZ/dW from eq. (3), Z[l] = W[l]A[l-1] + B[l], where dZ/dW = A[l-1].
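Written out element-wise (a quick sketch in the same spirit as eq. (3)), for unit j in layer l and unit k in layer l-1:

z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]} \quad\Rightarrow\quad \frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = a_k^{[l-1]}

which is why A[l-1] is the factor that appears when the chain rule is applied to dJ/dW.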
Hi, could you clarify the notation for f in (9) in part 1? I was expecting f to be f: \mathbb{R}^{n^{[L]} \times m} \times \mathbb{R}^{n^{[L]} \times m} \to \mathbb{R}, but it is f: \mathbb{R}^{2n^{[L]}} \to \mathbb{R}.
Any idea why the MathML embedded in the neural net SVGs here would fail to render? I get the same issue in Chrome, Firefox, and MS Edge.
I was confused by this when I first saw it as well. I found this article that might be helpful:
(see the section titled "The Generalized Chain Rule")
Thanks for the wonderful work.
Thank you for sharing, @peppy
Thanks for the derivation. After reading the part 1 article, I initially thought that equation #17 was overkill, given that for any layer l, any unit j in that layer, and any training example i, the activation a_j^[l](i) would be a function of only z_j^[l](i) and not of any other z_p^[l](i). However, after reading the part 2 article, and in particular the softmax function, I understood your motivation behind equation 17. Thanks much!
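In case it helps anyone else: with softmax, a_j^{[l]} = e^{z_j^{[l]}} / \sum_p e^{z_p^{[l]}}, so every a_j^{[l]} depends on every z_p^{[l]} in that layer, and the off-diagonal partials are non-zero:

\frac{\partial a_j^{[l]}}{\partial z_p^{[l]}} = a_j^{[l]} \left( \delta_{jp} - a_p^{[l]} \right)

(the standard softmax Jacobian, quoted from memory rather than from the notes), which is exactly why the full sum in equation 17 is needed.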
Coolest thing I have seen today
Hi,
I asked myself the same question and even sent a personal e-mail to @jonaslalin about it before I read this conversation.
Thanks for the explanation, @rmwkwok.
Apologies for cluttering his mailbox.
Francis
No, but you can save it as HTML. There are tools to convert HTML to PDF, but it is touchy.
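One possible route (just a sketch I haven't tested on these particular pages; the file names below are placeholders) is to convert the saved HTML with a library such as WeasyPrint:

# Minimal sketch: HTML -> PDF with WeasyPrint (pip install weasyprint).
# "saved_page.html" and "notes.pdf" are placeholder file names.
from weasyprint import HTML

HTML("saved_page.html").write_pdf("notes.pdf")

Pages that lean heavily on MathML/SVG may still come out imperfect, which is probably where the touchy part comes in.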
Hi, I have a doubt regarding the cost function for multi-class classification. Doesn’t this J give higher priority to classes with higher values, since it is multiplied by y and has log(a[L])? Does it work only because of the softmax activation function, since if we use any other function, its derivative with respect to Z[L] may not come out to be 1/m (A[L] - Y)?
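For reference, the J I am asking about (writing out my understanding of it, not quoting the notes verbatim) is the cross-entropy cost

J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n^{[L]}} y_j^{(i)} \log a_j^{[L](i)}

and it is the softmax + cross-entropy combination that makes dJ/dZ[L] collapse to 1/m (A[L] - Y).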
The author Jonas Lalin is the real expert here, but I’m not sure whether he is still answering questions here. If you don’t get an answer from him, you could check his website (where this material is posted) and see if he has a link there for posting questions.
Are you asking why the domain of function f is \mathbb{R}^{2n^{[L]}} as opposed to \mathbb{R}^{n^{[L]}}, or are you asking what n^{[L]} means there?
If the latter, it is the number of neurons in the output layer of the network.
For the former question, my reading would be that the point is that both A^{[L]} and Y have dimension n^{[L]} x m. So I would have thought the m would need to figure in there. I haven’t read the rest of Jonas’s definitions in a couple of years, but it looks like he is using the same convention that Prof Andrew Ng does that J is the scalar cost function which is the average of the vector loss values L across the samples. So I would have thought that you needed to incorporate the m into the dimension of the domain, but maybe since taking the average is basically trivial we can ignore that. Then we’d be down to just the discretionary part of the function being the vector loss which has 2n^{[L]} inputs.
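Concretely, under that convention (my reading, not a quote from the notes), the per-sample loss takes two vectors in \mathbb{R}^{n^{[L]}}:

L: \mathbb{R}^{n^{[L]}} \times \mathbb{R}^{n^{[L]}} \to \mathbb{R}, \qquad J = \frac{1}{m} \sum_{i=1}^{m} L(a^{[L](i)}, y^{(i)})

So the 2n^{[L]} counts the inputs of the loss, and the averaging over the m samples is the part that is treated as trivial.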
But all this is just notation in any case. When you define the actual functions in question, this should all be clear and will just “come out in the wash”.
His profile indicates that his last post was in May 2022.