Forward propagation, backward propagation, various activation functions, various cost functions, vectorization…
Would you like to read my derivation of the mathematics behind feedforward neural networks?
Excellent, in-depth explanation of forward and backward propagation.
Thanks for sharing it with us.
Great resource, thank you for putting it out, Jonas.
One question I have about the Part 1 article: the definition of the activation seems a bit puzzling to me. There it is defined as
a_j[l] = g_j[l](z_1[l], …, z_n[l]),
i.e. as a function of all the pre-activations in layer l. But I thought the activation function g_j takes in a scalar value and also outputs a scalar value, so I thought the equation should be
a_j[l] = g_j[l](z_j[l]).
This is at least my understanding from the current Course 1. I would appreciate it if anyone could point out what I might have misunderstood. Thanks a lot again!
Great question. If you check out the second article in the series, you will find that I also derive the math for the softmax activation function. Since softmax normalizes over the whole layer, the values of all the other nodes in the current layer are needed to compute the value of one node. Hence, I need a more general formulation of the activation function than the one used in the lectures, where the derivation is left as an exercise.
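As a rough numerical illustration (my own sketch, not code from the articles): if you perturb a single pre-activation, every softmax output in the layer changes, so each activation really does depend on all of the z values in that layer.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the normalization couples all nodes.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
a = softmax(z)

# Perturb only the first pre-activation ...
z_perturbed = z.copy()
z_perturbed[0] += 1e-3

# ... and every activation in the layer changes, not just the first one.
print(softmax(z_perturbed) - a)
```

This is why the general formulation a_j[l] = g_j[l](z_1[l], …, z_n[l]) is needed for softmax, whereas sigmoid, tanh, and ReLU reduce to the element-wise case a_j[l] = g[l](z_j[l]).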
Thank you for the clarification! So I suppose this formula will start to make more sense once I get into the second course, where we have multi-class classification.
Hi @jonaslalin.
I note that you do not use a 1/m term in (13) and (14) of Part 1, i.e. your gradients are summed rather than averaged as in the lectures. Is this intentional?
In other words, should one average or sum the gradients of dw and db over all training examples?
Thanks.
Yes, this is intentional. In the lectures you compute dL/dW and dL/db for a per-example loss L; I never introduce a loss function, only the cost function J.
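In general, and independent of how the articles define J, summing versus averaging the per-example gradients differs only by the constant factor 1/m (writing L^(i) for the loss of the i-th example, which is my notation here, not the articles'):

$$
J_{\text{sum}} = \sum_{i=1}^{m} \mathcal{L}^{(i)}, \qquad
J_{\text{avg}} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}^{(i)}
\quad\Longrightarrow\quad
\frac{\partial J_{\text{avg}}}{\partial W^{[l]}}
= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[l]}}
= \frac{1}{m} \frac{\partial J_{\text{sum}}}{\partial W^{[l]}}.
$$

So both conventions give the same descent direction; the factor 1/m can be absorbed into the learning rate.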
I am very busy at the moment, but I will respond to your question as soon as I can. I believe more learners have the same question. It is true that I need to explain how I reach (13) and (14) in more detail.
Keep in mind that Z[l] = W[l]A[l-1] + b[l]. Since we are trying to find dZ/dW, we treat everything except W as a constant; thus dZ/dW = A[l-1].
I don't know if I can help, but here's my answer:
As we know, for the l-th layer, z[l] = w[l]a[l-1] + b[l].
So dz[l]/dw[l] = d(w[l]a[l-1] + b[l])/dw[l] = a[l-1] + 0 = a[l-1].
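If you want to convince yourself numerically, here is a quick finite-difference check (my own toy example with made-up layer sizes, not code from the articles):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 3, 2                      # made-up layer sizes for illustration
a_prev = rng.standard_normal(n_prev)       # a[l-1]
W = rng.standard_normal((n_curr, n_prev))  # w[l]
b = rng.standard_normal(n_curr)            # b[l]

def z(W):
    return W @ a_prev + b                  # z[l] = w[l] a[l-1] + b[l]

# Finite-difference derivative of z_j with respect to the entry W[j, k].
j, k, eps = 1, 2, 1e-6
W_plus = W.copy()
W_plus[j, k] += eps
dz_dWjk = (z(W_plus)[j] - z(W)[j]) / eps

print(dz_dWjk, a_prev[k])                  # the two values agree: dz_j/dW_jk = a_k[l-1]
```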
That really helps me. Thanks.
Amazing notes.
Thanks for your kind explanation.
In “Feedforward Neural Networks in Depth, Part 1…”, how do you prove equation (12)? The summation of partial derivatives seems odd to me. Thanks in advance.
I think it's just multivariable calculus with the chain rule. The conditions given are
y_k = f_k(u_1, …, u_n), where each u_i = g_i(x_1, …, x_m).
Then the partial derivative of y_k with respect to x_j is
∂y_k/∂x_j = Σ_i (∂y_k/∂u_i)(∂u_i/∂x_j),
which is exactly a sum of partial derivatives over the intermediate variables u_i, so I think equation (12) is correct.
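Here is also a small numerical sanity check of that chain rule (my own toy functions, not taken from the articles), showing that the derivative of y with respect to x_j really is a sum over the intermediate variables u_i (using a single output y for simplicity):

```python
import numpy as np

# Toy composite: u_i = g_i(x) = (A x)_i and y = f(u) = sum_i u_i**2
A = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])

def g(x):
    return A @ x

def f(u):
    return np.sum(u ** 2)

x = np.array([0.7, -1.3])
u = g(x)

# Chain rule: dy/dx_j = sum_i (df/du_i) * (du_i/dx_j) = sum_i 2*u_i * A[i, j]
grad_chain = A.T @ (2 * u)

# Finite-difference check of dy/dx_0
eps = 1e-6
x_plus = x.copy()
x_plus[0] += eps
grad_fd = (f(g(x_plus)) - f(g(x))) / eps

print(grad_chain[0], grad_fd)              # the two values agree
```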
If you differentiate equation (1) with respect to w[l], you get the a[l-1] that appears in equation (13).
How can I download the articles in PDF format?
Thank you!
Is there any way to download the articles as PDFs?