Feedforward Neural Networks in Depth

jonaslalin · February 14, 2022, 9:54am

Forward propagation, backward propagation, various activation functions, various cost functions, vectorization…

Would you like to read my derivation of the mathematics behind feedforward neural networks?

armanbm2013 · March 29, 2022, 8:26am

Excellent in-depth explanation of forward and backward propagation.
Thanks for sharing it with us.

Steven_Kao · April 2, 2022, 2:31am

Great resource, thank you for putting it out, Jonas.

One question I have about the Part 1 article: the definition of the activation seems a bit puzzling to me. It is currently defined as

[Image: Screen Shot 2022-04-02 at 10.08.03 AM]

But I thought the activation function (g_j) is a function that takes in a scalar value and also outputs a scalar value. So I thought the equation should be

[Image: Screen Shot 2022-04-02 at 10.14.18 AM]

This is at least my understanding from Course 1. I would appreciate it if anyone could point out what I might have misunderstood. Thanks a lot again.

jonaslalin · April 2, 2022, 6:51am

Great question. If you check out the second article in the series, you will find that I also derive the math for the softmax activation function. Since softmax normalizes by a sum over the whole layer, the values of all other nodes in the current layer are used to compute the value for one node. Hence, I need a more general formulation of the activation function than the one used in the lectures, where the derivation is left as an exercise.
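To make the point about softmax concrete, here is a minimal NumPy sketch (not taken from the articles; the helper names `softmax` and `sigmoid` and the sample values are assumptions for illustration). It perturbs the input of one node and checks whether the activation of a different node changes.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of pre-activations z."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    """Element-wise sigmoid: each output depends only on its own input."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([1.0, 2.0, 3.0])   # hypothetical pre-activations of one layer
z_changed = z.copy()
z_changed[2] += 1.0             # perturb only the third node

# Sigmoid: the first node's activation ignores the third node's input.
sigmoid_node1_unchanged = np.isclose(sigmoid(z)[0], sigmoid(z_changed)[0])

# Softmax: the first node's activation shifts, because the shared
# normalizing sum has changed.
softmax_node1_unchanged = np.isclose(softmax(z)[0], softmax(z_changed)[0])

print(sigmoid_node1_unchanged, softmax_node1_unchanged)
```

This is why a scalar-to-scalar g_j(z_j) is enough for sigmoid, tanh, or ReLU, while softmax needs g_j to accept every z in the layer.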

Steven_Kao · April 2, 2022, 2:58pm

Thank you for the clarification! So I suppose this formula will start to make more sense once I get into the second course, where we have multi-class classification.

jamesturner246 · April 17, 2022, 8:53am

Hi @jonaslalin.

I note that you do not use a 1/m term in (13) and (14) of post one - i.e. your gradients are summed, rather than being averaged as in the lectures. Is this intentional?

In other words, should one average or sum the gradients of dw and db over all training examples?

Thanks.

jonaslalin · April 18, 2022, 11:55am

Yes, this is intentional. In the lectures you compute dL/dW and dL/db. I never introduce a “loss” function, only the cost function J.
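A small numerical sketch may help show why the two conventions agree. It assumes, purely for illustration, a single linear layer and a mean squared error cost J that already contains the 1/m factor; the sizes and variable names are hypothetical and not from the articles. Differentiating J directly gives a plain sum over examples, while differentiating a per-example loss L and then averaging gives the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_out = 5, 3, 2                 # hypothetical sizes
A_prev = rng.normal(size=(n_in, m))      # one column per training example
Y = rng.normal(size=(n_out, m))
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=(n_out, 1))

Z = W @ A_prev + b                       # linear output layer, so A = Z

# Convention 1: differentiate the cost J = 1/(2m) * sum((Z - Y)**2).
# The 1/m lives inside dJ/dZ, so dW and db are plain sums over examples.
dJ_dZ = (Z - Y) / m
dW_cost = dJ_dZ @ A_prev.T
db_cost = dJ_dZ.sum(axis=1, keepdims=True)

# Convention 2: differentiate the per-example loss L = 1/2 * (z - y)**2,
# then average over the m examples explicitly.
dL_dZ = Z - Y
dW_loss = (dL_dZ @ A_prev.T) / m
db_loss = dL_dZ.sum(axis=1, keepdims=True) / m

same_dW = np.allclose(dW_cost, dW_loss)
same_db = np.allclose(db_cost, db_loss)
print(same_dW, same_db)
```

Under these assumptions the only difference is where the 1/m is written down, not the value of the gradient.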

ljlz163 · May 8, 2022, 3:15pm

Why does this happen? Thanks in advance.

jonaslalin · May 8, 2022, 4:27pm

I am very busy at the moment, but I will respond to your question as soon as I can. I believe more learners have the same question. It is true that I need to explain how I reach equations (13) and (14) in more detail.

Miles_Zhou · May 11, 2022, 4:24am

Keep in mind that Z[l] = W[l]A[l-1] + b[l]. Since we are trying to find dZ[l]/dW[l], we treat everything except W[l] as constant. Thus dZ[l]/dW[l] = A[l-1].

TaeefNajib · May 11, 2022, 9:52pm

I don’t know if I could help, but here’s my answer:

As we know, for the l-th layer, z[l] = w[l]a[l-1] + b[l]
So, dz[l] / dw[l] = d(w[l]a[l-1] + b[l]) / dw[l] = a[l-1] + 0 = a[l-1]
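The result above can be checked numerically. The following is a minimal sketch with made-up sizes and values (none of them come from the articles): it nudges a single weight w[j, k] and compares the finite-difference slope of z[j] with a_prev[k].

```python
import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.normal(size=4)      # hypothetical activations a[l-1]
W = rng.normal(size=(3, 4))      # hypothetical weights w[l]
b = rng.normal(size=3)           # hypothetical biases b[l]

def z(W):
    """Pre-activations z[l] = w[l] a[l-1] + b[l] for a single example."""
    return W @ a_prev + b

j, k, eps = 1, 2, 1e-6           # which weight to nudge, and by how much
W_plus = W.copy()
W_plus[j, k] += eps
W_minus = W.copy()
W_minus[j, k] -= eps

# Central finite difference of every z with respect to w[j, k].
slopes = (z(W_plus) - z(W_minus)) / (2 * eps)

# Row j responds with slope a_prev[k]; the other rows do not respond at all.
matches = np.isclose(slopes[j], a_prev[k])
other_rows = np.delete(slopes, j)
print(matches, other_rows)
```

Note that w[j, k] only influences z[j], which is why the derivative with respect to the whole matrix reduces to a[l-1] row by row.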

ljlz163 · May 29, 2022, 4:51am

That really helps me. Thanks.

Thunderstroke · June 25, 2022, 4:08pm

Amazing notes.
Thanks for your kind explanation.

xujiali · July 6, 2022, 4:31pm

Hello, I am confused about how you got the formula underlined in red in my snapshot of Part 2, Softmax. Could you please explain it a little more? Thank you.

MLzz · July 24, 2022, 9:05pm

In “Feedforward Neural Networks in Depth, Part 1…”, equation (12), how do you prove this? The summation of partial derivatives seems odd to me. Thanks in advance.

anon57530071 · July 25, 2022, 3:52am

I think it’s simple multivariable calculus with the chain rule. The given conditions are:

\begin{align} u_i &= g_i(x_1, x_2, ..., x_j, ..., x_n) \\ y_k &= f_k(u_1, u_2, ..., u_i, ..., u_m) \end{align}

Then the partial derivative of y_k with respect to x_j is:

\begin{align}
\frac{\partial y_k}{\partial x_j} &= \frac{\partial}{\partial x_j}f_k(u_1, u_2, \dots, u_m) \\
&= \frac{\partial}{\partial x_j}f_k(g_1(x_1, x_2, \dots, x_j, \dots, x_n), \ g_2(x_1, x_2, \dots, x_j, \dots, x_n), \dots, \ g_m(x_1, x_2, \dots, x_j, \dots, x_n)) \\
&= \frac{\partial f_k}{\partial g_1}\frac{\partial g_1}{\partial x_j} + \frac{\partial f_k}{\partial g_2}\frac{\partial g_2}{\partial x_j} + \dots + \frac{\partial f_k}{\partial g_m}\frac{\partial g_m}{\partial x_j} \\
&= \sum_i\frac{\partial f_k}{\partial g_i}\frac{\partial g_i}{\partial x_j}
\end{align}

Since y_k = f_k(·) and u_i = g_i(·), I think equation (12) is correct.
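For anyone who prefers a numerical sanity check, here is a minimal sketch of the same chain rule. The functions `g` and `f` below are arbitrary examples chosen for illustration, not anything from the articles; the sum over the intermediate variables u_i is compared with a finite-difference estimate.

```python
import numpy as np

def g(x):
    """Intermediate variables u_1, u_2, each depending on all of x."""
    return np.array([x[0] * x[1] + x[2], np.sin(x[0]) + x[2] ** 2])

def f(u):
    """Scalar output y = f(u_1, u_2)."""
    return u[0] ** 2 * u[1]

def dg_dx(x):
    """Jacobian of g: entry (i, j) is du_i / dx_j."""
    return np.array([[x[1], x[0], 1.0],
                     [np.cos(x[0]), 0.0, 2 * x[2]]])

def df_du(u):
    """Gradient of f: entry i is dy / du_i."""
    return np.array([2 * u[0] * u[1], u[0] ** 2])

x = np.array([0.5, -1.2, 0.8])   # arbitrary evaluation point
u = g(x)

# Chain rule: dy/dx_j = sum_i (dy/du_i) * (du_i/dx_j), for every j at once.
chain_rule = df_du(u) @ dg_dx(x)

# Central finite differences of the composed function, one x_j at a time.
eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])

agree = np.allclose(chain_rule, numeric, atol=1e-6)
print(agree)
```

The sum appears because x_j reaches y through every u_i, and each path contributes its own product of partial derivatives.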

Venkat_Dhinakaran · August 15, 2022, 3:34pm

If you differentiate equation (1) with respect to w[l], you get the a[l-1] term in equation (13).

Manuel_Angelini · August 31, 2022, 1:04pm

How can I download the articles in PDF format?

Pablo_Daniel11 · September 1, 2022, 5:19pm

Thank you!

Is there any way to download the articles as a PDF?
