Feedforward Neural Networks in Depth

Forward propagation, backward propagation, various activation functions, various cost functions, vectorization… :exploding_head:

Would you like to read my derivation of the mathematics behind feedforward neural networks?


Excellent, in-depth explanation of forward and backward propagation.
Thanks for sharing it with us.


Great resource, thank you for putting it out, Jonas.

One question I have about the Part 1 article: the definition of the activation seems a bit puzzling to me. It is currently defined as

[Image: screenshot of the activation definition from Part 1]

But I thought the activation function (g_j) takes in a scalar value and outputs a scalar value. So I thought the equation should be

[Image: screenshot of the poster's proposed definition]

This is at least my understanding from Course 1, which I am currently taking. I would appreciate it if anyone could point out what I might have misunderstood… Thanks a lot again :pray:


Great question. If you check out the second article in the series, you will find that I also derive the math for the softmax activation function. Since softmax normalizes by a sum over all nodes in the layer, the values of all other nodes in the current layer are used to compute the value for one node. Hence, I need a more general formulation of the activation function than the one used in the lectures, where the derivation is left as an exercise.
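
To make the point about softmax concrete, here is a minimal NumPy sketch (not taken from the articles; the function names and the example values are my own assumptions). It changes the pre-activation of one node and checks whether the activation of a different node moves. For an elementwise function such as sigmoid it should not, while for softmax it should.

```python
import numpy as np

def sigmoid(z):
    # Elementwise: output j depends only on z[j].
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Output j depends on every entry of z through the normalizing sum.
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0])   # illustrative pre-activations for one layer
z_changed = z.copy()
z_changed[2] = 5.0              # change only the third node's pre-activation

# Is the first node's activation unaffected by the change to the third node?
sigmoid_unchanged = np.isclose(sigmoid(z)[0], sigmoid(z_changed)[0])
softmax_unchanged = np.isclose(softmax(z)[0], softmax(z_changed)[0])
print(sigmoid_unchanged, softmax_unchanged)
```

This is why a definition of the form a_j = g_j(z_j) is enough for sigmoid, ReLU, and tanh, but softmax needs g_j to see the whole vector of pre-activations.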


Thank you for the clarification! So I suppose this formula will start to make more sense once I get into the second course, where we have multi-class classification.


Hi @jonaslalin.

I note that you do not use a 1/m term in (13) and (14) of post one - i.e. your gradients are summed, rather than being averaged as in the lectures. Is this intentional?

In other words, should one average or sum the gradients of dw and db over all training examples?


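Yes, this is intentional. In the lectures, you compute dL/dW and dL/db and then average over the training examples. I never introduce a “loss” function, only the cost function J, so any 1/m factor is already part of J and its partial derivatives.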
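
Here is a rough NumPy sketch of why summing and averaging can give the same numbers. It assumes that the cost is the average of per-example losses, J = (1/m) Σ L_i, as in the lectures. The single linear layer, the squared-error loss, and all variable names are illustrative choices of mine, not something from the articles.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                # number of training examples
A_prev = rng.normal(size=(3, m))     # activations from the previous layer
W = rng.normal(size=(2, 3))
b = rng.normal(size=(2, 1))
Y = rng.normal(size=(2, m))

Z = W @ A_prev + b                   # linear layer with identity activation

# Lecture style: per-example loss L_i = 0.5 * ||z_i - y_i||^2, then average.
dL_dZ = Z - Y                        # column i holds dL_i/dz_i
dW_lecture = (dL_dZ @ A_prev.T) / m
db_lecture = np.sum(dL_dZ, axis=1, keepdims=True) / m

# Cost-only style: J = (1/m) * sum_i L_i, so the 1/m already sits in dJ/dZ.
dJ_dZ = (Z - Y) / m
dW_cost = dJ_dZ @ A_prev.T           # plain sum over examples, no extra 1/m
db_cost = np.sum(dJ_dZ, axis=1, keepdims=True)

print(np.allclose(dW_lecture, dW_cost), np.allclose(db_lecture, db_cost))
```

Under that assumption the two conventions differ only in where the 1/m is written: either explicitly in dW and db, or implicitly inside dJ/dZ.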


Why does this happen? Thanks in advance.

I am very busy at the moment, but I will respond to your question as soon as I can. I believe more learners have the same question. It is true that I need to explain in more detail how I reach equations (13) and (14).

Keep in mind that Z[l] = W[l]A[l-1] + b[l]. Since we are trying to find dZ/dW, we treat everything except W as constant. Thus dZ/dW = A[l-1].


I don’t know if I could help, but here’s my answer:

As we know, for the l-th layer, z[l] = w[l]a[l-1] + b[l]
So, dz[l] / dw[l] = d(w[l]a[l-1] + b[l]) / dw[l] = a[l-1] + 0 = a[l-1]
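
If it helps, the same result can be checked numerically for a single weight. This is only a sketch with made-up shapes and random values: it perturbs one entry w[j, k] and compares the central-difference estimate of dz_j/dw_jk with a[l-1]_k.

```python
import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.normal(size=3)   # a[l-1] for one training example
w = rng.normal(size=(2, 3))   # w[l]
b = rng.normal(size=2)        # b[l]

def z(w):
    # z[l] = w[l] a[l-1] + b[l]
    return w @ a_prev + b

j, k, eps = 0, 2, 1e-6        # perturb the single weight w[j, k]
w_plus = w.copy()
w_plus[j, k] += eps
w_minus = w.copy()
w_minus[j, k] -= eps

# Central-difference approximation of dz_j / dw_jk.
numeric = (z(w_plus)[j] - z(w_minus)[j]) / (2 * eps)
print(numeric, a_prev[k])
```

The two printed values should agree up to rounding, and the other component of z does not react to w[j, k] at all, because that weight only feeds node j.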


That really helps me. Thanks.

Amazing notes.
Thanks for your kind explanation.

Hello, I am confused about how you got the formula underlined in red in my snapshot of Part 2, Softmax. Could you please explain it a little more? Thank you.


In “Feedforward Neural Networks in Depth, Part1…”, equation (12), how do you prove this? The summation of partial derivatives seems odd to me. Thanks in advance.

I think it’s simple multivariable calculus with the chain rule. The given conditions are:

\begin{align} u_i &= g_i(x_1, x_2, \dots, x_j, \dots, x_n) \\ y_k &= f_k(u_1, u_2, \dots, u_i, \dots, u_m) \end{align}

Then the partial derivative of y_k with respect to x_j is:

\begin{align} \frac{\partial y_k}{\partial x_j} &= \frac{\partial}{\partial x_j}f_k(u_1, u_2, \dots, u_m) \\ &= \frac{\partial}{\partial x_j}f_k(g_1(x_1, \dots, x_j, \dots, x_n), \ g_2(x_1, \dots, x_j, \dots, x_n), \dots, \ g_m(x_1, \dots, x_j, \dots, x_n)) \\ &= \frac{\partial f_k}{\partial u_1}\frac{\partial g_1}{\partial x_j} + \frac{\partial f_k}{\partial u_2}\frac{\partial g_2}{\partial x_j} + \dots + \frac{\partial f_k}{\partial u_m}\frac{\partial g_m}{\partial x_j} \\ &= \sum_i\frac{\partial f_k}{\partial u_i}\frac{\partial g_i}{\partial x_j} \end{align}

Since y_k = f_k() and u_i = g_i(), I think equation (12) is correct.
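
For anyone who wants to convince themselves numerically, here is a small sketch of the summed chain rule. The functions f and g are hypothetical examples I made up for illustration (they are not from the articles). It compares the sum over the intermediate variables with a central-difference estimate of the derivative of the composition.

```python
import numpy as np

def g(x):
    # u = g(x): three intermediate values that depend on the inputs x.
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):
    # y = f(u): a scalar that depends on all three intermediates.
    return u[0] + u[1] * u[2]

x = np.array([0.7, -1.3])
u = g(x)
j = 0  # differentiate with respect to the first input

# Analytic partial derivatives of the hypothetical f and g above.
df_du = np.array([1.0, u[2], u[1]])            # df/du_i
du_dxj = np.array([x[1], np.cos(x[0]), 0.0])   # du_i/dx_j for j = 0
chain_rule = np.sum(df_du * du_dxj)            # sum over i, as in the derivation

# Central-difference estimate of dy/dx_j on the full composition f(g(x)).
eps = 1e-6
step = np.zeros_like(x)
step[j] = eps
numeric = (f(g(x + step)) - f(g(x - step))) / (2 * eps)
print(chain_rule, numeric)
```

Every intermediate u_i that depends on x_j contributes one term, which is where the summation comes from.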


If you differentiate equation (1) with respect to w[l], you get the a[l-1] that appears in equation (13).

How can I download the articles in PDF format?

Thank you! :wink:

Is there any way to download the articles as PDFs?