Great question. If you check out the second article in the series, you will find that I also derive the math for the softmax activation function. Because softmax normalizes each node by a sum over the whole layer, the values of all other nodes in the current layer are used to compute the value for one node. Hence, I need a more general formulation of the activation function than the one used in the lectures, where the derivation is left as an exercise.
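To make the dependence concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code from the article, and the function names `softmax`, `softmax_jacobian`, and `sigmoid_jacobian` are my own. It compares the matrix of partial derivatives (the Jacobian) of softmax with that of an element-wise activation such as the sigmoid.

```python
import math


def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    # J[i][j] = d a_i / d z_j = a_i * (delta_ij - a_j)
    # Every entry is non-zero in general: each output depends on every input.
    a = softmax(z)
    n = len(a)
    return [
        [a[i] * ((1.0 if i == j else 0.0) - a[j]) for j in range(n)]
        for i in range(n)
    ]


def sigmoid_jacobian(z):
    # Element-wise activation: a_i depends only on z_i,
    # so all off-diagonal entries are zero.
    s = [1.0 / (1.0 + math.exp(-v)) for v in z]
    n = len(s)
    return [
        [s[i] * (1.0 - s[i]) if i == j else 0.0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    z = [1.0, 2.0, 3.0]
    for row in softmax_jacobian(z):
        print(row)
    for row in sigmoid_jacobian(z):
        print(row)
```

For the sigmoid, the Jacobian is diagonal, so a single derivative per node is enough. For softmax, the off-diagonal entries are non-zero, which is why the backpropagation step has to sum over all nodes in the layer.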
