Batch backpropagation needs a bit more clarification

What is step(Z1)? Is it the step function that is 1 where Z1 > 0 and 0 elsewhere, i.e. the derivative of ReLU? Shouldn’t we state this explicitly?
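For concreteness, here is a minimal sketch of what I assume step() means in this context (the name and the convention of taking the slope at 0 to be 0 are my assumptions, not something stated in the assignment):

```python
import numpy as np

def step(Z):
    # Heaviside-style step: 1 where Z > 0, else 0.
    # This is exactly the derivative of ReLU(Z) = max(0, Z),
    # taking the (undefined) derivative at Z == 0 to be 0 by convention.
    return (Z > 0).astype(float)

Z1 = np.array([[-2.0, 0.0, 3.0]])
print(step(Z1))  # [[0. 0. 1.]]
```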

Where is the definition for dJ_batch/db2? Also, the batch trick used for dJ_batch/db1, np.sum(..., axis=1, keepdims=True), has to be applied there as well.
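A short sketch of what I mean, assuming the course convention that examples are columns, so dZ2 (the batch of per-example dJ/dZ2 values) has shape (n_units, m):

```python
import numpy as np

m = 4
# Hypothetical dJ/dZ2 for a batch: one column per example.
dZ2 = np.array([[1.0, 2.0, 3.0, 4.0],
                [0.5, 0.5, 0.5, 0.5]])

# The per-example bias gradient is dZ2 itself, so the batch bias
# gradient averages across the examples axis (axis=1), keeping the
# column shape so it broadcasts against b2:
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
print(db2.shape)  # (2, 1) -- same shape as b2
print(db2)        # [[2.5], [0.5]]
```

Without keepdims=True the result would be a flat (2,) array and later broadcasting against b2 could silently go wrong.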


This was my question as well. But I think their choice to omit the explanation is understandable after all, because it would require a lengthy explanation of the derivative, which is a whole concept in itself from a Calculus class.

This was an excerpt from Andrew Ng’s basic ML course. Basically, what I could gather from this video was that the formula in blue is called a derivative function.

This is another derivative function, this one concerning the bias, where x^{(i)} is now omitted.

Practically, I would plug in \hat{y}^{(i)} - y^{(i)} as well as x^{(i)}, using the model’s function of x in place of \hat{y}.
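Writing the two formulas out as I understand them from the course (a sketch in Andrew’s notation; the 1/m averaging is my assumption about the cost being a mean over m examples):

```latex
\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)
```

The bias formula is the weight formula with the x^{(i)} factor dropped, which matches the screenshot above.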
Now, the step function, as far as I have googled, is this:

So, applying this to our back_prop() function, this translates to the gradient of l2 with respect to z1: wherever z1 <= 0, the step output is 0, so that component of the gradient is 0 as well.
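In code, I believe this is where step(Z1) enters via the chain rule. A sketch with hypothetical names and shapes mimicking the assignment (W2, dZ2, Z1 are my assumptions about the back_prop() internals):

```python
import numpy as np

def step(Z):
    # Derivative of ReLU: 1 where Z > 0, else 0.
    return (Z > 0).astype(float)

# Hypothetical shapes: W2 (1, 3), dZ2 (1, m), Z1 (3, m), m = 4.
rng = np.random.default_rng(0)
W2 = rng.standard_normal((1, 3))
dZ2 = rng.standard_normal((1, 4))
Z1 = rng.standard_normal((3, 4))

# Chain rule back through the ReLU: wherever Z1 <= 0 the ReLU's
# slope is 0, so those entries of the gradient are zeroed out.
dZ1 = (W2.T @ dZ2) * step(Z1)
print(dZ1.shape)  # (3, 4)
```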

The explanation can be summed up as a “rule of Calculus and Mathematics”, as stated by Andrew. I think there is a more fleshed-out explanation of how this derivative function comes about, e.g. this video.

That said, in practice, gradient descent is more efficiently implemented using Python libraries. But we are in the same boat here; I would love to know the thinking behind this mathematical function and how it came about.

Thanks, I was not looking for a lengthy explanation from them, just enough actionable knowledge for the assignment. Nevertheless, I will spend time digesting what you’ve written.

Since we are using batches, we should explain how the batch versions of these formulas differ from the single-example ones, wherever that applies.
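To pull the pieces of this thread together, here is a hedged sketch of a full batch backprop for a 2-layer ReLU/linear net. All names (back_prop_batch, W1, b1, ...) and the columns-as-examples layout are my assumptions, not the assignment’s actual code:

```python
import numpy as np

def back_prop_batch(X, Y, W1, b1, W2, b2, Z1, A1, Z2, A2):
    """Hypothetical batch backprop for a 2-layer ReLU/linear net.

    Columns of X are examples, so every bias gradient must be
    reduced with np.sum(..., axis=1, keepdims=True).
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                  # (n2, m)
    dW2 = (1 / m) * dZ2 @ A1.T                    # (n2, n1)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                 # (Z1 > 0) is step(Z1)
    dW1 = (1 / m) * dZ1 @ X.T                     # (n1, n0)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```

Note both db1 and db2 use the same axis=1, keepdims=True reduction; that is the batch trick I was asking about being applied consistently.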

Thank you for taking the time.