Hey @Carl_Merrigan,
That’s an intriguing question. Quite frankly, I was stumped for sometime as to from where should I begin to explain from, but let’s start anyways. Before that, let’s state what we already know:
\frac{\partial{J}}{\partial{H}} = \frac{1}{m} W_2^T(\hat{Y} - Y), \\
\frac{\partial{Z_1}}{\partial{W_1}} = X, \\
\frac{\partial{J}}{\partial{W_1}} = \frac{\partial{J}}{\partial{H}} \frac{\partial{H}}{\partial{Z_1}} \frac{\partial{Z_1}}{\partial{W_1}}
Now, the thing in focus is \frac{\partial{H}}{\partial{Z_1}}, where, we already know that H = ReLU(Z_1). We know that:
Z_1 > 0; H = Z_1; \frac{\partial{H}}{\partial{Z_1}} = 1 \\
Z_1 <= 0; H = 0; \frac{\partial{H}}{\partial{Z_1}} = 0
Now, the above derivatives are conditioned on Z_1, but using simple deductions, I can re-write these derivatives conditioned on H, as belows:
H > 0; \frac{\partial{H}}{\partial{Z_1}} = 1 \\
H = 0; \frac{\partial{H}}{\partial{Z_1}} = 0
Now, as per ReLU’s definition, H can’t be less than 0, so, it won’t hurt, if I would change the second condition above from H = 0
to H <= 0
, so, re-writing the derivatives, we get:
H > 0; \frac{\partial{H}}{\partial{Z_1}} = 1 \\
H <= 0; \frac{\partial{H}}{\partial{Z_1}} = 0
Now, if we take a close look at the conditions above, we will find that the derivatives are based upon ReLU(H)
, where H
happens to be the output of ReLU
. Bringing upon an analogy of backward-propagation, we can state that \frac{\partial{J}}{\partial{H}} is the output of ReLU, and \frac{\partial{Z_1}}{\partial{W_1}} is the input. So, we can define the derivative \frac{\partial{H}}{\partial{Z_1}}, based on ReLU(\frac{\partial{J}}{\partial{H}}).
This is something which I guess could be one of the reasons, but I believe that there should be an easier and more intuitive explanation to this. Let me tag in another mentor to shine some light here. Hey @arvyzukai, can you please take a look at this thread, and let us know your take on this? Thanks in advance.
Cheers,
Elemento