So I took a look at the derivation of backward propagation last week, and coming back to it this week I’m a bit confused.
Forward prop:
z^{[L]} = w^{[L]T} a^{[L-1]} + b^{[L]}
a^{[L]} = g^{[L]}(z^{[L]})
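Just for concreteness, here is a tiny numpy sketch of that forward step for a single layer; the layer widths, the tanh activation, and the variable names are placeholders I made up:

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n_L = 4, 3                      # made-up layer widths
    a_prev = rng.normal(size=(n_prev, 1))   # a^{[L-1]}
    w = rng.normal(size=(n_prev, n_L))      # w^{[L]}, shaped so that w.T @ a_prev works
    b = rng.normal(size=(n_L, 1))           # b^{[L]}

    z = w.T @ a_prev + b                    # z^{[L]} = w^{[L]T} a^{[L-1]} + b^{[L]}
    a = np.tanh(z)                          # a^{[L]} = g^{[L]}(z^{[L]}), with g = tanh as an example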
First, let’s look at this using the chain rule.
The chain rule says that for a composition of functions
h(x) = f_1(f_2(x))
the derivative is
h'(x) = f_1'(f_2(x)) f_2'(x)
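As a quick numerical sanity check of that statement, with f_1 = sin and f_2(x) = x^2 picked arbitrarily:

    import numpy as np

    f1, f1p = np.sin, np.cos                     # f_1 and f_1'
    f2, f2p = (lambda x: x**2), (lambda x: 2*x)  # f_2 and f_2'

    x, eps = 0.7, 1e-6
    analytic = f1p(f2(x)) * f2p(x)               # h'(x) = f_1'(f_2(x)) f_2'(x)
    numeric = (f1(f2(x + eps)) - f1(f2(x - eps))) / (2 * eps)  # central difference
    print(analytic, numeric)                     # the two agree closely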
In this example,
f_1(z) = g^{[L]}(z)
and
f_2(w) = z^{[L]}(w), which implicitly depends on w.
f_1'(z) = g^{[L]'}(z), so f_1'(f_2(w)) = g^{[L]'}(z^{[L]})
f_2'(w) dw = dz^{[L]} if we treat it as a differential, so
h'(w) dw = da^{[L]} = g^{[L]'}(z^{[L]}) dz^{[L]}
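And a quick check that this differential relation holds to first order, using tanh as a stand-in for g^{[L]} and a small dz I chose arbitrarily:

    import numpy as np

    g = np.tanh
    g_prime = lambda z: 1.0 - np.tanh(z)**2      # derivative of tanh

    z = np.array([0.3, -1.2, 0.5])               # stand-in for z^{[L]}
    dz = 1e-6 * np.array([1.0, -2.0, 0.5])       # small perturbation dz^{[L]}

    da_exact = g(z + dz) - g(z)                  # actual change in a^{[L]}
    da_linear = g_prime(z) * dz                  # g^{[L]'}(z^{[L]}) dz^{[L]}
    print(np.max(np.abs(da_exact - da_linear)))  # tiny, of order |dz|^2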
But dz^{[L]} depends on w^{[L]T}, so we need to rearrange our expression for z to get a plain w in it before we can differentiate, whether implicitly or directly.
Let’s try multiplying both sides by w on the left. To simplify notation, I’ll drop the [L] superscripts and just remember that everything here is for layer L:
wz = ww^T a + wb
What if w is normalized so the weights sum to one, such as by regularization? Then ww^T = I and w is unitary, and
wz = a + wb
If we differentiate with respect to w as we did in the previous example,
z + w dz = da + b + w db
z - b = w^T a, so
w(w^T a^{[L]} + da^{[L]}) = da^{[L-1]}
but ww^T = I, so the resulting equation is
a^{[L]} = da^{[L-1]} - da^{[L]}
This looks like the equation for gradient descent, which is
a^{[L]} = a^{[L-1]} - \alpha da^{[L-1]}
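By “the equation for gradient descent” I mean the usual iterate update x_{k+1} = x_k - \alpha dx_k, where dx_k is the gradient at x_k. Here is a toy Python version of that update, with the loss and step size chosen arbitrarily:

    f = lambda x: (x - 3.0)**2      # toy loss with minimum at x = 3
    df = lambda x: 2.0 * (x - 3.0)  # its derivative (the "dx" above)
    alpha = 0.1                     # step size, chosen arbitrarily

    x = 0.0
    for k in range(50):
        x = x - alpha * df(x)       # x_{k+1} = x_k - alpha * df(x_k)
    print(x)                        # approaches the minimizer x = 3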
but I think it’s actually the differential equation
a = -\frac{\partial a}{\partial w}
which has solution
a = a_0 e^{-w} for some initial value a_0
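For the record, that solution comes from treating it as a separable ODE:
a = -\frac{da}{dw}
\frac{da}{a} = -dw
\ln a = -w + C
a = a_0 e^{-w}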
And that’s interesting because it holds regardless of g(z).
Especially since that da is for layer L-1.
Clearly I’m doing something odd here. I was able to get the derivation to “work” before.
If there were an integral on that “differential equation” instead, it would just be the equation for a conservative field, which could be evaluated at the two endpoints to get the result.
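That is, the statement I have in mind is the fundamental theorem for line integrals: for a conservative field F = \nabla f,
\int_C \nabla f \cdot dr = f(r_1) - f(r_0)
where r_0 and r_1 are the two endpoints of the path C.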
Probably what’s wrong with that final step of the analysis is that it’s a finite difference rather than a differential equation, but there seems to be something wrong before that as well, since I’m not deriving the right equation.