I feel that an induction step on Z looks simpler than an induction step on A.
Here is a summary.
specs:
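a(l) = g(z(l))
z(l) = w(l)' a(l-1) + b(l)   [' is transpose, so w(l) has one column per unit of layer l]
j = -(y log(a) + (1-y) log(1-a))   [per-example loss, a is the output-layer activation]
J = sum(j)/m   [cost, averaged over the m examples]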
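To make the shapes in these specs concrete, here is a minimal NumPy sketch of the forward pass and the cost. It assumes g is the sigmoid in every layer and that examples are stacked as columns; the names `sigmoid`, `forward` and `cost` are mine, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """params is a list of (w, b) pairs, one per layer.

    w has shape (n_prev, n_curr) and b has shape (n_curr, 1),
    so that z(l) = w(l)' a(l-1) + b(l) as in the specs.
    Assumption: g is the sigmoid in every layer.
    """
    A = X  # a(0): one example per column
    for w, b in params:
        Z = w.T @ A + b
        A = sigmoid(Z)
    return A

def cost(A_last, Y):
    """J = sum(j)/m with j the per-example cross-entropy loss."""
    m = Y.shape[1]
    j = -(Y * np.log(A_last) + (1 - Y) * np.log(1 - A_last))
    return j.sum() / m
```

With all weights and biases set to zero every activation is 0.5, so J should come out as log(2) regardless of the labels, which is a quick sanity check on the conventions.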
back prop induction:
dZ(l-1) = w(l) dZ(l) * g'(Z(l-1))   [* is elementwise; dZ stores j-derivatives columnwise, one example per column]
dw(l-1) = A(l-2) dZ(l-1)' / m   [J-derivative]
db(l-1) = sum(dZ(l-1) over columns) / m   [J-derivative]
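The induction on Z can be sketched in NumPy as follows. This is only a sketch under assumptions not stated above: g is the sigmoid in every layer, so g'(Z) = A * (1 - A), and the base case is dZ(L) = A(L) - Y, which is the standard result for a sigmoid output with the cross-entropy loss j. The function names and the list layout (`ws`, `bs`, `As`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, ws, bs):
    """Return [A(0), A(1), ..., A(L)] with A(0) = X, one example per column."""
    As = [X]
    for w, b in zip(ws, bs):
        As.append(sigmoid(w.T @ As[-1] + b))  # z(l) = w(l)' a(l-1) + b(l)
    return As

def cost(X, Y, ws, bs):
    A = forward(X, ws, bs)[-1]
    return -(Y * np.log(A) + (1 - Y) * np.log(1 - A)).sum() / Y.shape[1]

def backward(X, Y, ws, bs):
    """Gradients of J via the induction on dZ; ws[l-1], bs[l-1] belong to layer l."""
    m = Y.shape[1]
    As = forward(X, ws, bs)
    L = len(ws)
    dws, dbs = [None] * L, [None] * L
    # Base case (assumption): sigmoid output + cross-entropy gives dZ(L) = A(L) - Y.
    dZ = As[-1] - Y
    for l in range(L, 0, -1):
        dws[l - 1] = As[l - 1] @ dZ.T / m               # dw(l) = A(l-1) dZ(l)' / m
        dbs[l - 1] = dZ.sum(axis=1, keepdims=True) / m  # db(l) = column sum / m
        if l > 1:
            # dZ(l-1) = w(l) dZ(l) * g'(Z(l-1)), with g' = A(1 - A) for the sigmoid
            dZ = (ws[l - 1] @ dZ) * As[l - 1] * (1 - As[l - 1])
    return dws, dbs
```

Note that only dZ is carried from layer to layer; dA never has to be formed explicitly. A finite-difference check on one entry of a weight matrix (nudge it by a small eps and compare the change in `cost` with the matching entry of `dws`) is a reasonable way to confirm the recursion.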
That said, I don't know much about the literature or any specific advantages the original representation may provide. I look forward to your views.