Backward propagation derivation

So I took a look at the derivation of the backward propagation last week, and as I see it again this week I’m a bit confused.
Forward prop:

z^{[L]}=w^{[L]T} a^{[L-1]} + b^{[L]}
a^{[L]}=g^{[L]}(z^{[L]})

First, look at this using the chain rule.

the chain rule says for functions
h(x) = f_1(f_2(x))
h'(x) = f_1'(f_2(x)) f_2'(x)

in this example,

f_1 (z) = g^{[L]}(z)
and
f_2(w) = z^{[L]}(w) which implicitly depends on w

f_1'(z) = g^{[L]'}(z), so f_1'(f_2(z)) = g^{[L]'}(z^{[L]})
f_2'(w) = dz^{[L]} if we treat it as a differential, so
h'(w) = da = g^{[L]'}(z^{[L]}) dz^{[L]}
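As a quick sanity check of that chain-rule step, here is a scalar toy example (made-up numbers, sigmoid standing in for g), comparing the analytic g'(z) times a against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# toy scalars (made up): a_prev and b fixed, w is the variable
a_prev, b, w = 0.7, 0.1, 1.3

def h(w):
    z = w * a_prev + b      # f_2(w) = z(w)
    return sigmoid(z)       # f_1(z) = g(z)

# chain rule: h'(w) = g'(z) * dz/dw = g'(z) * a_prev
z = w * a_prev + b
analytic = sigmoid_prime(z) * a_prev

# central finite-difference check
eps = 1e-6
numeric = (h(w + eps) - h(w - eps)) / (2 * eps)

print(analytic, numeric)    # should agree to roughly 1e-9
```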

but dz depends on w^T, so we need to rearrange our expression for z to have a w in it that we can differentiate with respect to, whether implicitly or directly.

Let’s try multiplying both sides by w on the left. To simplify notation, let’s drop the superscripts and just remember everything here is for layer L.

wz = ww^Ta + wb
What if w is normalized so the weights sum to one, such as by regularization? Then ww^T=I and w is unitary
wz=a +wb
If we differentiate with respect to w as we did in the previous example
z+wdz = da + b+wdb
z-b=w^Ta so
w(w^T(a^{[L]})+da^{[L]})=da^{[L-1]}
but ww^T = I
the resulting equation is
a^{[L]}=da^{[L-1]}-da^{[L]}

This looks like the equation for gradient descent, which is
a^{[L]}=a^{[L-1]}-\alpha da^{[L-1]}

but I think it’s actually the differential equation
a = -\frac{\partial a}{\partial w}
which has solution
a = a_0e^{-w} for some initial a_0
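Quick check that it really does solve that:

\frac{da}{dw} = \frac{d}{dw}\left(a_0 e^{-w}\right) = -a_0 e^{-w} = -a, so a = -\frac{\partial a}{\partial w} as claimed.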
And that’s interesting because it holds regardless of g(z).

Especially since that da is for L-1

Clearly I’m doing something odd here. I was able to get the derivation to “work” before.

If there were instead an integral on that “differential equation” it would just be the equation for a conservative field, which could be evaluated at the two endpoints to get the result.

Probably what’s wrong with that final step of the analysis is that it is a finite-difference equation rather than a differential equation, but there seems to be something wrong before that as well, since I’m not deriving the right equation.

Oh maybe we’re trying to differentiate the L+1 equation with respect to a?

Then

dz^{[L+1]} = w^{[L+1]T} da^{[L]}

and we can say

da^{[L]}=g^{[L]'}(z^{[L]})dz^{[L]}=g^{[L]'}(z^{[L]})w^{[L]T} da^{[L-1]}

?

Maybe it’s aa^T=mI that’s normalized, i.e. m times the identity.

Then we can take the transpose of the whole equation

z = w^T a +b

z^T = a^T w + b^T
then multiply on the left by
a
az^T = aa^T w + a b^T
a z^T = mw + a b^T
Differentiate with respect to w
da z^T + a dz^T = m dw
That alone didn’t solve it
Since this is backward propagation, da for L-1 is 0, and this da corresponds to the index on a, which is L-1,
so
a dz^T = m dw
for index L
but what about that transpose?
maybe it’s not defined as w^T then
dz^T A = m dw
for index L

Haven’t figured out the next step yet

new theory

for index L

a = g(z)
so z = g^{-1}(a)
so dz = (g^{-1})'(a) da = \frac{da}{|g'(g^{-1}(a))|}
It is not really right to just write g'(z) in the denominator, but if computed correctly, it would work. It is also not really right to remove the absolute value sign, but for many of the functions considered, it is okay:
dz = \frac{da}{g'(z)}
da = g'(z) dz
and when matrix dimensions are considered, it becomes clear that it has to be an elementwise product.
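Here is a quick numerical check of that route, with sigmoid standing in for g and made-up values for z: the derivative of the inverse at a comes out as 1/g'(z), elementwise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def sigmoid_inverse(a):           # the logit, i.e. g^{-1}(a) for sigmoid
    return np.log(a / (1.0 - a))

z = np.array([-1.5, 0.2, 2.0])    # made-up values
a = sigmoid(z)

# derivative of the inverse at a, via central finite differences
eps = 1e-7
inv_prime = (sigmoid_inverse(a + eps) - sigmoid_inverse(a - eps)) / (2 * eps)

print(inv_prime)                  # elementwise
print(1.0 / sigmoid_prime(z))     # should match: (g^{-1})'(a) = 1 / g'(z)
```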

Okay, that’s a very non-mathy derivation, but it does kind of establish it.

Hello @s-dorsher,

I would stop at the line below, because that is the chain rule for scalars, but none of the forward prop variables is a scalar.

I don’t see which equation you are trying to prove, but below is my example of applying the chain rule with only scalars:

Cheers,
Raymond

I have thought for a while and decided to use a new symbol J' = mJ. Note that J' is still a scalar. The intention of my post above is to show that we can apply the chain rule to the elements of matrices.

I guess you were looking at these equations?

My point is that, if your goal was to show them, we need to work element by element.
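For instance, taking a generic z = Wa + b entry by entry:

z_i = \sum_j w_{ij} a_j + b_i
\frac{\partial z_i}{\partial w_{ij}} = a_j, \qquad \frac{\partial z_i}{\partial a_j} = w_{ij}

so each z_i is an ordinary scalar function of scalars, and the scalar chain rule applies to it directly.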

In addition to Raymond’s excellent points, note that the forward propagation formulas are different for neural networks than for Logistic Regression. They are:

Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})

The salient point being that there are no transposes involved. Here’s a thread which goes into a bit more detail on how Prof Ng got to that formulation.
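For concreteness, here is a minimal NumPy sketch of those two lines, with made-up layer sizes and tanh standing in for g, just to show that the shapes work out without any transposes:

```python
import numpy as np

np.random.seed(0)
n_x, n_h, m = 3, 4, 5             # made-up sizes: inputs, hidden units, examples

A_prev = np.random.randn(n_x, m)  # A^{[l-1]}, shape (n_x, m)
W = np.random.randn(n_h, n_x)     # W^{[l]},   shape (n_h, n_x) -- no transpose needed
b = np.zeros((n_h, 1))            # b^{[l]},   shape (n_h, 1), broadcast across columns

Z = np.dot(W, A_prev) + b         # Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
A = np.tanh(Z)                    # A^{[l]} = g^{[l]}(Z^{[l]})

print(Z.shape, A.shape)           # both (4, 5)
```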

Have you actually watched all the lectures in Week 3 and done the assignment? I would recommend doing the assignment first, before you go back and try to derive all the formulas from first principles. It will help to make sure we are clear on what the formulas actually are before we go to the trouble of deriving them again.


Sorry for butting in, but why do you need the g^{-1}(a)?

It took me some (much!) time to get it, but when going “backwards”, one just multiplies by all the factors that one generated while “opening up” either (see the sketch after this list):

  1. the activation function (resulting in a factor g'(u), where u is “the current value at that point”, i.e. the z that is being fed into g()), or

  2. the linear function W*a+b, which either gives a factor a if we are looking for the differentiation of J by that w (and we are done), or else a factor w if we need to go deeper into a.
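Something like this minimal sketch for one hidden layer (made-up sizes, tanh hidden layer, sigmoid output with cross-entropy loss so that dZ2 = A2 - Y), just to show the factors being multiplied in on the way back:

```python
import numpy as np

np.random.seed(1)
n_x, n_h, m = 3, 4, 5                       # made-up sizes

X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h), np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward: "opening up" the functions
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backward: multiply in the factors generated on the way in
dZ2 = A2 - Y                                # sigmoid output + cross-entropy folds g'(Z2) in already
dW2 = (1 / m) * dZ2 @ A1.T                  # linear step w.r.t. W2: factor A1
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dA1 = W2.T @ dZ2                            # going deeper into a: factor W2
dZ1 = dA1 * (1 - A1 ** 2)                   # activation step: factor g'(Z1) = 1 - tanh^2(Z1)
dW1 = (1 / m) * dZ1 @ X.T                   # linear step w.r.t. W1: factor X
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```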

Maybe this diagram I made earlier will help

Also note that not all activation functions are injective. E.g. ReLU is not invertible.


Note that weights are real numbers, meaning not just positive, right? They are not square matrices and they are not normalized in any way.

I suspect it was meant to say “the squares of the weights sum to one”. I think the problem is, to get ww^T=I, w needs to be an orthogonal matrix, i.e. the rows have to be orthonormal vectors, which can’t be achieved that way.

In fact, w doesn’t have to be a square matrix, to begin with…


Hello, David @dtonhofer,

I feel that you were, accepting the formulae as given, suggesting how to compute the gradients, while @s-dorsher was trying to prove something with the maths.


There is also this item on the DLS FAQ Thread that has pointers to a number of websites that cover backpropagation explanations and derivations at various levels of detail.

Perhaps the one that is most detailed and most closely oriented as a supplement to the way the networks are presented here is Jonas Slalin’s website.


You are right, I’m completely confused. Sorry about that.

Still, what are we trying to do?

It also seems to me there is a potential confusion between the mathematical dw and the Python accumulator variable dw :thinking:

P.S.

As the neural network applied to the test batch can be regarded as a multi-variable physical system in (W,B) with its own energy term J (uh oh … can we do something with Lagrangian Dynamics? Is there an Action? That’s above my education level), I posed the following question to ChatGPT:

has anyone ever applied runge-kutta integration to compute the weight evolution of a multilayer neural network

ChatGPT answered:

Yes

Elided elements…

Use Cases and Related Work

  1. Continuous-Time Neural Networks
  • Research on neural ODEs (such as Neural Ordinary Differential Equations by Chen et al., 2018) uses advanced numerical integrators, including RK methods, to evolve weights in continuous-time training.
  • This approach generalizes residual networks (ResNets), which can be seen as discretized ODEs.
  2. Adaptive Training Methods
  • Some optimization algorithms modify learning rates adaptively by considering second-order approximations, where RK methods can provide a more accurate trajectory for weight updates.
  3. Hamiltonian and Symplectic Neural Networks
  • Runge-Kutta-based methods are explored in physics-inspired architectures where learning dynamics are modeled as differential equations.

That’s still another branch that opens up!
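Just to make “weight evolution as an ODE” concrete for myself, here is a minimal sketch (my own toy example, nothing from the work cited above): classic RK4 applied to plain gradient flow dw/dt = -\nabla J(w), with a made-up quadratic J.

```python
import numpy as np

# made-up quadratic "cost": J(w) = 0.5 * w^T A w, so grad J(w) = A w
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def grad_J(w):
    return A @ w

def rk4_step(w, h):
    """One classic Runge-Kutta 4 step for dw/dt = -grad J(w)."""
    k1 = -grad_J(w)
    k2 = -grad_J(w + 0.5 * h * k1)
    k3 = -grad_J(w + 0.5 * h * k2)
    k4 = -grad_J(w + h * k3)
    return w + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

w = np.array([1.0, -2.0])       # made-up initial weights
for _ in range(50):
    w = rk4_step(w, h=0.1)

print(w)                        # should approach the minimum at [0, 0]
```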

For this, let’s hear from @s-dorsher :wink: He is our host of this thread.

The DLS is clear on this - dw always means \frac{\partial{J}}{\partial{w}}, though @s-dorsher might have used the notation d-something differently? For example,

Instead of \frac{\partial{J}}{\partial{a}}, here da likely means h'(w) = \frac{dh}{dw} = \frac{\partial{a}}{\partial{w}}.

Even if I was right about the difference, I think it’s still the host’s choice of notation and I have no problem with that, though we readers always need to read carefully.

Cheers,
Raymond

That’s forward propagation. I was discussing backward propagation, and I was using that formula.

I know. I’ll try to rewrite this using tensor notation. That’s going to involve push-forwards and pull-backs, right?

No, I was doing math


Thank you, yes, that is a notation issue.

Okay well that makes sense if it’s a differential equation in some sense. I’ve used those to numerically solve differential equations before, including tensor ones (a small image of a black hole). I feel like I really need to check the math I did though. Thanks so much for looking this up!!!

If you know

a^{[l]}=g^{[l]}(w^{[l]}g^{[l-1]}(w^{[l-1]}...)+b^{[l]})
or
a^{[l]}=g^{[l]}(z^{[l]})

Okay, yeah, I guess the question is whether dz means the derivative of a with respect to z or the derivative of z with respect to a? If it’s the derivative of z with respect to a, then it’s necessary to take the inverse of g before differentiating.

That’s because z^{[l]}=g^{[l],-1}(a^{[l]})
where the -1 in the exponent means inverse

Does anyone know?

ww^T is always square

suppose w has dimensions (m,n)
then w transpose has dimensions (n, m)

so ww^T has dimensions (m,m)
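A quick check with made-up dimensions:

```python
import numpy as np

w = np.random.randn(4, 3)     # made-up (m, n) = (4, 3)
print((w @ w.T).shape)        # (4, 4): always square, even though w is not
```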