Hi there,
I’m currently doing the optional/ungraded extra programming assignment on backpropagation in RNNs.
Why do we need to calculate dxt and dx? We’re not updating the input x, so why bother calculating it?
Thanks,
Nikolaj
It’s an interesting point. Well, there are different RNN architectures, right? E.g. there are some in which x^{<t>} = \hat{y}^{<t-1>}. Forward and backward propagation both walk the timesteps serially, just in opposite orders, so in that kind of architecture dx^{<t>} is exactly the gradient you need to pass into \hat{y}^{<t-1>} and from there into the rest of the computation at timestep t - 1.
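To make that concrete, here’s a minimal NumPy sketch of that kind of “generation” architecture, where \hat{y}^{<t>} is fed back in as x^{<t+1>}. It’s just a toy (made-up sizes, a pretend loss gradient of all ones, biases omitted), not the notebook’s code, but it shows where dx^{<t>} gets consumed in the backward sweep:

```python
import numpy as np

np.random.seed(0)
n_x, n_a, T = 3, 5, 4                      # toy sizes (made up)

Wax = np.random.randn(n_a, n_x) * 0.1
Waa = np.random.randn(n_a, n_a) * 0.1
Wya = np.random.randn(n_x, n_a) * 0.1      # y_hat has the same size as x so it can be fed back in

# Forward: feed each prediction back in as the next input, x<t+1> = y_hat<t>
a = np.zeros((n_a, 1))
x = np.random.randn(n_x, 1)
caches = []
for t in range(T):
    a_prev = a
    a = np.tanh(Wax @ x + Waa @ a_prev)    # biases omitted to keep the sketch short
    y_hat = Wya @ a
    caches.append((x, a_prev, a))
    x = y_hat

# Backward: pretend dL/dy_hat<t> is all ones at every timestep
dWax, dWaa, dWya = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
da_next = np.zeros((n_a, 1))
dx_next = np.zeros((n_x, 1))               # dx<t+1>, flowing back through the x path
for t in reversed(range(T)):
    x_t, a_prev, a_t = caches[t]
    dy = np.ones((n_x, 1)) + dx_next       # gradient from the loss at t PLUS dx<t+1>,
                                           # because y_hat<t> was reused as x<t+1>
    dWya += dy @ a_t.T
    da = Wya.T @ dy + da_next
    dz = (1 - a_t ** 2) * da               # back through tanh
    dWax += dz @ x_t.T
    dWaa += dz @ a_prev.T
    dx_next = Wax.T @ dz                   # this is dx<t>, needed at timestep t - 1
    da_next = Waa.T @ dz

print("dWax:\n", dWax)
```

The key line is `dy = np.ones((n_x, 1)) + dx_next`: the gradient reaching \hat{y}^{<t>} includes the dx computed at timestep t + 1, which is exactly why the backward pass has to produce it.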
Unfortunately I’m not sure we can see a clear answer in what they show us in the notebook, because they don’t really put everything together into a complete backpropagation solution here, the way they did in DLS Course 1 with the simpler fully connected feed-forward nets. The problem is that there are too many varieties of RNNs with different compute graphs. Recall that Prof Ng has already shown us diagrams of a number of possible architectures here in C5 W1: “many to one”, “many to many” with T_x = T_y, “many to many” with T_x \neq T_y, and so forth. In “real life” we don’t have to take care of the backpropagation side of things, because TF just magically takes care of it for us.
Of course the high level point is that we only really care about the gradients for the various parameters (weights and biases), but just as in simpler architectures, we have to compute a lot of “chain rule” factors in order to be able to compute the things we really care about.
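Just to spell those factors out for the basic RNN cell (this is the standard derivation, not anything special about the notebook): with a^{<t>} = \tanh(W_{ax} x^{<t>} + W_{aa} a^{<t-1>} + b_a) and dz^{<t>} = (1 - (a^{<t>})^2) \odot da^{<t>}, the backward step at timestep t gives

dW_{ax} += dz^{<t>} (x^{<t>})^T
dW_{aa} += dz^{<t>} (a^{<t-1>})^T
db_a += dz^{<t>}
da^{<t-1>} = W_{aa}^T dz^{<t>}
dx^{<t>} = W_{ax}^T dz^{<t>}

so once dz^{<t>} is in hand, dx^{<t>} is just one more matrix multiply, and the only question is whether anything downstream actually consumes it.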
At the beginning I gave an example of a case in which dx matters, but there will be cases in which it doesn’t. In those cases, you just don’t apply the gradients. There was an analogy back in the “good old days” of DLS C1, where at each step of back prop we calculate dA^{[l-1]}, dW^{[l]} and db^{[l]}. Since we need dW^{[1]} and db^{[1]}, the layer 1 step of that process also produces dA^{[0]} as a byproduct, but as Prof Ng pointed out in that lecture, A^{[0]} = X, so we just discard that gradient, since it doesn’t make sense to change the input.
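Here’s a tiny illustration of that (toy shapes, made-up data, a ReLU hidden layer and a pretend dL/dZ^{[2]} of all ones, so not literally the DLS C1 notebook):

```python
import numpy as np

np.random.seed(1)
n0, n1, n2, m = 4, 3, 1, 5            # layer sizes and batch size (toy values)
X = np.random.randn(n0, m)            # A0 = X
W1, b1 = np.random.randn(n1, n0), np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1), np.zeros((n2, 1))

# Forward: linear -> ReLU -> linear, just to have something to differentiate
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2

# Backward, pretending dL/dZ2 is all ones
dZ2 = np.ones_like(Z2)
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2

dZ1 = dA1 * (Z1 > 0)
dW1 = dZ1 @ X.T / m                   # the gradient we actually want
db1 = dZ1.sum(axis=1, keepdims=True) / m
dA0 = W1.T @ dZ1                      # = dX: falls out of the same step, then gets discarded
```

dA0 comes out of the same matrix multiply pattern as every other layer’s dA; we just never use it.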
Hi Paul,
Thanks as always for taking the time to clear things up a bit.
Could we in theory also use this dx to search for a specific input sequence that produces a desired output (assuming the model has already been trained)?
Regards,
Nikolaj
I don’t think I remember Prof Ng ever discussing anything like that, but maybe I just missed it. Remember that typically the gradients are generated from the J value computed over the entire training set, so they push the parameters towards a lower overall cost on the whole batch, not for any particular example. Just brainstorming here, but maybe you could take a fully trained model and then run it in “training” mode with just one sample as input. But then do you want to be applying gradients to the weights as well? Probably not, so you’d have to figure out how to apply the gradients only to the input values. I’m not sure how that would work, but if you’re interested you could do some googling and see if you can find any discussions of techniques like that. If no one else has talked about this, maybe it’s worth some experiments to see where you can take it.
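If you do want to experiment with that, one possible setup (purely a sketch, with a random “frozen” model standing in for a trained one and a made-up scalar target) is to hold the weights fixed and run gradient descent on the input sequence itself, using exactly the dx^{<t>} values from the backward pass:

```python
import numpy as np

np.random.seed(2)
n_x, n_a, T = 3, 5, 4
# "Trained" (here: random, frozen) parameters, stand-ins for a real trained model
Wax = np.random.randn(n_a, n_x) * 0.3
Waa = np.random.randn(n_a, n_a) * 0.3
Wya = np.random.randn(1, n_a) * 0.3

target = 1.0                                   # desired final output (made up)
xs = [np.zeros((n_x, 1)) for _ in range(T)]    # the input sequence we optimize

for step in range(200):
    # Forward pass with the current candidate inputs
    a, caches = np.zeros((n_a, 1)), []
    for t in range(T):
        a_prev = a
        a = np.tanh(Wax @ xs[t] + Waa @ a_prev)
        caches.append((xs[t], a_prev, a))
    y = (Wya @ a).item()
    loss = 0.5 * (y - target) ** 2
    if step % 50 == 0:
        print(f"step {step}: loss {loss:.4f}")

    # Backward pass: collect dx<t> only; the weights stay frozen
    da_next = Wya.T * (y - target)
    dxs = [None] * T
    for t in reversed(range(T)):
        x_t, a_prev, a_t = caches[t]
        dz = (1 - a_t ** 2) * da_next
        dxs[t] = Wax.T @ dz                    # dx<t>: the quantity Nikolaj asked about
        da_next = Waa.T @ dz

    for t in range(T):                         # gradient step on the inputs, not the weights
        xs[t] -= 0.5 * dxs[t]

# One final forward pass with the optimized inputs
a = np.zeros((n_a, 1))
for t in range(T):
    a = np.tanh(Wax @ xs[t] + Waa @ a)
print("final output:", (Wya @ a).item(), "vs target:", target)
```

Whether that finds a meaningful input sequence for a real trained model is a separate question, of course; this just shows the mechanics of applying gradients to x instead of to the weights.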