Hi there,
I’m currently doing the optional/ungraded extra programming assignment on backpropagation in RNNs.
Why do we need to calculate dxt and dx? We’re not updating the input x, so why bother calculating it?
Thanks,
Nikolaj
It’s an interesting point. Well, there are different RNN architectures, right? E.g. there are some in which x^{<t>} = \hat{y}^{<t-1>}. Forward and backward propagation both walk the timesteps serially, just in opposite orders, so in that kind of architecture dx^{<t>} is exactly the gradient you need to pass into \hat{y}^{<t-1>} and from there into the rest of the computation at timestep t - 1.
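To make that concrete, here’s a minimal NumPy sketch of that kind of “generation” architecture, where \hat{y}^{<t>} is fed back in as x^{<t+1>}. It’s just a toy (made-up sizes, a pretend loss gradient of all ones, biases omitted), not the notebook’s code, but it shows where dx^{<t>} gets consumed in the backward sweep:

```python
import numpy as np

np.random.seed(0)
n_x, n_a, T = 3, 5, 4                      # toy sizes (made up)

Wax = np.random.randn(n_a, n_x) * 0.1
Waa = np.random.randn(n_a, n_a) * 0.1
Wya = np.random.randn(n_x, n_a) * 0.1      # y_hat has the same size as x so it can be fed back in

# Forward: feed each prediction back in as the next input, x<t+1> = y_hat<t>
a = np.zeros((n_a, 1))
x = np.random.randn(n_x, 1)
caches = []
for t in range(T):
    a_prev = a
    a = np.tanh(Wax @ x + Waa @ a_prev)    # biases omitted to keep the sketch short
    y_hat = Wya @ a
    caches.append((x, a_prev, a))
    x = y_hat

# Backward: pretend dL/dy_hat<t> is all ones at every timestep
dWax, dWaa, dWya = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
da_next = np.zeros((n_a, 1))
dx_next = np.zeros((n_x, 1))               # dx<t+1>, flowing back through the x path
for t in reversed(range(T)):
    x_t, a_prev, a_t = caches[t]
    dy = np.ones((n_x, 1)) + dx_next       # gradient from the loss at t PLUS dx<t+1>,
                                           # because y_hat<t> was reused as x<t+1>
    dWya += dy @ a_t.T
    da = Wya.T @ dy + da_next
    dz = (1 - a_t ** 2) * da               # back through tanh
    dWax += dz @ x_t.T
    dWaa += dz @ a_prev.T
    dx_next = Wax.T @ dz                   # this is dx<t>, needed at timestep t - 1
    da_next = Waa.T @ dz

print("dWax:\n", dWax)
```

The key line is `dy = np.ones((n_x, 1)) + dx_next`: the gradient reaching \hat{y}^{<t>} includes the dx computed at timestep t + 1, which is exactly why the backward pass has to produce it.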
Unfortunately I’m not sure we can see a clear answer in what they show us in the notebook, because they don’t really put everything together into a complete backpropagation solution here, the way they did in DLS Course 1 with the simpler fully connected feed-forward nets. The problem is that there are too many varieties of RNNs with different compute graphs. Recall that Prof Ng has already shown us diagrams of a number of possible architectures here in C5 W1: “many to one”, “many to many” with T_x = T_y, “many to many” with T_x \neq T_y, and so forth. In “real life” we don’t have to take care of the backpropagation side of things, because TF just magically takes care of it for us.
Of course the high level point is that we only really care about the gradients for the various parameters (weights and biases), but just as in simpler architectures, we have to compute a lot of “chain rule” factors in order to be able to compute the things we really care about.
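Just to spell those factors out for the basic RNN cell (this is the standard derivation, not anything special about the notebook): with a^{<t>} = \tanh(W_{ax} x^{<t>} + W_{aa} a^{<t-1>} + b_a) and dz^{<t>} = (1 - (a^{<t>})^2) \odot da^{<t>}, the backward step at timestep t gives

dW_{ax} += dz^{<t>} (x^{<t>})^T
dW_{aa} += dz^{<t>} (a^{<t-1>})^T
db_a += dz^{<t>}
da^{<t-1>} = W_{aa}^T dz^{<t>}
dx^{<t>} = W_{ax}^T dz^{<t>}

so once dz^{<t>} is in hand, dx^{<t>} is just one more matrix multiply, and the only question is whether anything downstream actually consumes it.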
At the beginning I gave an example of a case in which dx matters, but there will be cases in which it doesn’t. In those cases, you just don’t apply the gradients. There was an analogy back in the “good old days” of DLS C1, where at each step of back prop we calculate dA^{[l-1]}, dW^{[l]} and db^{[l]}. Since we need dW^{[1]} and db^{[1]}, the layer 1 step of that process also produces dA^{[0]} as a byproduct, but as Prof Ng pointed out in that lecture, A^{[0]} = X, so we just discard that gradient, since it doesn’t make sense to change the input.
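Here’s a tiny illustration of that (toy shapes, made-up data, a ReLU hidden layer and a pretend dL/dZ^{[2]} of all ones, so not literally the DLS C1 notebook):

```python
import numpy as np

np.random.seed(1)
n0, n1, n2, m = 4, 3, 1, 5            # layer sizes and batch size (toy values)
X = np.random.randn(n0, m)            # A0 = X
W1, b1 = np.random.randn(n1, n0), np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1), np.zeros((n2, 1))

# Forward: linear -> ReLU -> linear, just to have something to differentiate
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2

# Backward, pretending dL/dZ2 is all ones
dZ2 = np.ones_like(Z2)
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2

dZ1 = dA1 * (Z1 > 0)
dW1 = dZ1 @ X.T / m                   # the gradient we actually want
db1 = dZ1.sum(axis=1, keepdims=True) / m
dA0 = W1.T @ dZ1                      # = dX: falls out of the same step, then gets discarded
```

dA0 comes out of the same matrix multiply pattern as every other layer’s dA; we just never use it.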
Hi Paul,
Thanks as always for taking the time to clear things up a bit.
Could we in theory also use this dx to search for a specific input sequence that produces a desired output (assuming the model has already been trained)?
Regards,
Nikolaj
I don’t think I remember Prof Ng ever discussing anything like that, but maybe I just missed it. Remember that typically the gradients are generated from the J value computed over the entire training set, so they push the parameters towards a lower overall cost on the whole batch, not for any particular example. Just brainstorming here, but maybe you could take a fully trained model and then run it in “training” mode with just one sample as input. But then do you want to be applying gradients to the weights as well? Probably not, so you’d have to figure out how to apply the gradients only to the input values. I’m not sure how that would work, but if you’re interested you could do some googling and see if you can find any discussions of techniques like that. If no one else has talked about this, maybe it’s worth some experiments to see where you can take it.
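If you do want to experiment with that, one possible setup (purely a sketch, with a random “frozen” model standing in for a trained one and a made-up scalar target) is to hold the weights fixed and run gradient descent on the input sequence itself, using exactly the dx^{<t>} values from the backward pass:

```python
import numpy as np

np.random.seed(2)
n_x, n_a, T = 3, 5, 4
# "Trained" (here: random, frozen) parameters, stand-ins for a real trained model
Wax = np.random.randn(n_a, n_x) * 0.3
Waa = np.random.randn(n_a, n_a) * 0.3
Wya = np.random.randn(1, n_a) * 0.3

target = 1.0                                   # desired final output (made up)
xs = [np.zeros((n_x, 1)) for _ in range(T)]    # the input sequence we optimize

for step in range(200):
    # Forward pass with the current candidate inputs
    a, caches = np.zeros((n_a, 1)), []
    for t in range(T):
        a_prev = a
        a = np.tanh(Wax @ xs[t] + Waa @ a_prev)
        caches.append((xs[t], a_prev, a))
    y = (Wya @ a).item()
    loss = 0.5 * (y - target) ** 2
    if step % 50 == 0:
        print(f"step {step}: loss {loss:.4f}")

    # Backward pass: collect dx<t> only; the weights stay frozen
    da_next = Wya.T * (y - target)
    dxs = [None] * T
    for t in reversed(range(T)):
        x_t, a_prev, a_t = caches[t]
        dz = (1 - a_t ** 2) * da_next
        dxs[t] = Wax.T @ dz                    # dx<t>: the quantity Nikolaj asked about
        da_next = Waa.T @ dz

    for t in range(T):                         # gradient step on the inputs, not the weights
        xs[t] -= 0.5 * dxs[t]

# One final forward pass with the optimized inputs
a = np.zeros((n_a, 1))
for t in range(T):
    a = np.tanh(Wax @ xs[t] + Waa @ a)
print("final output:", (Wya @ a).item(), "vs target:", target)
```

Whether that finds a meaningful input sequence for a real trained model is a separate question, of course; this just shows the mechanics of applying gradients to x instead of to the weights.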