Hi and thanks for reading my question.
I don’t understand why, in the backpropagation steps, we compute derivatives with respect to the input x.
We never change the words of the sentence or their one-hot representations, do we? So why do we calculate derivatives with respect to them?
Thanks again.
Hi Hamed,
Thanks for your question. Because of the way RNNs are designed, backpropagation needs to cover all parameters of the cell on the way back, as you can see in this picture:
This includes the derivative with respect to the time sample x(t). Just as the forward pass propagates the processed x(t) forward, the backward pass propagates the corresponding gradient dx(t) back. At least this is how I understood it.
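To make this concrete, here is a rough NumPy sketch of one backward step through a basic tanh cell, a_next = tanh(Wax·x(t) + Waa·a_prev + ba). It is my own simplification (the function name and shapes are just illustrative, not the assignment's exact code), but it shows that dx(t) falls out of the same chain rule that produces da_prev and the weight gradients, so computing it costs almost nothing extra:

```python
import numpy as np

# Sketch of the backward pass through one basic RNN cell, assuming
# a_next = tanh(Wax @ x_t + Waa @ a_prev + ba). Not the course's exact code.
def rnn_cell_backward(da_next, x_t, a_prev, a_next, Wax, Waa):
    # Gradient through the tanh non-linearity
    dtanh = (1 - a_next ** 2) * da_next

    # Gradients of the cell parameters (what the optimizer actually updates)
    dWax = dtanh @ x_t.T
    dWaa = dtanh @ a_prev.T
    dba = np.sum(dtanh, axis=1, keepdims=True)

    # Gradient passed back to the previous hidden state (reused at step t-1)
    da_prev = Waa.T @ dtanh

    # Gradient with respect to the input of this time step
    dx_t = Wax.T @ dtanh

    return dx_t, da_prev, dWax, dWaa, dba

# Tiny usage example: n_x input features, n_a hidden units, m examples
n_x, n_a, m = 3, 5, 4
rng = np.random.default_rng(0)
x_t, a_prev = rng.standard_normal((n_x, m)), rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
a_next = np.tanh(Wax @ x_t + Waa @ a_prev + ba)
da_next = rng.standard_normal((n_a, m))
dx_t, da_prev, *_ = rnn_cell_backward(da_next, x_t, a_prev, a_next, Wax, Waa)
print(dx_t.shape, da_prev.shape)  # (3, 4) (5, 4)
```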
Hope it helps.
Happy learning,
Rosa
Thanks @HamedGholami for your question, I was wondering about that, too.
@arosacastillo, why does backpropagation need to cover ALL parameters of the cell on the way back? Unlike da_\text{prev}, dx^{<t>} is not used anywhere later on, is it?
My guess is that it is useful for debugging: some XAI approaches seem to use input gradients to create inputs that lead to the most confident outputs, e.g. for specific classes, in the hope of gaining some insight from the created inputs. Furthermore, looking at gradients seems to be a debugging approach in general, but I have no idea how it helps (besides spotting vanishing or exploding gradients of the weights) and whether gradients of the inputs help there too.
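To illustrate what I mean by input gradients, here is a toy sketch I put together. It is not an RNN and not from the course: just a random linear softmax "model" with made-up names (W, b, target_class), where the gradient of a class score with respect to the input is used to nudge the input itself towards a more confident prediction:

```python
import numpy as np

# Toy illustration of using input gradients: keep the (pretend-trained) weights
# fixed and do gradient ascent on the INPUT to maximize the score of one class.
rng = np.random.default_rng(1)
n_x, n_classes = 8, 3
W = rng.standard_normal((n_classes, n_x))   # stand-in for trained weights
b = rng.standard_normal((n_classes, 1))
x = rng.standard_normal((n_x, 1))           # the input we will modify
target_class = 2

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

print("before:", softmax(W @ x + b)[target_class, 0])
for step in range(100):
    p = softmax(W @ x + b)                  # forward pass
    # Gradient of log p[target_class] w.r.t. the logits is (one_hot - p);
    # the chain rule through the linear layer gives the gradient w.r.t. x.
    one_hot = np.zeros_like(p)
    one_hot[target_class] = 1.0
    dx = W.T @ (one_hot - p)                # the "dx" of this little model
    x += 0.1 * dx                           # gradient ascent on the input itself
print("after: ", softmax(W @ x + b)[target_class, 0])
```

In a real XAI setting the weights would come from a trained network, and the created input (or the raw gradient itself, as a saliency map) is what you would inspect.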
Hi David,
Sorry for the late reply. I am not an expert in RNNs, but I will share my thoughts on your question. Indeed, in the architecture shown above dx(t) does not seem to be used; however, there could be other RNN architectures where those values are used.
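For instance, in a stacked (deep) RNN, the hidden state of a lower layer is the "input" of the layer above it, so the upper layer's dx^{<t>} is exactly what flows into the lower layer's da^{<t>}. Here is a rough sketch to show the idea (my own construction, not from the course; shapes and names are illustrative):

```python
import numpy as np

# Sketch: in a two-layer (stacked) RNN, layer 2's input at time t is layer 1's
# hidden state, so layer 2's dx^<t> becomes part of layer 1's da^<t>.
def cell_backward(da_next, x_t, a_prev, a_next, Wax, Waa):
    dtanh = (1 - a_next ** 2) * da_next
    dx_t = Wax.T @ dtanh      # gradient w.r.t. this cell's input
    da_prev = Waa.T @ dtanh   # gradient w.r.t. the previous hidden state
    return dx_t, da_prev

n_x, n_a, m = 3, 5, 2
rng = np.random.default_rng(2)
x_t = rng.standard_normal((n_x, m))
Wax1, Waa1 = rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a))
Wax2, Waa2 = rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_a))
a1_prev = a2_prev = np.zeros((n_a, m))

# Forward: layer 1's hidden state is layer 2's input
a1 = np.tanh(Wax1 @ x_t + Waa1 @ a1_prev)
a2 = np.tanh(Wax2 @ a1 + Waa2 @ a2_prev)

# Backward: layer 2's dx is (part of) layer 1's da
da2 = rng.standard_normal((n_a, m))   # gradient arriving from the loss
dx2, da2_prev = cell_backward(da2, a1, a2_prev, a2, Wax2, Waa2)
da1 = dx2                             # plus the gradient from step t+1, omitted here
dx1, da1_prev = cell_backward(da1, x_t, a1_prev, a1, Wax1, Waa1)
print(dx2.shape, dx1.shape)  # (5, 2) (3, 2)
```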
Best wishes,
Rosa
Would be nice to hear more on this. I’m also pondering why we calculate dx_t if it’s never used anywhere.
Apart from debugging, one additional point I came across elsewhere is that in generative approaches, for example, the starting inputs matter too.
But it would still be great to hear a more substantiated opinion from the experts.