What I am confused about is how forward prop and backprop would be done. Since there are two independent flows, one in the forward direction and one in the backward direction, are there 2 backprops? If yes, what is the loss function for the flow from end to start?
Is this understanding correct:
- Forward prop and backprop are performed independently for both directions, and the loss function is calculated by predicting the output for each direction independently.
- The process is repeated for all training examples for all epochs.
- During the testing stage, the output Y is predicted again by concatenating the values of the hidden outputs, but this time W_y is not updated in any way. The value of W_y is what was obtained during steps 1-2 of the training phase.
Hi @Ritu_Pande
I’m not sure I understand you correctly.
Forward and backward propagation are not performed independently for both directions (or for any architecture - linear layer, bidirectional or non-bidirectional RNN, etc.), since the prediction has to match the target: usually all layers in the model are connected, so every output influences some other layer's input. The same goes for bidirectional (or not) RNNs.
But if what you mean is that the calculations of each direction are independent (which is a different thing from forward- and backward-prop calculations), then yes - you can imagine having two different RNNs, one going in each direction.
Image from the Schuster and Paliwal, 1997 paper.
For a simple example with numbers: imagine one direction producing two numbers for each time step - say [0.1, -0.3] at step 1 (token 1) for the forward-facing RNN - and two additional numbers coming from the other direction - say [0.2, 0.1] at step n (token 1) for the backward-facing RNN, which might be step 31 from its perspective. These two different hidden states are concatenated into [0.1, -0.3, 0.2, 0.1] (note: more rarely they are summed into [0.3, -0.2], averaged, or combined by some other operation) and passed to the subsequent layers.
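As a minimal sketch of that combination step (plain NumPy, using the illustrative numbers above):

```python
import numpy as np

# Hidden states for token 1 (the illustrative values from above)
h_forward = np.array([0.1, -0.3])   # forward-facing RNN, its step 1
h_backward = np.array([0.2, 0.1])   # backward-facing RNN, its last step for token 1

# The usual combination: concatenation
h_concat = np.concatenate([h_forward, h_backward])  # [0.1, -0.3, 0.2, 0.1]

# Rarer alternatives: element-wise sum or average
h_sum = h_forward + h_backward          # [0.3, -0.2]
h_mean = (h_forward + h_backward) / 2
```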
When doing backprop, these numbers ([0.1, -0.3, 0.2, 0.1] or [0.3, -0.2]) are evaluated according to how well they contributed to the prediction (whether they need to be increased or decreased).
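Here is a toy PyTorch sketch of that single backward pass; all the sizes and the binary-sentiment setup are made-up illustration values, not anything from the paper:

```python
import torch
import torch.nn as nn

# Toy bidirectional RNN classifier (hypothetical sizes)
rnn = nn.RNN(input_size=8, hidden_size=2, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 2, 1)   # concatenated fwd+bwd state -> 1 logit

x = torch.randn(1, 31, 8)          # batch of 1, 31 time steps, 8 input features
target = torch.tensor([[1.0]])     # e.g. "positive" sentiment

out, h_n = rnn(x)                  # out: (1, 31, 4) - both directions concatenated
# h_n[0] is the forward RNN's last hidden state, h_n[1] the backward RNN's
h_last = torch.cat([h_n[0], h_n[1]], dim=1)      # shape (1, 4)

loss = nn.functional.binary_cross_entropy_with_logits(classifier(h_last), target)
loss.backward()  # ONE backward pass; gradients flow into both directions' weights
```

Note there is a single loss and a single `loss.backward()`: the gradient flows back through the concatenated state into both directions' weights at once.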
What happens at inference depends on what your application is - for example, if it is sentiment classification of a movie review, then you feed the whole review and the model calculates both directions until it outputs the prediction (the sentiment).
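Continuing the toy sketch above (with a hypothetical `review_embeddings` tensor standing in for the encoded review), inference is just the same forward pass without gradient tracking or weight updates:

```python
with torch.no_grad():
    out, h_n = rnn(review_embeddings)  # hypothetical (1, seq_len, 8) tensor
    sentiment = torch.sigmoid(classifier(torch.cat([h_n[0], h_n[1]], dim=1)))
```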
Cheers
@arvyzukai, thanks for sharing the original paper.
I had assumed that the losses for both RNN directions were calculated separately for their individual backpropagations, and that the hidden states from both directions were mixed to calculate y_pred only during the prediction stage, not the training phase. This understanding was wrong. I now understand that the loss is calculated only once for both directions, based on the algorithm below.