Why are sentences right-padded with 0 indices up to the max sentence length, instead of left-padded (adding 0s to the beginning of the sentence)? Intuitively, given that this is a many-to-one recurrent network, I would think the model would work better if they were left-padded.
My guess is that it would work equally well either way, and that the choice is arbitrary.
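For what it's worth, in a Keras-style pipeline the padding side is literally just a flag, so it's cheap to try both. A minimal sketch (the toy index sequences here are made up):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 12, 7], [9, 3]]  # toy word-index sequences

# Right-padding: zeros appended to the end
print(pad_sequences(sequences, maxlen=5, padding="post"))
# [[ 5 12  7  0  0]
#  [ 9  3  0  0  0]]

# Left-padding: zeros prepended to the front
print(pad_sequences(sequences, maxlen=5, padding="pre"))
# [[ 0  0  5 12  7]
#  [ 0  0  0  9  3]]
```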
Old thread, I know.
Milan, I think your intuition assumes that, as we start shifting in the zeros, the LSTM starts to “forget” the memory states it had built up. If you were feeding in a windowed version of the data, say the last five words, then you’d be right: on a short right-padded sentence, the last inputs it would have seen are all zeros. But in this case the LSTM has full context over the entire sentence.
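And if the trailing zeros still bother you, frameworks can mask them out so the recurrent state never updates on pad steps at all, in which case the padding side shouldn't matter. A hedged sketch assuming tf.keras (the vocab size and layer widths are arbitrary placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    # mask_zero=True tells downstream layers that index 0 is padding
    Embedding(input_dim=10000, output_dim=64, mask_zero=True),
    LSTM(128),  # masked timesteps are skipped; the hidden state passes through
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
```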
In cases like this, where we’re going with intuition rather than a definitive technical answer, I like to ask myself: could I perform the task as a human? If so, a powerful enough model probably can too.
e.g. If I showed you:
“0 0 0 0 0 0 0 0 0 0 I’m feeling happy today”
vs.
“I’m feeling happy today 0 0 0 0 0 0 0 0 0 0”
My guess is that, after seeing enough such sentences (and assuming you can hold the entire sentence in your memory), the padding side isn’t going to affect your performance!