I’d really like to know how the look-ahead mask is used in practice, in the decoder.
In the coding exercise, we created a lower triangular matrix so that we can simulate masking all translated words (except for <SOS>), then all except the first translated output word, and so on.
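To make sure I'm describing the same thing, here's a rough sketch of the mask I mean (using NumPy; the exercise may build it differently, and the variable names here are just mine):

```python
import numpy as np

seq_len = 4

# Lower triangular look-ahead mask: row i allows attention only to
# positions 0..i, so position i cannot "see" future tokens.
mask = np.tril(np.ones((seq_len, seq_len)))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]

# Before the softmax, masked-out positions in the attention scores
# are set to -inf so they get zero attention weight.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(mask == 1, scores, -np.inf)
```

Each row of this matrix corresponds to one output position's view of the sequence, which is what makes me wonder how the rows relate to training examples.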
Is each training example then duplicated seq_len times? That is, do we train on each example in the mini-batch multiple times, once for every look-ahead mask value? Or does each training example get a look-ahead mask assigned by its index in the mini-batch (the first example is fully masked, the second example is masked from the 2nd word, etc.)?
I’d love a clarification. Thanks so much!