I created and tried to train my own machine translation transformer model with the regular encoder-decoder architecture, attempting to keep it as close as possible to the original “Attention Is All You Need” paper.
The problem is that my model seems to output the same thing over and over without generating an <EOS> token until many timesteps later. The outputs look something like this for an English-to-Spanish translation:
>>> inference("i like to swim", model, dataset, DEV, 50)
'<SOS> me gusta nadar a nadar te gusta nadar a nadar a nadar les gusta nadar a me gusta nadar como me gusta nadar <EOS>'
The correct output should just be ‘me gusta nadar’, which the model has generated and then gone past, repeating previous outputs again and again.
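For reference, the decoding is a plain greedy loop roughly along these lines (a simplified, framework-agnostic sketch rather than my exact `inference` code; `step_fn`, `sos_id`, and `eos_id` are placeholder names):

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len):
    """Autoregressive greedy decoding.

    step_fn(prefix) -> list of logits over the vocabulary for the
    next token, given the token ids generated so far.
    """
    seq = [sos_id]
    for _ in range(max_len):
        logits = step_fn(seq)
        # pick the highest-scoring token (greedy, no sampling/beam search)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_id)
        if next_id == eos_id:
            break  # stop as soon as <EOS> is produced
    return seq
```

In my case the loop itself terminates correctly on <EOS>; the issue is that the model keeps assigning low probability to <EOS> and high probability to repeating earlier tokens.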
What could be the reason for behaviour like this? For context, I trained the model for 20 epochs on a dataset of 130,000 translation pairs. Should I just keep training for longer?