There is something wrong with the implementation because it was not intended that way.
The most obvious places to look for mistakes are the UNQ_C6 next_symbol() and UNQ_C7 sampling_decode() functions. In particular, how do you check for the “end of sequence” token in sampling_decode(), and how do you choose the next symbol in next_symbol()? (You can find some hints here.)
I have double-checked many times. In next_symbol() I choose the next token with log_probs = output[0,token_length,:], where token_length is the unpadded length of cur_output_tokens.
For sampling_decode(), the end of the sentence is determined by "while cur_output != EOS: ".
My German translation, translated back to English, reads “I love love the languages languages…”. There is a tendency to repeat every word. Does that mean log_probs should take the next position, output[0,token_length+1,:]? I tried this and it doesn’t work either; the output is the same. Counting positions, it also doesn’t make sense: because indexing starts from 0, using the token length as an index already points one position past the current token.
For symbol, after casting the log_softmax result to int(), do we need to use ind2Word to convert the index to the tensor of the word? The assignment doesn’t seem to suggest this, as ind2Word is not passed to the function, and I think we are not encouraged to use globals inside functions?
A follow-up question: in the next_symbol function, when padding cur_output_tokens, why do we use the length of cur_output_tokens to calculate the power-of-2 padded length? Why don’t we use the input token length instead, since the input is usually longer than the partially translated cur_output_tokens? Do the padded lengths of the input tokens and cur_output_tokens need to be the same, or can they differ? Or are they not correlated?
No, we have a function for that, which takes vocab_file as a parameter. detokenize(...) is, and should be, a global function because we don’t want detokenization to behave differently in different functions.
Padding is needed for batch processing. The model was trained on mini-batches, so all sentences for training in a mini-batch should have the same length (to form matrices, which by definition should have the same number of columns for each row).
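To make the point about equal row lengths concrete, here is a minimal sketch in plain Python. The helper name pad_batch and the pad id 0 are my own assumptions for illustration, not something taken from the assignment.

```python
def pad_batch(sentences, pad_id=0):
    """Right-pad every token list to the length of the longest one.

    pad_batch and pad_id=0 are hypothetical names/values used only
    for this illustration.
    """
    max_len = max(len(s) for s in sentences)
    # Every row gets the same number of columns, so the batch is a matrix.
    return [s + [pad_id] * (max_len - len(s)) for s in sentences]


batch = pad_batch([[11, 12, 13], [21], [31, 32]])
for row in batch:
    print(row)
```

After padding, every row has three columns, so the mini-batch can be stacked into a single matrix.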
In theory, the next_symbol(...) function (the inference) does not need padding and it works without padding (I’m not sure why the course creators asked learners for it). For example, padded = cur_output_tokens +  would work without problems and it would make the last predicted symbol selection simpler (with just -1 index).
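For reference, here is a minimal sketch of the power-of-two padding discussed in this thread. The helper name pad_to_power_of_two, the pad id 0, and the token_length + 1 inside the logarithm are assumptions for illustration and may not match the assignment exactly. The + 1 is there so that the padded list is always strictly longer than the tokens, which keeps index token_length valid.

```python
import math


def pad_to_power_of_two(cur_output_tokens, pad_id=0):
    """Pad a token list up to a power-of-two length.

    Hypothetical helper: the name, pad_id=0 and the "+ 1" are
    assumptions made for this sketch.
    """
    token_length = len(cur_output_tokens)
    # Note the ** (power) operator. The + 1 leaves at least one free slot.
    padded_length = 2 ** math.ceil(math.log2(token_length + 1))
    padded = cur_output_tokens + [pad_id] * (padded_length - token_length)
    return padded, token_length


padded, token_length = pad_to_power_of_two([5, 6, 7])
print(padded, token_length)
# token_length is a valid index into padded: it points at the first
# padding slot, one position after the last real token.
print(padded[token_length])
```

With this kind of padding, index token_length always lands inside the padded sequence, which is why the thread talks about selecting position token_length rather than -1.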
They are correlated but do not have to be the same.
These are two separate matrices for training, and they can have a different number of columns, so their padding can differ. But they are correlated, since the longest English sentence in the mini-batch would correspond to the longest German sentence (longest English sentence length ≈ longest German sentence length).
Oh my lord. Thank you for this last summary comment. Turns out I was using 2^ instead of 2** and that was causing my issue. Thank you PZ2004 for posting the issue and arvyzukai for the solution and where to look.
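For anyone else who hits this: in Python, ^ is the bitwise XOR operator and ** is exponentiation, so the two give very different "padded lengths". A small sketch (the variable names are just for illustration):

```python
exponent = 3

xor_result = 2 ^ exponent     # bitwise XOR: 0b10 ^ 0b11 == 0b01
power_result = 2 ** exponent  # exponentiation: 2 * 2 * 2

print(xor_result)    # 1
print(power_result)  # 8
```

So with ^ the code still runs without any error, but the computed length is wrong, which makes the bug easy to miss.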