I am currently working on Exercise 6 of the assignment for Week 1. For implementing the next_symbol
function, I see that the output tokens are padded so that the length of the list is a power of 2. Could you please explain the purpose of this padding? What would happen if we did not pad the list? It was my understanding that the attention layer works with arbitrary sequence lengths.
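For reference, here is a minimal sketch of what I understand the padding step to be doing; `pad_to_power_of_2` and `pad_id` are my own names for illustration, not from the assignment code:

```python
import numpy as np

def pad_to_power_of_2(tokens, pad_id=0):
    """Pad a list of token ids so its length becomes the next power of 2."""
    # Smallest power of 2 that is >= len(tokens) (and at least 1).
    padded_len = 2 ** int(np.ceil(np.log2(max(len(tokens), 1))))
    return tokens + [pad_id] * (padded_len - len(tokens))

# Example: a 5-token output is padded to length 8.
print(pad_to_power_of_2([12, 7, 33, 4, 9]))  # [12, 7, 33, 4, 9, 0, 0, 0]
```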
After more thought, I realized that I was wrong: in practice, the sequence length fed to the attention layer has to be fixed. However, my question now is: how is the model trained on batches with different sequence lengths (which result from the bucketing)?
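To make the question concrete, this is a toy sketch of how I picture bucketing producing batches with different lengths (again, all names and the boundary values are my own, not the assignment's pipeline):

```python
import numpy as np

bucket_boundaries = [8, 16, 32]

def bucket_of(seq):
    # Smallest boundary that fits the sequence.
    for b in bucket_boundaries:
        if len(seq) <= b:
            return b
    return bucket_boundaries[-1]

def pad_batch(batch, pad_id=0):
    # Sequences in the same batch share a bucket, so they pad to one length.
    length = bucket_of(max(batch, key=len))
    return np.array([seq + [pad_id] * (length - len(seq)) for seq in batch])

short_batch = pad_batch([[5, 2, 9], [1, 4]])   # shape (2, 8)
long_batch = pad_batch([[7] * 12, [3] * 10])   # shape (2, 16)
print(short_batch.shape, long_batch.shape)
```

So within each batch the length is fixed, but different batches end up with different lengths. How does training with a single model handle that variation across batches?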