I am stuck in the UNQ_C6 section. I would be very grateful if you could help me get out of this confusion.
First of all, where is cur_output_tokens obtained from? This value isn't available until the input has been passed through the model, right?
Why is padding performed here, and why is the formula given under “Click Here for Hints” used to calculate the padding length?
The inputs to the model are the input tokens and the target tokens, right? Since there are no target tokens in this code, I think I may be mistaken. Anyway, I am asking because I don’t understand it well.
cur_output_tokens is passed to the function as a parameter. For example, at the very first step in # UNQ_C7 it is an empty list; at the second step the list has had one token appended (the one that the next_symbol function produced), and so on.
So, in other words, at the start (as in # UNQ_C7), cur_output_tokens is an empty list, after one step the list contains 1 element, and so on.
But, leaving UNQ_C7 aside, you can pass any sequence of tokens as cur_output_tokens and get the prediction for the next one.
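To make the step-by-step growth of cur_output_tokens concrete, here is a minimal sketch of such a decoding loop. It is not the assignment's code: toy_next_symbol is a hypothetical stand-in for next_symbol (it just echoes the input instead of running a model), and the EOS id of 1 is an assumption.

```python
# Hypothetical stand-in for the assignment's next_symbol. A real version
# would run the model on input_tokens and the padded cur_output_tokens.
EOS = 1  # assumed end-of-sentence token id


def toy_next_symbol(input_tokens, cur_output_tokens):
    # Toy rule: echo the input tokens one at a time, then emit EOS.
    position = len(cur_output_tokens)
    if position < len(input_tokens):
        return input_tokens[position]
    return EOS


def greedy_decode(input_tokens, max_steps=20):
    cur_output_tokens = []  # empty list at the very first step
    for _ in range(max_steps):
        token = toy_next_symbol(input_tokens, cur_output_tokens)
        cur_output_tokens.append(token)  # the list grows by one token per step
        if token == EOS:
            break
    return cur_output_tokens


decoded = greedy_decode([17, 42, 8])
print(decoded)  # [17, 42, 8, 1]
```

The caller owns the list and hands it back in on every step, which is why the function itself never needs to "obtain" it from anywhere.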
Padding is needed for the model: the model expects an array of a certain length. In theory, for a batch_size of 1 (as in the next_symbol function) this could work without padding.
The formula is presented in “Hints” because a lot of learners find it hard to calculate this quantity without guidance.
The inputs to this week's model in UNQ_C6 are input_tokens, which are the tokens from the English sentence, and the targets are padded_with_batch, which during training were the tokens from the German sentence but now are our predictions so far (padded).
My longer explanation previously.
Hi my life saver @arvyzukai ~!
I could finish this assignment thanks to you!
Your explanations were so helpful for me to understand what was going on in the assignment.
I still have questions about padding and the formula presented in Hints~
Is padding required for the data fed to the decoder because the operations of a deep learning model are based on matrix operations, and because the current learning is supervised learning?
How was the formula calculating padding length derived? What is the principle behind it?
I could not locate this specific hint, because the wording about the supervised learning part is a bit strange. But in general, as I mentioned, during training we usually train with batches of examples.
In our case each example is a sentence (and a target is also a sentence), but sentences usually vary in length.
Matrix operations require every row of a matrix to be of the same length.
So to meet that requirement we make all sentences in a batch the same length (by padding or truncating when needed).
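As a rough illustration of that padding/truncating step (a sketch only, not the course's data pipeline; the pad id of 0 and the helper name pad_batch are assumptions):

```python
import numpy as np

PAD = 0  # assumed padding token id


def pad_batch(sentences, max_len=None):
    """Pad or truncate each token list so all rows have the same length."""
    if max_len is None:
        # Default: pad everything up to the longest sentence in the batch.
        max_len = max(len(s) for s in sentences)
    rows = []
    for s in sentences:
        clipped = list(s[:max_len])  # truncate if too long
        rows.append(clipped + [PAD] * (max_len - len(clipped)))  # pad if too short
    return np.array(rows)


batch = pad_batch([[5, 6, 7], [8], [9, 10, 11, 12, 13]], max_len=4)
print(batch.shape)  # (3, 4)
print(batch)
```

Without the padding, the three lists above have lengths 3, 1 and 5 and cannot be stacked into one rectangular array.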
The principle: many hardware architectures, especially modern GPUs, are optimized for power-of-two sizes (in other words, it’s better for efficient memory usage and parallelism). That is the reason for “power-of-two”.
The formula is just math: it helps you get 2, 4, 8, 16, 32, 64, and so on. For example, if your sentence length is 9:
log_2(9+1) = ~3.322;
roundup(3.322) = 4;
2^(4) = 16;
So we need to pad the sentence (of 9) to be of length 16.
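The same calculation in Python, as a small sketch. The function names and the pad id of 0 are my own choices for illustration, not necessarily what the assignment uses:

```python
import math

import numpy as np


def padded_length_for(token_length):
    # Next power of two that is at least token_length + 1.
    return 2 ** int(math.ceil(math.log2(token_length + 1)))


def pad_with_batch_dim(cur_output_tokens, pad_id=0):
    token_length = len(cur_output_tokens)
    padded_length = padded_length_for(token_length)
    padded = list(cur_output_tokens) + [pad_id] * (padded_length - token_length)
    # Add a leading batch dimension of size 1: shape (1, padded_length).
    return np.array(padded)[None, :]


print(padded_length_for(9))  # 16
print(padded_length_for(0))  # 1
print(pad_with_batch_dim(list(range(1, 10))).shape)  # (1, 16)
```

Because of the +1 inside the logarithm, the padded length is always strictly greater than the number of tokens, so even an empty list is padded to length 1 and there is always at least one free position after the tokens generated so far.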