The code outline indicates that we are padding the questions to make the inputs in each batch have the same length. But we are not necessarily making all batches have the same size. Does this mean the batches could be of different sizes?
Apart from a drop in performance, if batches have different sizes, won’t this cause errors when calculating the loss function?
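For concreteness, here is a small sketch of what I mean (made-up numbers, not the assignment code): if each batch is padded to its own longest question, two batches can end up with different widths.

```python
import numpy as np

# Hypothetical example: two batches of token ids, padded separately.
batch_1 = [[5, 2, 9], [7, 1]]          # longest question: 3 tokens
batch_2 = [[3, 8, 4, 6, 2], [1, 1]]    # longest question: 5 tokens

def pad_batch(batch, pad_id=0):
    # Pad every sequence to this batch's own maximum length.
    max_len = max(len(seq) for seq in batch)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in batch])

print(pad_batch(batch_1).shape)  # (2, 3)
print(pad_batch(batch_2).shape)  # (2, 5) -- a different width than batch_1
```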
I did this specialization a while ago. Normally when you train a model you specify the batch size, and that's probably the case here too. If you look more closely at the code you will probably find that the batch size has been specified. The batch size is usually a power of 2 because that maps more neatly onto the binary layout of the hardware's memory, so the computations tend to run faster.
Sometimes when a training set (or another set) is divided into, let's say, x batches, the last batch may be smaller than the others because there are not enough examples left. That's not a big issue in terms of cost, because the cost is averaged over all the batches. And yes, it makes sense for all batches to have the same size so that computing aggregates is a consistent process.
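As a rough illustration (made-up numbers, not the course code), splitting a dataset into fixed-size batches can leave a smaller final batch, and the cost is still just averaged over the batches:

```python
examples = list(range(10))   # 10 training examples (made-up)
batch_size = 4               # batch size is usually a power of 2

batches = [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
print([len(b) for b in batches])        # [4, 4, 2] -> the last batch is smaller

batch_losses = [0.8, 0.7, 0.9]          # made-up per-batch costs
print(sum(batch_losses) / len(batch_losses))   # cost averaged over all batches
```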
Thanks for your response. The batch size is the same, except for the last batch, which you mentioned. I was pointing at the padding length being different across batches. I'm trying to get my head around how a different padded length per batch could cause issues down the line…
If I am understanding right, the overall padded length will be the same, but some sentences need more padding and some less because their original lengths also differ.
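In other words, something like this sketch of my assumption (made-up token ids, not the actual code): one max_len computed over the whole dataset, so every question gets padded to the same length no matter which batch it falls into.

```python
questions = [[5, 2, 9], [7, 1], [3, 8, 4, 6, 2], [1, 1]]   # made-up token ids
global_max_len = max(len(q) for q in questions)            # one length for everything

padded = [q + [0] * (global_max_len - len(q)) for q in questions]
print([len(p) for p in padded])   # [5, 5, 5, 5] -- same width in every batch
```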
That was my assumption too, but that's not how it is implemented in the assignment. At the end of each batch, max_len is recalculated, so there is no mechanism ensuring the same max_len is applied across all batches.
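Roughly this pattern, as a hand-written sketch (the name data_generator and the numbers are just illustrative, not the assignment code): max_len is recomputed from whichever questions land in each batch, so nothing forces the widths to agree across batches.

```python
def data_generator(questions, batch_size, pad_id=0):
    # Sketch only: max_len is recomputed from the questions in each batch.
    for i in range(0, len(questions), batch_size):
        batch = questions[i:i + batch_size]
        max_len = max(len(q) for q in batch)   # recalculated per batch
        yield [q + [pad_id] * (max_len - len(q)) for q in batch]

questions = [[5, 2], [7, 1, 3], [3, 8, 4, 6, 2], [1]]
for batch in data_generator(questions, batch_size=2):
    print(len(batch[0]))   # prints 3, then 5 -- widths differ across batches
```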
If this function goes through all the batches, I think it will find the max_len for all the batches present, and as far as I remember that assignment, it does. I am going to delete the code because it should not be made public.
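To still address the original loss question without posting any assignment code, here is a generic toy sketch: the loss is computed one batch at a time on that batch's own tensor, so the padded width of the other batches never enters the calculation.

```python
import numpy as np

def masked_loss(pred, target, mask):
    # Toy per-batch loss: padded positions are zeroed out before averaging.
    return (((pred - target) ** 2) * mask).sum() / mask.sum()

# Two batches with different padded widths (made-up numbers).
for width in (3, 5):
    pred = np.random.rand(2, width)
    target = np.random.rand(2, width)
    mask = np.ones((2, width))
    mask[:, -1] = 0                     # pretend the last column is padding
    print(masked_loss(pred, target, mask))   # works regardless of the width
```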