Questions regarding NLP course 3

Q-1: What is the purpose of this line?
lr_schedule=trax.lr.warmup_and_rsqrt_decay(400, 0.01)
I tried searching the Trax documentation but was unable to understand it!

Q-2: Why are we equalizing the input text lengths within each batch?
In the week 2 assignment we passed the max_length parameter to data_generator, and it applied to all the batches. That made sense to me: use the same length for training, validation, and test.

But in weeks 3 and 4 each batch contains inputs of a different length. Why?

What I understand even less is why we are padding in the first place. RNNs are supposed to be the way to handle data whose length is not fixed! A plain neural network needs a fixed input size, and that is exactly the problem RNNs solve, yet now I see padding everywhere. I don’t know why we are doing this. On top of that, padding a sequence changes the meaning of the input: the RNN doesn’t know that the padded 0s are only there to equalize lengths, and I think the padded positions also affect the weights and biases!

Q-3: Why does data_generator from week 2 yield two Xs?
yield batch_np_arr, batch_np_arr, mask_np_arr

[I also asked some other questions about Trax. If you can answer those as well, please do check them out:]
(Creating a GRU model using Trax)

Q1: There is an explanation of learning rate warmup (Section 12.11.3.4, “Warmup”) here:
https://d2l.ai/chapter_optimization/lr-scheduler.html#warmup
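If it helps, here is a rough re-creation of what that schedule does, written as a sketch in plain Python (this is an illustration of the idea, not the exact Trax implementation). As I understand it, with warmup_and_rsqrt_decay(400, 0.01) the learning rate ramps up linearly to 0.01 over the first 400 steps and then decays proportionally to 1/sqrt(step):

```python
import math

def warmup_and_rsqrt_decay_sketch(n_warmup_steps, max_value):
    """Sketch of warmup + reciprocal-square-root decay (not Trax's exact code)."""
    def schedule(step):
        step = max(step, 1)
        if step < n_warmup_steps:
            return max_value * step / n_warmup_steps          # linear warmup
        return max_value * math.sqrt(n_warmup_steps / step)   # 1/sqrt(step) decay
    return schedule

lr = warmup_and_rsqrt_decay_sketch(400, 0.01)
print(lr(1), lr(400), lr(1600))  # ~2.5e-05, 0.01, 0.005
```

The warmup avoids huge, unstable updates while the weights are still random; the rsqrt decay then shrinks the step size as training converges.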

Q2: Because we are doing matrix multiplication: when you batch the data, every sequence in the batch has to have the same length, otherwise you cannot stack them into a single array and do the matrix multiply. If you fed the model a single example at a time (no batching) you would not need to pad, but learning is much faster with mini-batches.
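To make that concrete, here is a minimal NumPy sketch with made-up token IDs (0 is assumed to be the padding ID) showing why padding is needed before a batch can become one rectangular array:

```python
import numpy as np

# Hypothetical mini-batch of tokenized sentences with different lengths.
batch = [[12, 5, 87], [9, 33], [4, 18, 2, 76, 5]]

max_len = max(len(seq) for seq in batch)
padded = np.array([seq + [0] * (max_len - len(seq)) for seq in batch])
mask = (padded != 0).astype(np.int32)  # 1 for real tokens, 0 for padding

print(padded.shape)  # (3, 5) -- now the whole batch can be multiplied as one matrix
```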

The RNN “knows” where the inputs are padded with the help of the mask: mask_np_arr tells the loss which (padded) predictions do not matter, so those positions contribute nothing and the model weights are not updated because of them.
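A toy illustration of that masking (made-up numbers; this mirrors the idea of passing the mask as per-token weights to the loss, not the exact Trax loss code):

```python
import numpy as np

# Toy per-token losses for one sequence of length 5,
# where the last two positions are padding.
per_token_loss = np.array([0.7, 1.2, 0.4, 0.9, 1.1])
mask           = np.array([1,   1,   1,   0,   0])

# Padded positions contribute nothing to the loss, so they produce no gradient
# and do not move the weights.
masked_loss = np.sum(per_token_loss * mask) / np.sum(mask)
print(masked_loss)  # average over the 3 real tokens only
```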

Q3: The batch is a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical here; the second element is what your predictions are evaluated against, and the mask is 1 for non-padding tokens and 0 for padding.
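As a sketch of the shape of such a generator (a hypothetical simplified version, not the assignment’s exact data_generator), a batch for this kind of next-token prediction task could be produced like this:

```python
import numpy as np

def simple_batch_generator(tokenized_lines, batch_size, pad_id=0):
    """Hypothetical generator yielding (inputs, targets, mask) batches."""
    while True:
        idx = np.random.randint(0, len(tokenized_lines), size=batch_size)
        lines = [tokenized_lines[i] for i in idx]
        max_len = max(len(line) for line in lines)
        batch = np.array([line + [pad_id] * (max_len - len(line)) for line in lines])
        mask = (batch != pad_id).astype(np.int32)
        # Inputs and targets are the same array: the model learns to predict
        # each token from the tokens before it.
        yield batch, batch, mask
```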