Hi @Mayank11

Have you checked this post? It addresses most of your questions. In it you can see all the shapes (you can change the batch dimension from 1 to any number without any problem).

But to try to answer them again:

If I understand you correctly, the lecture shows how you can take one input x_t (one word, the embedding of that word, for example a tensor of shape (1, 512)), concatenate it with the previous hidden state h_{t-1} (a tensor of shape (1, 512)), and do one matrix multiplication with W_ht (whose shape should be (1024, 512)). This way you do just one matrix multiplication, and the result is equivalent to having two weight matrices (W_hx and W_hh) and two inputs (x and h).

Here the batch dimension is 1, but it could be any number.
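Here is a minimal NumPy sketch of that equivalence. The shapes match the lecture (512-dimensional embeddings and hidden state); the weight values are random placeholders just to check the math:

```python
import numpy as np

rng = np.random.default_rng(0)

batch, d = 1, 512  # batch could be any number, e.g. 32
x_t = rng.standard_normal((batch, d))     # word embedding
h_prev = rng.standard_normal((batch, d))  # previous hidden state

W_hx = rng.standard_normal((d, d))
W_hh = rng.standard_normal((d, d))
# stacking the two matrices gives W_ht with shape (1024, 512)
W_ht = np.concatenate([W_hx, W_hh], axis=0)

# two separate multiplications...
two_matmuls = x_t @ W_hx + h_prev @ W_hh
# ...vs. one multiplication with the concatenated input
one_matmul = np.concatenate([x_t, h_prev], axis=1) @ W_ht

assert np.allclose(two_matmuls, one_matmul)
```

The assert passes because block-matrix multiplication gives [x, h] @ [[W_hx], [W_hh]] = x @ W_hx + h @ W_hh, which is exactly why the single-matmul trick works.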

In other words, shifting right turns your input into [0, 54, 23] and you have to predict the target [54, 23, 35]. And of course, your sequence length (max_len) is not 3 as in this example, but longer (like 64 in the assignment), so the shift usually cuts off a padding token rather than the last word.
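Concretely, the shift is just slicing the same token list twice. In this sketch I assume the padding token id is 0 (adjust to whatever your tokenizer uses):

```python
PAD = 0  # assumed padding token id

tokens = [0, 54, 23, 35]  # start token followed by three word ids

input_ids = tokens[:-1]   # [0, 54, 23]  -> fed to the model
target_ids = tokens[1:]   # [54, 23, 35] -> what the model must predict

# With a longer max_len the sequence gets padded first, so the shift
# drops a padding token instead of a real word:
max_len = 6
padded = tokens + [PAD] * (max_len - len(tokens))  # [0, 54, 23, 35, 0, 0]
inp, tgt = padded[:-1], padded[1:]
```

With padding, `inp` is `[0, 54, 23, 35, 0]` and `tgt` is `[54, 23, 35, 0, 0]`: every real word still appears in the target.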

You misunderstand this concept. All of this is considered one time-step. The first layer receives an input and produces an output, which is the input for the layer “above it”; that layer produces an output that in turn is the input for the next layer “above it”, and so on. That is one time-step. Only when all the layers have finished does the next token come in, and the whole thing repeats for every time-step.
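The order of the two loops above can be sketched like this. It is a toy stand-in, not a real RNN cell: tiny sizes, random weights, and a tanh over a single concatenated matmul, just to show that the layer loop sits *inside* the time-step loop:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers, n_steps = 8, 3, 5  # toy sizes (the lecture uses 512)

# one (assumed, randomly initialized) weight matrix per layer
Ws = [rng.standard_normal((2 * d, d)) for _ in range(n_layers)]
h = [np.zeros((1, d)) for _ in range(n_layers)]  # hidden state per layer

def step(x_t):
    """Push ONE token through ALL layers: this is one time-step."""
    inp = x_t
    for l in range(n_layers):
        # each layer combines its own previous hidden state with
        # the output of the layer below it
        h[l] = np.tanh(np.concatenate([inp, h[l]], axis=1) @ Ws[l])
        inp = h[l]  # output becomes input for the layer above
    return inp      # top layer's output for this time-step

# only after all layers finish does the next token arrive
for x_t in rng.standard_normal((n_steps, 1, d)):
    out = step(x_t)
```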

They are hard to interpret for humans. Some are obvious, like punctuation or POS; some seem completely random. A classic post about character-level RNNs shows the features they learn (in the middle of the article).

Cheers