I am having a hard time understanding RNN concepts. Below are some questions that I have. Any help is appreciated -

The lecture tells us that an RNN processes its input sequentially. Assume that the use case is sentiment analysis.

For a sample input like “I am happy I am learning”

a. Is it right to assume that in first time step embedding for “I” will be sent to the model, then in 2nd time step embedding for “am” will be sent and so on till 6th step when embedding for “learning” will be sent to model?

b. The dimension of x should be (batch_size * max_len * embed_dim). In the forward function, for concatenation to work when one element of x is sent to the model, the dimension of h_t should have been (max_len * embed_dim). But the dimension of h is (embed_dim, 1). How come?

Layers supported in trax are also not clear -

a. The ShiftRight layer shifts input to the right by n_positions which for n_positions = 1 will shift our input to “0”, “am”, “happy”, “I”, “am”, “learning”. Isn’t this causing the first word (“I” here) to be ignored altogether? What is the purpose of ShiftRight layer?

b. I tried to pass just 1 word input to GRU layer. Below is what I did -

But I got error “Number of weight elements (6) does not equal the number of sublayers (3) in: GRU_5.”

What am I doing wrong? What do these sublayer values mean? Where will the model reflect the n_units = 5 that I gave during initialization?

The lecture says that deep RNNs are just RNNs stacked on top of each other. The input to the next RNN layer is the output / activation of the previous layer. Since this happens after 1 time step, does this mean that for a network of k RNNs stacked on top of each other, the earliest we can get an output is after k+1 time steps? If so, what use cases can this possibly be used for?

Yes, that would be the output of embedding layer and input to the RNN layer. (More on this)
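To make the "one embedding per time step" picture concrete, here is a minimal NumPy sketch. The vocabulary, the random embedding table, and `embed_dim = 4` are all made up for illustration; a real model would learn the table.

```python
import numpy as np

sentence = "I am happy I am learning".split()

# Hypothetical vocabulary: id 0 is reserved for padding.
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)), start=1)}

embed_dim = 4  # illustrative size only
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab) + 1, embed_dim))

token_ids = np.array([vocab[w] for w in sentence])  # shape (6,)
embedded = embedding_table[token_ids]               # shape (6, embed_dim)

# The RNN consumes one row of `embedded` per time step.
steps = [(t + 1, word, embedded[t]) for t, word in enumerate(sentence)]
for t, word, vector in steps:
    print(f"time step {t}: {word!r} -> vector of shape {vector.shape}")
```

Both occurrences of "I" map to the same embedding vector; what differs between time steps 1 and 4 is the hidden state the RNN carries along.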

I’m not sure I fully understand your question. What is the context of this? In other words, what are you referring to here? In short, for one output you need (embed_dim, 1), and for more outputs you need (embed_dim, more_dim), but maybe I do not understand the context here.

The purpose of the ShiftRight layer here is to get the targets, so [“I”, “am”, “happy”] would become [“<sos>”, “I”, “am”, “happy”]. Check this post for more details.

Stacking recurrent layers can significantly improve the model’s ability to learn complex patterns and dependencies in sequential data. Deep RNNs can improve generalization by learning a more compact and abstract representation of the data. The lower layers capture the specific details, while the higher layers learn more generalized features that are relevant to the task.
I’m not sure what you mean by “… the earliest we can get an output is after k+1 time step…”

Cheers

P.S. It is better practice to ask one question per topic; this way others might find the same question/answers more easily. It also helps to drive the conversation better (one answer might change the other question).

Thanks for getting back @arvyzukai . PFB some follow-up questions -

I am trying to point out that the dimension of x is (batch_size * max_len * embed_dim) while the dimension of h is (embed_dim). In the forward_GRU function, just before multiplying with Whh, the lecture asked us to vertically concatenate x and h_t. How can we vertically concatenate a (max_len * embed_dim) matrix with an (embed_dim * 1) matrix?

The purpose of the ShiftRight layer here is to get the targets, so [“I”, “am”, “happy”] would become [“<sos>”, “I”, “am”, “happy”]. Check this post for more details.

My question here is - if the input is the embedding of one word at a time, then why shift right? Wouldn’t this cause a shift in the embedding dimension? E.g., if the embedding for “a” is [[[54,23,35]]], it would become [[[0,54,23]]], which could be a different word altogether in that space.

I’m not sure what you mean by “… the earliest we can get an output is after k+1 time step…”

The output of 1 time step of the GRU is y(t), which becomes the input to the next RNN layer. Similarly, the output of that layer is the input to the third RNN layer. By the time the 3rd layer receives its first input, 2 time steps have passed.

The lower layers capture the specific details, while the higher layers learn more generalized features that are relevant to the task.

This is interesting, but I wonder what specific details / generalized features mean here - we got semantic dependencies from the embedding, and we have also learnt about algorithms dedicated to PoS tagging etc… What are these other generalized features? How deep should we go for use cases like machine translation or sentiment analysis?

Finally, apologies for this long post. All my questions revolve around dimensions and relationship between timesteps and hence I asked them all here. I will try to be more specific in future.

Have you checked this post because it addresses most of the questions? In this post you can see all the shapes (you can change batch dimension of 1 to any number without any problem).

But to try to answer them again:

If I understand you correctly, the lecture shows how you can take one input (x_t) (one word, i.e. the embedding of that word, for example a tensor of shape (1, 512)), concatenate it with the previous hidden state (h_{t-1}) (a tensor of shape (1, 512)), and do one matrix multiplication with W_ht (whose shape should be (1024, 512)). This way you do just one matrix multiplication, whose result is equivalent to having two weight matrices (W_hx and W_hh) and two inputs (x and h).
Here the batch dimension is 1, but it could be any number.
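The equivalence can be checked numerically. Below is a minimal NumPy sketch using the shapes from the explanation above; the random weights and the names `W_h`, `hx`, `separate`, and `combined` are placeholders chosen for illustration, not the assignment's variables. It uses the row-vector convention (batch first), so the concatenation is along the last axis; with the lecture's column-vector convention the same idea shows up as a vertical stack.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, embed_dim, hidden_dim = 1, 512, 512

x_t = rng.normal(size=(batch_size, embed_dim))      # embedding of ONE word
h_prev = rng.normal(size=(batch_size, hidden_dim))  # previous hidden state

W_hh = rng.normal(size=(hidden_dim, hidden_dim))    # hidden-to-hidden weights
W_hx = rng.normal(size=(embed_dim, hidden_dim))     # input-to-hidden weights

# Option 1: two separate matrix products, then add.
separate = h_prev @ W_hh + x_t @ W_hx               # shape (1, 512)

# Option 2: concatenate the inputs and stack the weights, then one product.
hx = np.concatenate([h_prev, x_t], axis=-1)         # shape (1, 1024)
W_h = np.concatenate([W_hh, W_hx], axis=0)          # shape (1024, 512)
combined = hx @ W_h                                 # shape (1, 512)

print(hx.shape, W_h.shape, combined.shape)
print("same result:", np.allclose(separate, combined))
```

Note that only a single time step of x (shape (batch_size, embed_dim)) is concatenated with h, never the whole (batch_size, max_len, embed_dim) tensor, which is why the shapes line up.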

In other words, shift right turns your input into [0,54,23], and you have to predict the target [54,23,35]. And of course, your sequence length (max_len) is not 3 as in this example but longer (64 in the assignment), so what usually gets cut off is not the last word but a padding token.
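Here is a small NumPy emulation of that behaviour. It is a sketch of how I understand ShiftRight to work in training mode (pad at the start of axis 1, the sequence axis, and drop the same number of positions from the end), not the trax source itself; the helper `shift_right` and the example ids are made up, and it assumes `n_positions >= 1`.

```python
import numpy as np

def shift_right(x, n_positions=1):
    """Pad zeros at the start of axis 1 and drop n_positions from its end."""
    pad_widths = [(0, 0)] * x.ndim
    pad_widths[1] = (n_positions, 0)  # pad only the sequence axis
    padded = np.pad(x, pad_widths, mode="constant", constant_values=0)
    return padded[:, :-n_positions]

# Token ids, shape (batch=1, seq_len=3): the shift moves whole tokens.
token_ids = np.array([[54, 23, 35]])
shifted_ids = shift_right(token_ids)
print(shifted_ids)  # [[ 0 54 23]]

# Embedded input, shape (batch=1, seq_len=3, embed_dim=2): the shift still
# acts on the sequence axis, so each embedding vector moves as a unit.
embedded = np.arange(6).reshape(1, 3, 2)
shifted_embedded = shift_right(embedded)
print(shifted_embedded)  # [[[0 0] [0 1] [2 3]]]
```

In this emulation the values inside an embedding vector are never shifted into each other; only the position of each token in the sequence changes.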

You misunderstand this concept. All of this is considered one time step. The first layer receives an input and produces an output, which is the input for the layer “above it”; that layer in turn produces an output, which is the input for the next layer “above it”. This is one time step. When all the layers have finished, the next token becomes the input and the whole thing repeats for every time step.
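A minimal sketch of that loop structure, assuming a plain tanh RNN cell instead of a GRU to keep it short (the sizes, the random weights, and every variable name here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, dim, seq_len = 3, 4, 6

# One (W_h, W_x) pair and one hidden state per layer.
weights = [
    (rng.normal(size=(dim, dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1)
    for _ in range(n_layers)
]
hidden = [np.zeros(dim) for _ in range(n_layers)]
inputs = rng.normal(size=(seq_len, dim))  # stand-in for embedded tokens

outputs = []
for t in range(seq_len):                  # outer loop: time steps
    layer_input = inputs[t]
    for k, (W_h, W_x) in enumerate(weights):  # inner loop: the whole stack
        hidden[k] = np.tanh(hidden[k] @ W_h + layer_input @ W_x)
        layer_input = hidden[k]           # feeds the layer "above it"
    outputs.append(layer_input)           # top-layer output for this step

outputs = np.stack(outputs)
print(outputs.shape)  # one top-layer output per time step
```

The inner loop over layers runs to completion inside every iteration of the outer loop, so the top layer already produces an output at the first time step; there is no delay of k steps.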

They are hard for humans to interpret. Some are obvious, like punctuation or POS; some seem completely random. A classic post about character-level RNNs and the features they learn (in the middle of the article)