RNN Concepts too confusing

That is correct.

Yes, that would be the output of the embedding layer and the input to the RNN layer. (More on this)
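To make the shapes concrete, here is a minimal sketch in pure Python (hypothetical sizes and table values): the embedding lookup maps each token id to a vector of length `embed_dim`, and the resulting `(seq_len, embed_dim)` matrix is exactly what the RNN layer consumes.

```python
vocab_size, embed_dim = 10, 4

# Hypothetical embedding table: one embed_dim vector per vocabulary id.
embedding = [[0.1 * (i + j) for j in range(embed_dim)] for i in range(vocab_size)]

def embed(token_ids):
    """Map a sequence of token ids to a (seq_len, embed_dim) matrix."""
    return [embedding[t] for t in token_ids]

rnn_input = embed([3, 1, 7])              # embedding output = RNN input
print(len(rnn_input), len(rnn_input[0]))  # → 3 4, i.e. (seq_len, embed_dim)
```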

I’m not sure I fully understand your question. What is the context of this? In other words, what are you referring to here? In short: for one output you need a weight of shape (embed_dim, 1), and for more outputs you need (embed_dim, more_dim) — but maybe I am missing the context here.
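The shape claim can be sketched in pure Python (toy values, hypothetical names): projecting a hidden vector of length `embed_dim` with a weight matrix of shape `(embed_dim, out_dim)` yields `out_dim` outputs, so `out_dim = 1` gives a single output.

```python
embed_dim = 4
hidden = [0.5, -0.2, 0.1, 0.3]  # one hidden vector of length embed_dim

def project(h, weights):
    """weights has shape (embed_dim, out_dim); returns a list of out_dim values."""
    out_dim = len(weights[0])
    return [sum(h[i] * weights[i][j] for i in range(len(h))) for j in range(out_dim)]

w_one = [[0.1] for _ in range(embed_dim)]             # (embed_dim, 1)
w_more = [[0.1, 0.2, 0.3] for _ in range(embed_dim)]  # (embed_dim, 3)

print(len(project(hidden, w_one)))   # → 1 output
print(len(project(hidden, w_more)))  # → 3 outputs
```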

The purpose of the ShiftRight layer here is to produce the decoder inputs from the targets: the sequence is shifted right by one position and a start token is prepended, so [“I”, “am”, “happy”] becomes [“<sos>”, “I”, “am”] (the last token is dropped to keep the length fixed), and the model learns to predict the original, unshifted sequence. Check this post for more details.
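Here is a minimal sketch of the shift-right idea (not the actual library layer), assuming the common length-preserving variant: prepend the start token and drop the last element.

```python
def shift_right(tokens, start="<sos>"):
    """Shift a token sequence right by one, keeping its length unchanged."""
    return [start] + tokens[:-1]

print(shift_right(["I", "am", "happy"]))  # → ['<sos>', 'I', 'am']
```

At each time step the model then sees the previous token as input and the current token as target.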

For custom GRU check this post.

Stacking recurrent (RNN) layers significantly improves the model’s ability to learn complex patterns and dependencies in sequential data. Deep RNNs can also improve generalization by learning a more compact, abstract representation of the data: the lower layers capture specific details, while the higher layers learn more generalized features that are relevant to the task.
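A toy sketch of the stacking idea in pure Python (scalar states and made-up weights, just to show the wiring): the hidden-state sequence produced by the first layer becomes the input sequence of the second layer.

```python
import math

def rnn_layer(inputs, w_x=0.5, w_h=0.3):
    """Scalar 'RNN': h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns all h_t."""
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

seq = [1.0, -0.5, 0.2]
layer1 = rnn_layer(seq)     # lower layer: works on the raw inputs
layer2 = rnn_layer(layer1)  # higher layer: builds on layer 1's features
print(len(layer2))          # one output per time step
```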
I’m not sure what you mean by “… the earliest we can get an output is after k+1 time step…”


P.S. It is better practice to ask one question per topic — that way others can find the same question and answer more easily, and it also helps keep the conversation focused (one answer might change another question).