I find it interesting that the NN architecture has two GRU layers.
Question 1: Why do we need two GRU layers? Why is a single GRU layer not sufficient?
Question 2: Why are the two GRU blocks implemented slightly differently? That is, why is an extra dropout added after batch normalization in the second GRU block but not the first?
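For context, the pattern I'm asking about looks roughly like this (a minimal Keras sketch from memory; the input shape, layer sizes, and dropout rates are my placeholders, not the exact model):

```python
from tensorflow.keras.layers import (Input, GRU, Dropout,
                                     BatchNormalization, TimeDistributed, Dense)
from tensorflow.keras.models import Model

# Placeholder input: 100 timesteps, 64 features per timestep.
X_input = Input(shape=(100, 64))

# First GRU block: GRU -> dropout -> batch norm.
X = GRU(units=128, return_sequences=True)(X_input)
X = Dropout(0.5)(X)
X = BatchNormalization()(X)

# Second GRU block: same as above, plus an extra dropout
# *after* batch normalization -- the difference Question 2 asks about.
X = GRU(units=128, return_sequences=True)(X)
X = Dropout(0.5)(X)
X = BatchNormalization()(X)
X = Dropout(0.5)(X)  # the extra dropout in question

# Per-timestep sigmoid output (placeholder head).
outputs = TimeDistributed(Dense(1, activation="sigmoid"))(X)
model = Model(inputs=X_input, outputs=outputs)
model.summary()
```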
Thank you!