Week 3 - Trigger word detection - Why do we need 2 GRU layers?

I find it interesting that the NN architecture has 2 GRU layers.

Question 1: why do we need 2 GRU layers? Why is one ‘gate’ not sufficient?

Question 2: why are the 2 GRU layers implemented slightly differently, i.e. why is a dropout added after batch normalization in the second GRU layer?

Thank you!

Choices such as the number of layers are usually a trade-off between accuracy and the available memory and compute. The notebook author most likely found by experiment that this architecture reaches the desired accuracy; that does not mean a single GRU layer could not work.
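To get a feel for the resource side of that trade-off, here is a rough sketch that counts the trainable parameters of a stack of GRU layers. The sizes used (196 input features, 128 units) are assumptions for illustration, not values taken from this thread, and the two-bias option follows what I understand to be the Keras default (`reset_after=True`).

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Trainable parameters in a single GRU layer.

    A GRU has three weight blocks (update gate, reset gate, candidate state).
    Each block has an input kernel (input_dim x units), a recurrent kernel
    (units x units) and a bias. With reset_after=True each block carries
    two bias vectors instead of one.
    """
    biases_per_block = 2 if reset_after else 1
    return 3 * (units * (input_dim + units) + biases_per_block * units)


def stacked_gru_param_count(input_dim, units, n_layers, reset_after=True):
    """Total parameters for n_layers GRUs stacked on top of each other."""
    total = 0
    in_dim = input_dim
    for _ in range(n_layers):
        total += gru_param_count(in_dim, units, reset_after)
        in_dim = units  # the next layer reads the previous layer's output
    return total


if __name__ == "__main__":
    # Assumed sizes, for illustration only.
    for n in (1, 2):
        print(n, "GRU layer(s):", stacked_gru_param_count(196, 128, n))
```

With these assumed sizes the second layer adds roughly 80% more recurrent parameters, which is the kind of cost the extra layer has to justify with better accuracy.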

Feel free to try a custom NN and observe how results change.
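If it helps, below is a minimal Keras sketch for that kind of experiment. It is not the notebook's exact model: `build_model`, its default shapes, filter count and dropout rate are all placeholder assumptions, and the final `Dropout` only mirrors the extra dropout described in Question 2. The number of GRU blocks is a parameter so that one-layer and two-layer variants can be trained and compared on the same data.

```python
from tensorflow.keras import layers, models


def build_model(input_shape=(1000, 64), n_gru_layers=2, units=128,
                conv_filters=196, dropout=0.8):
    """Hypothetical trigger-word-style model with a variable GRU depth.

    input_shape is (time_steps, features); all defaults are placeholders.
    """
    inputs = layers.Input(shape=input_shape)

    # Conv front end: shortens the sequence before the recurrent layers.
    x = layers.Conv1D(conv_filters, kernel_size=15, strides=4)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)

    # Stacked GRU blocks; each one returns the full sequence so the next
    # layer (or the per-step output) sees every time step.
    for _ in range(n_gru_layers):
        x = layers.GRU(units, return_sequences=True)(x)
        x = layers.Dropout(dropout)(x)
        x = layers.BatchNormalization()(x)

    # Extra dropout after the last batch normalization.
    x = layers.Dropout(dropout)(x)

    # One sigmoid output per remaining time step.
    outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    return models.Model(inputs, outputs)


if __name__ == "__main__":
    for n in (1, 2):
        model = build_model(n_gru_layers=n)
        print(n, "GRU layer(s):", model.count_params(), "parameters")
```

Train each variant with the same data and settings, then compare the validation accuracy to see whether the second layer is worth its cost for your case.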