I finished the trigger_word_detection assignment, but I’d love some insight into why you designed the model the way you did. Here are some specific questions:
- Dropout 0.8: if I'm reading `tf.keras.layers.Dropout` correctly, this drops out 80% of the input units, right? That seems really high. Why would you design it that way? (I ran a quick check; see the first snippet after this list.)
- In the diagram, the model is drawn as a vertical stack of layers at each time step. But conceptually, only the GRU layers "feed" context from one time step to the next, so you could also have drawn every other layer horizontally, taking the full-width input at once, right? (Kind of like the Conv layer, though even that could be drawn with a single input, since its window is local.) This is just to check my understanding; there's obviously nothing wrong with the drawing. See the sketch after this list for how I'm picturing it.
- Conv1D: why only one layer? Multiple conv layers seem like a natural fit, since convolutions are designed for exactly this kind of local pattern detection. And could you have built a trigger word system mostly out of 2D convolutions, treating the spectrogram as an image?
- Why multiple layers of GRU?
- Why did you choose Dropout for this problem, when we usually don’t?
- Why batch norm here? And as a general rule, should we consider adding normalization and Dropout when designing new networks?
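To make the Dropout question concrete, here's the quick check I ran (a minimal sketch, assuming the standard Keras `Dropout` layer; `training=True` forces dropout on outside of `fit`):

```python
import numpy as np
import tensorflow as tf

# rate=0.8 zeroes ~80% of the inputs during training; the survivors
# are scaled by 1/(1 - rate) so the expected activation is unchanged.
layer = tf.keras.layers.Dropout(rate=0.8)
x = np.ones((1, 10000), dtype="float32")
y = layer(x, training=True)

print(float(tf.reduce_mean(tf.cast(y == 0.0, tf.float32))))  # ~0.8 zeroed
print(float(tf.reduce_max(y)))                               # 5.0 == 1/(1 - 0.8)
```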
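And for the diagram question, here's the architecture as I'm picturing it, written out in Keras (a sketch from memory of the notebook, so the exact filter and unit counts may be off). Every layer except the GRUs either works per time step or only sees a local window; only the GRUs carry state across the whole sequence:

```python
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv1D,
                                     Dense, Dropout, GRU, Input, TimeDistributed)
from tensorflow.keras.models import Model

Tx, n_freq = 5511, 101  # spectrogram time steps and frequency bins, as I recall

X_input = Input(shape=(Tx, n_freq))
# Conv1D sees a local 15-step window but keeps no state across the sequence
X = Conv1D(filters=196, kernel_size=15, strides=4)(X_input)
X = BatchNormalization()(X)   # normalizes per position, no cross-step memory
X = Activation("relu")(X)
X = Dropout(0.8)(X)

# Only the GRUs pass information from one time step to the next
X = GRU(128, return_sequences=True)(X)
X = Dropout(0.8)(X)
X = BatchNormalization()(X)
X = GRU(128, return_sequences=True)(X)
X = Dropout(0.8)(X)
X = BatchNormalization()(X)
X = Dropout(0.8)(X)

# The Dense head is applied independently at every time step
outputs = TimeDistributed(Dense(1, activation="sigmoid"))(X)
model = Model(inputs=X_input, outputs=outputs)
```

If that's right, then the vertical stacks in the picture are really just a drawing choice, and the recurrence lives only in the GRU arrows.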
We're getting near the end of the course and I'd like to start building my own networks. I would never have come close to this design on my own, so any insight into the reasoning would really help guide me.
Thank you.