W3A2 Trigger_word_detection_v2a

What is the intuition of the model (figure 3) used in this assignment? Specifically, why are two GRU layers used instead of one?

Hi @Fusseldieb

This question probably belongs in another course (there is no trigger word detection in the NLP Specialization), but I can try to answer it because the reasoning is the same for all RNNs (the application does not matter).

In general, you stack RNNs (GRUs in your case) when the input patterns are complex and you want the network to capture that complexity. A single layer might not be enough to capture the complexity of the sequence.

The sequence information is the reason another GRU is used rather than a Linear (Dense) layer. It is somewhat similar to using a deep neural network (stacked Dense layers) on a single example in other applications (like predicting an MNIST digit) to capture a hierarchy of features. Likewise with multiple RNN layers: you try to capture hierarchical features, but with the sequence in mind, where a previous input can influence the representation of the next input in the sequence.

The last layer is usually a Linear (Dense) layer for application (classification/regression) purposes.
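To make the stacking concrete, here is a minimal NumPy sketch of two GRU layers feeding a per-timestep Dense+sigmoid head. This is not the assignment's actual Keras model: the weights are random and untrained, and the sequence length and feature count are made-up placeholders. It only shows how the second GRU consumes the first GRU's hidden states ("features of features") and how the final layer produces one probability per time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_layer(xs, hidden, rng):
    """Run one GRU layer over a sequence; return the hidden state at every step."""
    n_in = xs.shape[1]
    # Randomly initialised weights, for illustration only (no training here).
    Wz, Wr, Wh = (0.1 * rng.standard_normal((hidden, n_in)) for _ in range(3))
    Uz, Ur, Uh = (0.1 * rng.standard_normal((hidden, hidden)) for _ in range(3))
    h = np.zeros(hidden)
    out = []
    for x in xs:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, n_features = 20, 8                       # placeholder sequence length and feature size
spectrogram = rng.standard_normal((T, n_features))

h1 = gru_layer(spectrogram, hidden=16, rng=rng)  # first GRU: low-level sequence features
h2 = gru_layer(h1, hidden=16, rng=rng)           # second GRU: features of those features
probs = sigmoid(h2 @ (0.1 * rng.standard_normal(16)))  # Dense+sigmoid per time step

print(h1.shape, h2.shape, probs.shape)  # (20, 16) (20, 16) (20,)
```

In a framework like Keras this corresponds to setting `return_sequences=True` on the first GRU so the second GRU (and the time-distributed Dense head) receives the full sequence of hidden states, not just the last one.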




Thanks for the answer. It’s true, the topic should have been “Deep Learning Specialisation” not “NLP Specialisation”.

It seems that, just as with deep NNs, the number of layers is chosen somewhat at random. Is it always trial and error? That doesn't sound very scientific!


Hi @Fusseldieb

It’s not a totally random approach; it’s a somewhat systematic search (science), and it depends on the problem at hand and your experience (art).

The art part is that you have already tried different numbers of layers and nodes on similar datasets, so you have a “feeling” for how many layers, what type, etc., you need in order to start iterating.

The scientific part is that there are different hyperparameter optimization techniques.

For example, one simple approach for illustration:

  • just take the mean (or mode) of the labels (the predicted variable) and predict that: the dumbest model possible. Measure every further model against that baseline (on accuracy or whatever metric is most important);
  • then try the simplest possible configuration, a linear regression (one layer, with the number of nodes equal to the number of features), and check how much you improved;
  • then start increasing the number of layers (and nodes) and check the rate of improvement; the increase in layers can be exponential (e.g. 2, 4, 8, 16) so you find the upper limit early and don’t waste compute. The complexity of your data might indicate whether you need a deeper network or a wider one.
    This step (iterating on different architectures) is not very straightforward (the reason for “might” in the previous sentence), because sometimes there are jumps (non-linear increases) in performance when you train for “long enough”.
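The first two rungs of that ladder can be sketched with NumPy alone (the data here is synthetic and the setup is invented purely for illustration): a mean-predictor baseline, then a least-squares linear model, comparing the error of each.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))                       # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(200)

# Step 1: dumbest possible model -- always predict the mean of the labels.
baseline_mse = np.mean((y - y.mean()) ** 2)

# Step 2: simplest real model -- linear regression via least squares.
Xb = np.hstack([X, np.ones((200, 1))])                  # add a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
linear_mse = np.mean((y - Xb @ w) ** 2)

# Every later architecture (more layers, more nodes) must justify its extra
# complexity by beating the previous rung on the same metric.
print(linear_mse < baseline_mse)
```

From here the doubling schedule (2, 4, 8, 16 layers) would follow the same pattern: train each candidate, record the metric, and stop widening or deepening once the rate of improvement flattens.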

Of course, the approaches differ for tabular data vs. vision vs. natural language, etc. For example, from experience you might know that you would use XGBoost for tabular data, CNNs for vision, and Transformers for natural language. And of course research/science vs. production/commerce is also a big factor in what you might try.

So, it’s a combination of a scientific approach and a “creative” (or random) one.



Just noticed this thread and you’re right that it is about DLS Course 5 Week 3, so I moved the thread category by using the little “edit pencil” on the title.