I’m slightly confused by the trigger word detection model in week 3. Based on the instructions, it seems the model should be able to detect the trigger word immediately after it is said, which is the main reason it uses a unidirectional rather than a bidirectional RNN – “If we used a bidirectional RNN, we would have to wait for the whole 10sec of audio to be recorded before we could tell if “activate” was said in the first second of the audio clip”. However, isn’t it the case that we have to wait for the whole 10 seconds of audio to finish before we can pass it into the 1D CONV layer? If so, how can the trigger word be detected immediately after it is said?
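To make my confusion concrete, here is a tiny NumPy sketch (my own simplification, not the actual course code – the kernel size and input length are made up) of how I understand the CONV layer. It seems to show that each output of a 1-D convolution depends only on a local window of past frames, so in principle it could be computed in a streaming fashion – which is exactly why I don’t see where the need to wait 10 seconds comes from:

```python
import numpy as np

def conv1d_step(frames, kernel):
    # One output timestep of a "valid" 1-D convolution:
    # it depends only on the most recent len(kernel) input frames.
    return float(np.dot(frames, kernel))

rng = np.random.default_rng(0)
kernel = rng.normal(size=5)    # illustrative kernel_size = 5
audio = rng.normal(size=100)   # stand-in for spectrogram frames

# Batch version: convolve the whole clip at once
# (np.convolve with a flipped kernel == cross-correlation).
batch_out = np.convolve(audio, kernel[::-1], mode="valid")

# Streaming version: emit each output as soon as 5 frames have arrived.
stream_out = []
for t in range(len(kernel) - 1, len(audio)):
    window = audio[t - len(kernel) + 1 : t + 1]
    stream_out.append(conv1d_step(window, kernel))

# The two computations agree, so the conv layer does not
# inherently need the full clip before producing early outputs.
assert np.allclose(batch_out, stream_out)
```

So my question is really whether the 10-second input is just a training-time convenience, with deployment working on this kind of sliding window instead.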
Thank you in advance!