Hi, I am working on a part of the lab and I am having trouble understanding why 50 values of $y^{\langle t \rangle}$ are set to 1 after the trigger word is finished, as opposed to while the trigger word is being said. Any help is much appreciated!
Suppose the synthesized “activate” clip ends at the 5 second mark in the 10 second audio - exactly halfway into the clip.
Recall that $T_y = 1375$, so timestep $687 =$ int(1375*0.5) corresponds to the moment 5 seconds into the audio clip.
Set $y^{\langle 688 \rangle} = 1$.
We will allow the GRU to detect “activate” anywhere within a short time interval after this moment, so we actually set 50 consecutive values of the label $y^{\langle t \rangle}$ to 1.
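To make the labeling concrete, here is a minimal sketch of the step described above. It is not the lab's own code: the function name `label_after_end` and the constants `CLIP_SECONDS` and `N_ONES` are hypothetical names chosen for illustration, and it assumes the labels are a 1-D array of length $T_y$.

```python
import numpy as np

Ty = 1375            # number of output timesteps (from the lab text)
CLIP_SECONDS = 10.0  # clip length in seconds (from the lab text)
N_ONES = 50          # number of consecutive labels set to 1

def label_after_end(y, end_seconds):
    """Set the N_ONES labels that follow the end of the trigger word to 1.

    Hypothetical helper, not the lab's function.
    """
    end_step = int(Ty * end_seconds / CLIP_SECONDS)  # 687 for a word ending at 5 s
    start = end_step + 1                             # first label AFTER the word ends
    stop = min(start + N_ONES, Ty)                   # do not run past the last timestep
    y[start:stop] = 1
    return y

y = label_after_end(np.zeros(Ty, dtype=int), end_seconds=5.0)
ones = np.flatnonzero(y)
print(ones[0], ones[-1], y.sum())
```

For a word ending at the 5 second mark, the ones cover timesteps 688 through 737, all of them after the word has finished. If the word ends close to the end of the clip, the `min` keeps the slice inside the array, so fewer than 50 labels are set.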
We only have information up to the current timestep. While a word is still being spoken, the model cannot yet be sure that it will turn out to be the trigger word, so we have to wait until the whole word has been processed before deciding whether it is a trigger word.
Thank you for the response, but I am still not understanding. Since we know the end time of the word, wouldn’t it make sense to set the 50 values before the end time to 1 instead of the 50 values after it?
It’s impossible to know the future regarding what word will be said as long as we restrict ourselves to information up to the current timestep.
Another way to think about it is that RNNs can only summarize / process information up to the current timestep, so the output at a given timestep cannot depend on audio that arrives later.
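A small experiment may help here. This is only a sketch: it uses a plain tanh recurrence instead of the lab's GRU, and the sizes and random weights are made up for illustration. It runs the same network on two inputs that differ only from timestep 6 onward and compares the hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, T = 3, 4, 10                 # hypothetical sizes, not the lab's
Wx = rng.normal(size=(n_h, n_x))       # input-to-hidden weights
Wh = rng.normal(size=(n_h, n_h))       # hidden-to-hidden weights

def run_rnn(x):
    """Plain unidirectional tanh recurrence; returns the hidden state at every timestep."""
    h = np.zeros(n_h)
    states = []
    for x_t in x:                      # each step sees only the current input and the past state
        h = np.tanh(Wx @ x_t + Wh @ h)
        states.append(h)
    return np.stack(states)

x = rng.normal(size=(T, n_x))
x_changed = x.copy()
x_changed[6:] = rng.normal(size=(T - 6, n_x))   # alter only the "future" inputs

h_a, h_b = run_rnn(x), run_rnn(x_changed)
same_past = np.allclose(h_a[:6], h_b[:6])       # states before the change
same_future = np.allclose(h_a[6:], h_b[6:])     # states from the change onward
print(same_past, same_future)
```

The states before timestep 6 are identical in both runs, because nothing the network has seen so far differs. This is why a label placed while the word is still being spoken would ask the model to react to audio it has not received yet.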
If the above explanations aren’t clear, please go through the sequence models lectures from the start. It’ll clarify the difference between predicting the future and processing till the current timestep.