C5W3: Is real-time detection really possible?

In ‘Assignment 2: Trigger Word Detection’, I noticed the following statement:

However, the trigger word detection system that we’ve trained in this assignment can only take 10-second audio clips as input, which means we can only get the input and output every 10 seconds anyway.

And only then can we figure out when we said the trigger word (i.e., where the ones are in the output).

So how can we “detect the trigger word almost immediately after it is said”, as the statement above implies?

Hi wong-1994,

If you listen to one of the audio snippets, you will notice that there can be a few seconds of other sound after the trigger word is said. So it does make a difference whether the system responds immediately or not.

But the statement can also be taken as a more general explanation of why in a case such as this, a unidirectional RNN makes more sense than a bidirectional RNN.
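To make that point concrete, here is a minimal numpy sketch (with hypothetical toy weights, not the assignment’s trained model) of why direction matters: a unidirectional RNN can emit an output at each timestep using only past inputs, while a bidirectional RNN needs the whole clip before it can produce anything.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny RNN weights, purely for illustration.
Wx = rng.normal(size=(4, 3))
Wh = rng.normal(size=(4, 4))

def uni_rnn_step(h, x):
    """One forward step: depends only on the past state and the current input."""
    return np.tanh(Wx @ x + Wh @ h)

x_stream = rng.normal(size=(10, 3))   # 10 timesteps arriving one at a time
h = np.zeros(4)
outputs = []
for x_t in x_stream:                  # outputs can be emitted as audio streams in
    h = uni_rnn_step(h, x_t)
    outputs.append(h)

# A bidirectional RNN's backward pass at time t would need x_stream[t:],
# so no output could be emitted until the whole clip has been recorded.
```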

I hope this clarifies.


Hi Bosch,

Thanks for replying, but I’m still confused.

If you’re talking about the chime sound, I did hear that. But still, the chime sound was manually added after the output was generated.

To my understanding, the pipeline is:

  1. Get the 10-second audio window;
  2. Feed the audio into the trigger word detection system;
  3. Get the output;
  4. Find where the “ones” are located, then add the chime.
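The batch pipeline above could be sketched like this (the `detect` function is a hypothetical stand-in for the trained model, not the assignment’s exact code; I’m using Ty = 1375 output timesteps as in the assignment):

```python
import numpy as np

CLIP_SECONDS = 10
TY = 1375  # number of model output timesteps per 10-second clip

def detect(audio_clip):
    """Hypothetical stand-in for the trained model: returns one
    trigger-word probability per output timestep."""
    probs = np.zeros(TY)
    probs[700:750] = 0.9   # pretend the trigger word ended around second ~5
    return probs

audio_clip = np.zeros(441000)            # step 1: a full 10-second clip at 44.1 kHz
probs = detect(audio_clip)               # steps 2-3: run the model, get the output
ones = np.where(probs > 0.5)[0]          # step 4: find where the "ones" are...
chime_time = ones[0] * CLIP_SECONDS / TY # ...and convert to seconds for the chime
```

The point of the sketch is that `chime_time` is only known after the whole clip has been recorded and processed.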

But in the real world, suppose the device starts listening at second 0 and I say “activate” at second 5. The device has to wait until second 10 to get an input of the right shape.

And although it can later determine when I said “activate” during those 10 seconds, it’s just impossible to go back in time and play a chime at second 5.

So in the real world, how can a device react right after the trigger word?

Hi wong-1994,

You are right that this is not implemented in the assignment. You can have a look here to see an example of how to extend this to real-time trigger word detection.
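For reference, one common way to get near-real-time behavior (a sketch under my own assumptions, not necessarily what the linked example does) is a sliding window: keep a rolling buffer of the last 10 seconds, re-run the model every fraction of a second as new audio arrives, and only inspect the output steps corresponding to the newest audio. The reaction delay then becomes the hop size plus inference time, not 10 seconds. The `detect` function below is again a hypothetical stand-in for the trained model.

```python
import numpy as np

FS = 44100          # sample rate
WINDOW = 10 * FS    # the model still sees a full 10-second window
HOP = FS // 2       # slide forward every 0.5 s of new audio
TY = 1375           # model output timesteps per window

def detect(window):
    """Hypothetical stand-in for the trained model."""
    probs = np.zeros(TY)
    if window.max() > 0:     # pretend any loud audio is the trigger word
        probs[-10:] = 0.9
    return probs

buffer = np.zeros(WINDOW)    # rolling 10-second buffer, silence-padded at start
detections = []
for step in range(8):        # simulate 4 s of incoming audio, 0.5 s at a time
    new_audio = np.ones(HOP) if step == 5 else np.zeros(HOP)
    buffer = np.concatenate([buffer[HOP:], new_audio])  # drop oldest, append newest
    probs = detect(buffer)
    # Only look at the output steps covering the newest hop of audio, so a
    # trigger is reported ~0.5 s after it is said rather than 10 s later.
    if probs[-(TY * HOP // WINDOW):].max() > 0.5:
        detections.append(step)
```

The trade-off is running inference every hop instead of once per clip; smaller hops mean faster reactions but more compute.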