I don’t understand how a Trigger Word Detection algorithm can detect in real time. The algorithm we implemented in the programming assignment takes a 10-second audio clip as input. I know that with the unidirectional architecture, the earlier part of the model can make its predictions without seeing the later input. But how should we adapt the real-time input to make real-time predictions?
I think this would work:
Buffer the most recent 10 seconds of audio.
As each new sample arrives…
- Discard the oldest sample and add the new one.
- Compute the spectrogram.
- Pass the spectrogram through the model.
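
Here is a rough sketch of that loop in Python, in case it helps. It is not the assignment's code: `SlidingAudioBuffer`, `compute_spectrogram`, and `stream_detect` are names I made up, the sample rate and frame sizes are assumptions, and the spectrogram is a plain NumPy stand-in for whatever the assignment uses. It also pushes small chunks of samples rather than one sample at a time, since rerunning the model on every single sample would probably be too slow. `model` is any callable that maps a spectrogram to per-time-step probabilities.

```python
import numpy as np

SAMPLE_RATE = 44100                            # assumed sample rate in Hz
WINDOW_SAMPLES = SAMPLE_RATE * 10              # 10 seconds, the clip length the model expects


class SlidingAudioBuffer:
    """Keeps only the most recent `size` audio samples (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        self.data = np.zeros(size, dtype=np.float32)   # starts as silence

    def push(self, chunk):
        # Append the new samples, then keep the newest `size`,
        # which discards the oldest ones.
        chunk = np.asarray(chunk, dtype=np.float32)
        self.data = np.concatenate([self.data, chunk])[-self.size:]
        return self.data


def compute_spectrogram(samples, frame=200, hop=80):
    """Stand-in magnitude spectrogram; frame and hop sizes are assumptions."""
    n_frames = 1 + (len(samples) - frame) // hop
    frames = np.stack([samples[i * hop:i * hop + frame] for i in range(n_frames)])
    # Window each frame and take the magnitude of its real FFT.
    return np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))


def stream_detect(chunks, model, buffer, threshold=0.5):
    """Yield True/False after each incoming chunk of audio."""
    for chunk in chunks:
        window = buffer.push(chunk)            # discard oldest, add newest
        spec = compute_spectrogram(window)     # spectrogram of the last 10 s
        probs = model(spec)                    # per-time-step probabilities
        # Only the last output step is checked here, on the assumption
        # that it corresponds to the newest audio in the buffer.
        yield bool(probs[-1] > threshold)
```

You would create the buffer once with `SlidingAudioBuffer(WINDOW_SAMPLES)` and feed `stream_detect` whatever chunks your microphone callback delivers. Larger chunks mean fewer model calls but a longer delay before a detection shows up.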