Real-time Trigger Word Detection

I don’t understand how a trigger word detection algorithm can make detections in real time. The algorithm we implemented in the programming assignment takes a 10-second audio clip as input. I know that with the unidirectional architecture, the earlier time steps of the model can make predictions without seeing later input. But how should we adapt it to a live audio stream to make real-time predictions?

I think this would work:
Buffer the most recent 10 seconds of audio.
As each new sample arrives…

  • Discard the oldest sample and add the new one.
  • Compute the spectrogram.
  • Pass the spectrogram through the model.
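
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the assignment's code: `stream_predictions`, `featurize`, and `predict` are hypothetical names, the 44100 Hz default sample rate is an assumption, and the audio is taken in small chunks rather than one sample at a time (a chunk of length 1 gives the per-sample behaviour described above).

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Sequence


def stream_predictions(
    chunks: Iterable[Sequence[float]],
    featurize: Callable[[list], object],   # hypothetical: audio window -> spectrogram
    predict: Callable[[object], object],   # hypothetical: spectrogram -> model output
    sample_rate: int = 44100,              # assumed sample rate, adjust to your data
    window_seconds: float = 10.0,          # the model expects a 10-second clip
) -> Iterator[object]:
    """Yield one model output per incoming chunk of audio samples."""
    window_len = int(sample_rate * window_seconds)

    # Start with a window of silence so the model always sees a full clip.
    buffer = deque([0.0] * window_len, maxlen=window_len)

    for chunk in chunks:
        # A deque with maxlen drops the oldest samples automatically
        # as new ones are appended on the right.
        buffer.extend(chunk)

        # Recompute the features over the whole window and run the model.
        features = featurize(list(buffer))
        yield predict(features)
```

In use, `featurize` would wrap whatever spectrogram function you already have and `predict` would wrap the trained model. Since the whole 10-second window is reprocessed on every step, running this once per chunk (say, every fraction of a second) is likely far cheaper than once per sample, and you would probably only look at the model outputs for the most recent time steps to avoid reporting the same trigger twice.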