Trigger word detection on streamed data

Let's suppose I want to find trigger words in streamed data.

How does a production system slide through the data?

The model clearly has a defined window, e.g. we take 10 s of audio, compute a spectrogram, …
and output a vector of labels. But what strategies are used for moving on to the next window (10 s)?

I can basically see two options:
A) don't care, just start from scratch on the next window
B) since the trigger word might actually sit at the window boundary, and the outputs from both the current window and the next one could then have detection problems, slide the window with a small overlap (overlap length ≈ trigger word length); a sketch follows this list
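To make option B concrete, here is a minimal sketch in Python. The sample rate, window and overlap lengths, and the `model`/`spectrogram` calls are placeholder assumptions of mine, not anything from the course:

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second (assumed)
WINDOW_S = 10.0        # model's fixed window length in seconds
OVERLAP_S = 1.0        # ~ trigger word length (strategy B); 0.0 gives strategy A

WINDOW = int(WINDOW_S * SAMPLE_RATE)
HOP = int((WINDOW_S - OVERLAP_S) * SAMPLE_RATE)  # how far we advance each step

def windows(stream: np.ndarray):
    """Yield successive model-sized windows from a 1-D audio buffer."""
    for start in range(0, len(stream) - WINDOW + 1, HOP):
        yield start, stream[start:start + WINDOW]

# Usage (hypothetical model and spectrogram functions):
# for start, window in windows(audio):
#     labels = model.predict(spectrogram(window))
```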

What other approaches are used?

I can imagine a slight refinement to option B: during the run over the current window we save the RNN activations that would serve as the previous-time-step activations for the RNN cell at time t = T_{overlap}, and in the next window's run we don't start from zeros as the previous activations but use the saved ones, since t = 0 of the next window now corresponds to time T_{overlap} of the previous one (I hope that description makes sense).
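Something like this, assuming a Keras GRU (where the output at step t equals the hidden state at step t); the layer size, window length and hop are illustrative values of mine, not the course model:

```python
import tensorflow as tf

# Sketch of the state-carrying refinement. Sizes below are illustrative.
T, F, HOP = 100, 101, 80      # steps per window, features, hop between windows
gru = tf.keras.layers.GRU(128, return_sequences=True, return_state=True)

stream = tf.random.normal((1, 500, F))   # stand-in for real spectrogram frames
state = None                             # zeros for the very first window

for start in range(0, stream.shape[1] - T + 1, HOP):
    window = stream[:, start:start + T, :]
    outputs, _ = gru(window, initial_state=state)
    # labels = classifier(outputs)       # hypothetical per-step label head
    # Save the activations at the step just before the next window begins, so
    # the next run's t=0 continues from the true previous state, not zeros.
    state = outputs[:, HOP - 1, :]
```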

But that is about all I can think of for now. Any other possibilities?

With regards
Andrzej

Hi Andrzej,

This issue was discussed in this thread.

Thank you Reinoud for pointing me to this thread.

The context of my question might be a bit different. I’m asking about strategies used in production systems.

The example you pointed to slides the whole 10 s window by 0.5 s, basically using the whole 9.5 s as a "run-up". That is pretty inefficient computationally, but I fully understand that the person wanted to make use of the model trained in the course and get "real-time" answers. It is a form of the "B" strategy, but with a very large overlap and without any attempt to reuse the previously computed internal state for the overlapping part.
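A back-of-the-envelope comparison of the redundancy (illustrative numbers only, assuming the 10 s window):

```python
window = 10.0                       # model window in seconds
for overlap in (9.5, 1.0):          # that thread's run-up vs. ~trigger-word overlap
    hop = window - overlap
    print(f"overlap={overlap}s: each second of audio processed ~{window / hop:.1f}x")
# overlap=9.5s: each second of audio processed ~20.0x
# overlap=1.0s: each second of audio processed ~1.1x
```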

Is a simple sliding window with some overlap what is used in production? If so, what is the typical "run-up" overlap? Are there any other options?

I’m asking because my target environment is somewhat restricted in computational resources, and I doubt that the smart people working in this DL field just throw more CPU/GPU power at the problem without some clever tricks attached :slight_smile:. On the other hand, that might be the kind of "black magic" that companies would prefer not to share :wink:.

With regards
Andrzej

Hi Andrzej,

If you want something more state-of-the-art, you could look at the use of streaming transformers. You can find a discussion of that approach here.

Thank you Reinoud, I will look at that.