Trigger word detection on streamed data

Let's suppose I want to find trigger words in streamed data.

How does a production system slide through the data?

The model clearly has a defined window, e.g. we take 10 s of audio, compute a spectrogram, …
and output a vector of labels. But what strategies are used for moving on to the next window (10 s)?

I can basically see two options:
A) don't care, just start from scratch on the next window
B) since the trigger word might actually sit at the window boundary, and the outputs from both the current window and the next one could then have detection problems, slide the window with a small overlap (overlap length ≈ trigger word length); a sketch follows this list
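To make option B concrete, here is a minimal sketch in Python. The sample rate, window and overlap lengths, and the `model`/`spectrogram` calls are placeholder assumptions of mine, not anything from the course:

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second (assumed)
WINDOW_S = 10.0        # model's fixed window length in seconds
OVERLAP_S = 1.0        # ~ trigger word length (strategy B); 0.0 gives strategy A

WINDOW = int(WINDOW_S * SAMPLE_RATE)
HOP = int((WINDOW_S - OVERLAP_S) * SAMPLE_RATE)  # how far we advance each step

def windows(stream: np.ndarray):
    """Yield successive model-sized windows from a 1-D audio buffer."""
    for start in range(0, len(stream) - WINDOW + 1, HOP):
        yield start, stream[start:start + WINDOW]

# Usage (hypothetical model and spectrogram functions):
# for start, window in windows(audio):
#     labels = model.predict(spectrogram(window))
```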

What other approaches are used?

I can imagine a slight refinement to option B: during the run over the current window we save the RNN activations that would serve as the previous-time-step activations for the RNN cell at time t = T_{overlap}, and in the next window's run we don't start from zeros as the previous activations but use the saved ones, since t = 0 of the next window now corresponds to time T_{overlap} of the previous one (I hope that description makes sense).
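Something like this, assuming a Keras GRU (where the output at step t equals the hidden state at step t); the layer size, window length and hop are illustrative values of mine, not the course model:

```python
import tensorflow as tf

# Sketch of the state-carrying refinement. Sizes below are illustrative.
T, F, HOP = 100, 101, 80      # steps per window, features, hop between windows
gru = tf.keras.layers.GRU(128, return_sequences=True, return_state=True)

stream = tf.random.normal((1, 500, F))   # stand-in for real spectrogram frames
state = None                             # zeros for the very first window

for start in range(0, stream.shape[1] - T + 1, HOP):
    window = stream[:, start:start + T, :]
    outputs, _ = gru(window, initial_state=state)
    # labels = classifier(outputs)       # hypothetical per-step label head
    # Save the activations at the step just before the next window begins, so
    # the next run's t=0 continues from the true previous state, not zeros.
    state = outputs[:, HOP - 1, :]
```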

But that is about all I can think of for now. Any other possibilities?

With regards
Andrzej

Hi Andrzej,

This issue was discussed in this thread.

Thank you Reinoud for pointing me to this thread.

The context of my question might be a bit different. I’m asking about strategies used in production systems.

The example you pointed to slides the whole 10 s window by 0.5 s, basically using the whole 9.5 s as a "run-up". That is pretty inefficient computationally, but I fully understand that the person wanted to make use of the model trained in the course and get "real-time" answers. It is a form of the "B" strategy, but with a very large overlap and without any attempt to reuse the previously computed internal state for the overlapping part.
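A back-of-the-envelope comparison of the redundancy (illustrative numbers only, assuming the 10 s window):

```python
window = 10.0                       # model window in seconds
for overlap in (9.5, 1.0):          # that thread's run-up vs. ~trigger-word overlap
    hop = window - overlap
    print(f"overlap={overlap}s: each second of audio processed ~{window / hop:.1f}x")
# overlap=9.5s: each second of audio processed ~20.0x
# overlap=1.0s: each second of audio processed ~1.1x
```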

Is a simple sliding window with some overlap what is used in production? If so, what is the typical "run-up" overlap? Are there any other options?

I’m asking because my target environment is somewhat restricted in computational resources, and I doubt that the smart people working in this DL field just throw more CPU/GPU power at the problem without some clever tricks attached :slight_smile:. On the other hand, that might be the kind of "black magic" that companies would prefer not to share :wink:.

With regards
Andrzej

Hi Andrzej,

If you want something more state-of-the-art, you could look at the use of streaming transformers. You can find a discussion of that approach here.

Thank you Reinoud, I will look at that.