A follow up question about Trigger Word Detection

Here is a follow-up to an earlier post about trigger word detection.

I want to know the implementation details needed to apply the model in a production environment. In the project, the model is made up of a Conv1D layer, followed by 2 GRU layers, followed by a Dense layer wrapped in TimeDistributed. One answer mentioned using a buffer, but how is that implemented in TensorFlow so that all the real-time data read into the model is treated as a single sample? As I understand it, the GRU state is only kept while reading a sequence from a single sample; if a new sample is read, the internal state is reset and the previous memory is lost.
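For example, I experimented with stateful=True, which seems to keep the GRU state between successive calls on the same stream (this is my own sketch with made-up shapes, not the course model):

```python
import numpy as np
import tensorflow as tf

# Sketch of a streaming inference model (shapes are assumptions, not
# the course's hyperparameters). With stateful=True, the GRU keeps its
# hidden state between calls, so consecutive chunks of one audio stream
# are treated as one long sequence.
n_features = 8  # assumption: spectrogram features per timestep

model = tf.keras.Sequential([
    tf.keras.Input(batch_shape=(1, None, n_features)),  # stateful needs a fixed batch size
    tf.keras.layers.GRU(16, return_sequences=True, stateful=True),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Feed consecutive chunks of the same stream; state carries over.
chunk1 = np.random.randn(1, 5, n_features).astype("float32")
chunk2 = np.random.randn(1, 5, n_features).astype("float32")
out1 = model(chunk1)
out2 = model(chunk2)  # continues from chunk1's final hidden state

model.reset_states()  # call this when a new stream/sample begins
```

Is this roughly what the buffer answer had in mind?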

Update:
It seems I don't see the value provided by the TimeDistributed wrapper. I experimented with the following sample code, and the outputs are exactly the same.

import numpy as np
import tensorflow as tf

inputs = tf.random.normal([10, 3, 2])
dense = tf.keras.layers.Dense(5)
outputs = dense(inputs)
print(outputs)

# inputs = tf.random.normal([10,8,6])
outputs_td = tf.keras.layers.TimeDistributed(dense)(inputs)
print(outputs_td)

# Same result either way: Dense already maps over the time axis.
print(np.allclose(outputs.numpy(), outputs_td.numpy()))  # True

You don’t need TimeDistributed in TensorFlow 2.x, since Dense is applied along the last axis automatically.
@jonaslalin first made this observation.


Your understanding of RNN layer internals is correct. The previous state is set to 0s at the start of each batch of input. A batch can contain sequences of different lengths, which is why padding is used to make the tensor rectangular.
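A quick way to see this (toy shapes, my own example): calling a default, stateless GRU twice on the same batch gives identical outputs, because the state is re-initialized to zeros on every call:

```python
import numpy as np
import tensorflow as tf

# stateful=False is the default: no memory is carried across calls.
gru = tf.keras.layers.GRU(4, return_sequences=True)

x = np.random.randn(2, 6, 3).astype("float32")  # batch of 2 sequences

out_a = gru(x)
out_b = gru(x)  # same input again: identical output, state started from zeros both times
```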

One detail to consider is that a trigger word can overlap the boundary between windows. Depending on your implementation, it might be okay to feed overlapping windows of audio into the model to detect a trigger word. As @TMosh says, a rolling buffer should be sufficient for this purpose.
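For illustration, here is a minimal sketch of such a rolling buffer in plain Python (window and hop sizes are made up; the integers stand in for audio feature frames):

```python
from collections import deque

# Rolling buffer with overlap (sizes are illustrative).
window = 5   # timesteps fed to the model at once
hop = 2      # new timesteps per window; window - hop timesteps overlap

buffer = deque(maxlen=window)
stream = list(range(1, 11))  # stands in for incoming frames d1..d10

windows = []
for frame in stream:
    buffer.append(frame)
    # Emit a window once the buffer is full, then every `hop` frames.
    if len(buffer) == window and (frame - window) % hop == 0:
        windows.append(list(buffer))  # in practice: feed this to the model
```

With these sizes, the emitted windows are [d1..d5], [d3..d7], [d5..d9], so a trigger word spanning a window boundary still appears whole in some window.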

If I understand your overlapping method correctly: say the buffer has size 5, and audio recorded at some time fills it with the time-sequence data [d1, d2, d3, d4, d5] for the model to consume. The next time, new data is fed into the buffer as [d3, d4, d5, d6, d7] rather than starting from d6. So even though the memory of d1, d2 is lost, the model still has memory of d3, d4, d5 while reading d6 (even though that memory is rebuilt from a fresh feed of old data). And if someone says "activate" and it is recorded across d4, d5, d6, we can still be alerted, right?

Hey @balaji.ambresh, I have another question about the unidirectional RNN in the notebook. It says:

* Note that we use a unidirectional RNN rather than a bidirectional RNN. 
* This is really important for trigger word detection, since we want to be able to detect the trigger word almost immediately after it is said. 
* If we used a bidirectional RNN, we would have to wait for the whole 10sec of audio to be recorded before we could tell if "activate" was said in the first second of the audio clip.  

As I understand it, for a 10-second recording fed into the model we built, no data is fed into the first GRU layer before the Conv1D layer has finished processing the whole sequence; the first GRU layer feeds nothing into the second GRU until it has finished the sequence; and the same goes from the second GRU to the Dense+Sigmoid layer. So how can the trigger word be detected almost immediately after it is said?

The model doesn’t remember previous state across batches, since we feed 0s as the previous state at the start of each batch. It’s like asking 2 separate questions:

  1. Is there a trigger word in [d1, d2, d3, d4, d5]?
  2. Is there a trigger word in [d3, d4, d5, d6, d7]?

Bi-directional RNNs (BRNN) process inputs from both ends. Please read the following replies:

  1. Bidirectional layer for time series forecasting - #4 by balaji.ambresh
  2. Difference between BRNN vs GRU as far as scenario output is concerned - #2 by balaji.ambresh

Hope it’s now clear why a BRNN needs the entire input sequence available before making a prediction. A unidirectional RNN emits output with context from only one direction, which is what lets it detect the trigger word as soon as it’s said. A BRNN is likely to outperform a unidirectional RNN on a task like language translation, where words on either side of the current word are important for emitting a good output. Do cover the sections on transformers to get a better understanding of this concept.
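To make the one-direction point concrete, here is a toy check (my own example, illustrative sizes): a unidirectional GRU's output at step t does not change when later steps are removed, whereas a Bidirectional wrapper would need the whole sequence first:

```python
import numpy as np
import tensorflow as tf

# Causality check: a unidirectional RNN's output at step t depends
# only on inputs at steps <= t.
gru = tf.keras.layers.GRU(4, return_sequences=True)

x = np.random.randn(1, 10, 3).astype("float32")

full = gru(x)            # outputs for all 10 steps
prefix = gru(x[:, :4])   # outputs for just the first 4 steps
# full[:, :4] and prefix are identical, so predictions for early steps
# could in principle be emitted before the rest of the audio arrives.
```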

It’s like asking 2 separate questions:

  1. Is there a trigger word in [d1, d2, d3, d4, d5]?
  2. Is there a trigger word in [d3, d4, d5, d6, d7]?

Nope, that is exactly my understanding of your overlapping window:

One detail to consider is that it’s possible to detect a trigger word overlapping boundaries of a window.

This would still detect a trigger word that starts at d3 and ends at d5. If the second sequence were instead [d6, d7, d8, …], the memory of the trigger word would be lost, since the new sequence resets the GRU state.

Bi-directional RNNs (BRNN) process inputs from both ends. Please read the following replies:

But this still doesn’t explain how we can detect the trigger word almost immediately without waiting for 10 seconds.
Assume there is a 10-second recording of activate + background, and "activate" happens at the 3rd second. It’s true we could start labeling it immediately after the trigger word is detected, maybe at 3.01 seconds, but the model still processes the 10-second recording as a whole, and the model's output is a prediction vector for the whole sequence. So we still need the model to finish processing the 10-second recording before we can tell when "activate" happened. Unless the model can process the sequence like a pipeline: while the conv layer is processing the 3rd second of audio, the 2 GRU layers are already processing the 2nd second, and the 1st second has already been fed into the Dense+Sigmoid layer. But I don't think that's how the model processes time-sequence data.

I can’t find many details about pipelining across layers inside a compiled model. All we know is that if the model is run eagerly, there’s no pipelining across layers when we use Sequential, since the layers are called one after another.

You are correct about the wording in the assignment, irrespective of any optimizations in the underlying model: 10 seconds' worth of data needs to be fed into the model at a time to make a prediction. The staff have been notified to fix this.
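That said, your pipelining intuition isn't wrong in principle: a stack like the assignment's (Conv1D with valid padding followed by unidirectional GRUs) is causal, so early outputs don't mathematically depend on later audio, even though Keras feeds the whole clip at once. A toy check with illustrative sizes (not the course's exact hyperparameters):

```python
import numpy as np
import tensorflow as tf

# Toy causal stack shaped like the assignment (sizes are illustrative).
inp = tf.keras.Input(shape=(None, 8))
h = tf.keras.layers.Conv1D(16, kernel_size=15, strides=4, padding="valid")(inp)
h = tf.keras.layers.GRU(16, return_sequences=True)(h)
h = tf.keras.layers.GRU(16, return_sequences=True)(h)
out = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inp, out)

audio = np.random.randn(1, 100, 8).astype("float32")
full = model(audio)  # 22 output steps: floor((100 - 15) / 4) + 1

# Conv output step t covers input steps [4t, 4t + 15), and the GRUs are
# causal, so the first 12 outputs need only the first 59 input steps.
part = model(audio[:, :59])
# full[:, :12] equals part: early predictions don't depend on later audio.
```

So a streaming implementation could emit early predictions in principle; it's just not how the notebook's model is invoked.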
