C5_W3_A2 Question about the architecture

In Trigger_word_detection_v2a, it is mentioned that we use a unidirectional (not bidirectional) RNN so that the model can predict whether the trigger word has been said without needing to listen to the entire sequence.


However, the input fed into the convolution seems to be the whole spectrogram, meaning that we have to pass in the whole audio clip to process it. Did I misunderstand something? Thank you
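For context, here is a sketch of an assignment-style architecture: a Conv1D over the spectrogram followed by unidirectional GRUs with a per-time-step sigmoid output. This is my reconstruction, not the notebook's exact code; the layer sizes and dropout rate are assumptions based on the course materials.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

Tx, n_freq = 5511, 101  # spectrogram time steps and frequency bins (assumed)

inputs = layers.Input(shape=(Tx, n_freq))
# 1-D convolution over time: shrinks 5511 steps to 1375 and extracts local features
x = layers.Conv1D(filters=196, kernel_size=15, strides=4)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.8)(x)
# Unidirectional GRUs: each output step only depends on audio up to that step
x = layers.GRU(128, return_sequences=True)(x)
x = layers.Dropout(0.8)(x)
x = layers.BatchNormalization()(x)
x = layers.GRU(128, return_sequences=True)(x)
x = layers.Dropout(0.8)(x)
x = layers.BatchNormalization()(x)
# Per-time-step probability that the trigger word just ended
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = Model(inputs, outputs)
print(model.output_shape)  # (None, 1375, 1)
```

The key point: the convolution is applied to the whole 10-second spectrogram in one call here, but because the conv and the GRUs are both causal-friendly (the GRUs never look ahead), nothing in the architecture itself forces you to wait for the full clip.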

@orangefox I believe another part of the consideration is: given this is audio and we are trying to detect speech, what would a bi-directional RNN be doing-- reading the speech in reverse?

Unless you are possessed by the Devil, I’m not sure that is useful.

Plus, personally, I believe the idea is that you'd normally perform this sort of detection on streaming audio, adding time steps as you go along.

However, setting up TF to process real-time streaming audio is pretty complicated and requires PyAudio or something similar (as far as I am aware)-- so doing that goes beyond the scope of the course.

Thus here we process the whole audio clip at once for simplicity.
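To see why processing the whole clip at once is just a convenience and not a requirement for a unidirectional RNN, here is a toy NumPy GRU (my own minimal implementation, not the course's) showing that feeding the sequence in chunks while carrying the hidden state across gives exactly the same outputs as feeding it all at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 4, 8, 20

# Random GRU weights for a toy cell (illustration only)
Wz, Uz, bz = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
Wr, Ur, br = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
Wh, Uh, bh = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh) # candidate state
    return (1 - z) * h + z * h_tilde

def run(xs, h0):
    """Run the GRU over a sequence, returning all states and the final state."""
    h, outs = h0, []
    for x in xs:
        h = gru_step(h, x)
        outs.append(h)
    return np.array(outs), h

xs = rng.normal(size=(T, n_in))
h0 = np.zeros(n_h)

# Whole clip at once
full, _ = run(xs, h0)

# Same clip in two "streamed" chunks, carrying the hidden state across
first, h_mid = run(xs[:T // 2], h0)
second, _ = run(xs[T // 2:], h_mid)
streamed = np.concatenate([first, second])

print(np.allclose(full, streamed))  # True
```

A real streaming pipeline would also need to chunk the spectrogram/convolution with the right overlap, which is part of what makes the TF plumbing fiddly.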


Thank you for your reply!

I understand what you mean. The way it was written made me think that using a uni-directional RNN here would allow us to do that real-time processing, but then I saw that the whole input has to be fed into the convolutional step, and that got me confused.

Bi-directional RNNs let each time step use context from both earlier and later parts of the sequence, since the input gets processed in both directions. The one requirement when using a bi-directional RNN is that the entire input sequence must be available to the model.
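You can see that requirement concretely with a toy NumPy RNN (my own minimal sketch, not anything from the assignment): perturbing only the last time step changes the bidirectional output at t=0, but leaves the unidirectional output at t=0 untouched. That is exactly why a bidirectional model cannot emit early predictions before the audio ends.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, T = 3, 5, 10
W = rng.normal(size=(n_h, n_in))
U = rng.normal(size=(n_h, n_h))

def rnn(xs):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h, outs = np.zeros(n_h), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        outs.append(h)
    return np.array(outs)

def birnn(xs):
    """Bidirectional: concatenate the forward pass with a reversed backward pass."""
    fwd = rnn(xs)
    bwd = rnn(xs[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)

xs = rng.normal(size=(T, n_in))
xs2 = xs.copy()
xs2[-1] += 1.0  # perturb only the LAST time step

# Unidirectional output at t=0 is unaffected by the future input...
print(np.allclose(rnn(xs)[0], rnn(xs2)[0]))      # True
# ...but the bidirectional output at t=0 changes
print(np.allclose(birnn(xs)[0], birnn(xs2)[0]))  # False
```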


Isn’t that the case here? We have the entire 10 seconds available, and it is fed into the first layer. Thank you.

@orangefox I’ve actually spent some time looking into real-time TF audio processing for a project I’ve been planning to work on. Unfortunately, at just this moment I can’t seem to find the relevant GitHub repo that seemed good and had achieved this task. But yes, you end up using a number of libraries beyond just TF, and it is a little tricky.

I will update this post if I find it later.

Thanks!

@orangefox ok, so I found what I was looking at:

Note this all relates to (obviously) the inference, not the training step.
