C5_W3_A2 Question about the architecture

In Trigger_word_detection_v2a, it is mentioned that we use a unidirectional (not bidirectional) RNN so that the model can predict whether the trigger word has been said without needing to listen to the entire sequence.


However, the input fed into the convolution seems to be the whole spectrogram, meaning that we have to pass in the whole audio clip to process it. Did I misunderstand something? Thank you
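For context, here is a sketch of an assignment-style architecture: a Conv1D over the spectrogram followed by unidirectional GRUs with a per-time-step sigmoid output. This is my reconstruction, not the notebook's exact code; the layer sizes and dropout rate are assumptions based on the course materials.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

Tx, n_freq = 5511, 101  # spectrogram time steps and frequency bins (assumed)

inputs = layers.Input(shape=(Tx, n_freq))
# 1-D convolution over time: shrinks 5511 steps to 1375 and extracts local features
x = layers.Conv1D(filters=196, kernel_size=15, strides=4)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.8)(x)
# Unidirectional GRUs: each output step only depends on audio up to that step
x = layers.GRU(128, return_sequences=True)(x)
x = layers.Dropout(0.8)(x)
x = layers.BatchNormalization()(x)
x = layers.GRU(128, return_sequences=True)(x)
x = layers.Dropout(0.8)(x)
x = layers.BatchNormalization()(x)
# Per-time-step probability that the trigger word just ended
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = Model(inputs, outputs)
print(model.output_shape)  # (None, 1375, 1)
```

The key point: the convolution is applied to the whole 10-second spectrogram in one call here, but because the conv and the GRUs are both causal-friendly (the GRUs never look ahead), nothing in the architecture itself forces you to wait for the full clip.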

@orangefox I believe another part of the consideration is: given this is audio and we are trying to detect speech, what would a bi-directional RNN be doing-- reading the speech in reverse?

Unless you are possessed by the Devil, I’m not sure that is useful.

Plus, personally, I believe the idea is that you'd normally perform this sort of detection on streaming audio, adding time steps as you go along.

However, setting up TF to process real-time streaming audio is pretty complicated and requires PyAudio or something similar (as far as I am aware)-- so doing that goes beyond the scope of the course.

Thus here we process the whole audio clip at once for simplicity.
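To see why processing the whole clip at once is just a convenience and not a requirement for a unidirectional RNN, here is a toy NumPy GRU (my own minimal implementation, not the course's) showing that feeding the sequence in chunks while carrying the hidden state across gives exactly the same outputs as feeding it all at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 4, 8, 20

# Random GRU weights for a toy cell (illustration only)
Wz, Uz, bz = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
Wr, Ur, br = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
Wh, Uh, bh = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh) # candidate state
    return (1 - z) * h + z * h_tilde

def run(xs, h0):
    """Run the GRU over a sequence, returning all states and the final state."""
    h, outs = h0, []
    for x in xs:
        h = gru_step(h, x)
        outs.append(h)
    return np.array(outs), h

xs = rng.normal(size=(T, n_in))
h0 = np.zeros(n_h)

# Whole clip at once
full, _ = run(xs, h0)

# Same clip in two "streamed" chunks, carrying the hidden state across
first, h_mid = run(xs[:T // 2], h0)
second, _ = run(xs[T // 2:], h_mid)
streamed = np.concatenate([first, second])

print(np.allclose(full, streamed))  # True
```

A real streaming pipeline would also need to chunk the spectrogram/convolution with the right overlap, which is part of what makes the TF plumbing fiddly.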


Thank you for your reply!

I understand what you mean. The way it was written made me think that using a uni-directional RNN here would allow us to do that real-time processing, but then I saw that the whole input has to be fed into the convolutional step, and that got me confused.

Bi-directional RNNs let each time step use context from both earlier and later parts of the sequence, since the input gets processed in both directions. The one requirement when using a bi-directional RNN is that the entire input sequence must be available to the model.
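You can see that requirement concretely with a toy NumPy RNN (my own minimal sketch, not anything from the assignment): perturbing only the last time step changes the bidirectional output at t=0, but leaves the unidirectional output at t=0 untouched. That is exactly why a bidirectional model cannot emit early predictions before the audio ends.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, T = 3, 5, 10
W = rng.normal(size=(n_h, n_in))
U = rng.normal(size=(n_h, n_h))

def rnn(xs):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h, outs = np.zeros(n_h), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        outs.append(h)
    return np.array(outs)

def birnn(xs):
    """Bidirectional: concatenate the forward pass with a reversed backward pass."""
    fwd = rnn(xs)
    bwd = rnn(xs[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)

xs = rng.normal(size=(T, n_in))
xs2 = xs.copy()
xs2[-1] += 1.0  # perturb only the LAST time step

# Unidirectional output at t=0 is unaffected by the future input...
print(np.allclose(rnn(xs)[0], rnn(xs2)[0]))      # True
# ...but the bidirectional output at t=0 changes
print(np.allclose(birnn(xs)[0], birnn(xs2)[0]))  # False
```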


Isn’t that the case here? We have the entire 10 seconds available, and it is fed into the first layer. Thank you.

@orangefox I’ve actually spent some time looking into real-time TF audio processing for a project I’ve been planning to work on. Unfortunately, at just this moment I can’t seem to find the relevant GitHub repo that seemed good and had achieved this task. But yes, you end up using a number of libraries beyond just TF, and it is a little tricky.

I will update this post if I find it later.

Thanks!

@orangefox ok, so I found what I was looking at:

Note this all relates to (obviously) the inference, not the training step.
