C5W3 - Missing intuition on positive dataset marking with trigger word detection

I realize that Prof. Ng brought up this method in lecture as well, but I’m not sure I quite understand the reasoning behind it.

I mean, my naive intuition would say we'd label the Y of our training set as 1 during the time the activation or 'trigger word' is actually being spoken…

But instead we don't do that; we set the label to 1 only for the 50 steps after it.

This is not making sense to me-- does it have something to do with the sort of 'momentum' in the RNN (here GRU) memory function?

I mean, what if a completely different word is spoken in the 50 steps after?

Also, where does the '50' come from? Is it just arbitrarily chosen?

Hey there @Nevermnd

This approach helps the model keep the influence of the trigger word over a short period, making it less sensitive to the exact position of the word.

I think the choice of 50 steps is based on empirical results, balancing context capture against losing relevance.

This method improves the model's generalization and its ability to handle noise by making sure it learns from a larger temporal context.
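As a rough sketch of what that labeling looks like (illustrative names only; the Ty = 1375 and 50-step numbers are the ones discussed in this thread, and this is not the exact lab helper):

```python
import numpy as np

Ty = 1375      # output time steps for one 10-second clip (as in the assignment)
window = 50    # number of steps after the trigger word's end to label as 1

def label_after_trigger(y, segment_end_step):
    """Set `window` output steps right after the trigger word ends to 1.

    y: labels for one clip, shape (1, Ty)
    segment_end_step: output step at which the trigger word finishes
    """
    start = segment_end_step + 1
    end = min(start + window, Ty)   # don't run past the end of the clip
    y[0, start:end] = 1
    return y

y = np.zeros((1, Ty))
y = label_after_trigger(y, segment_end_step=700)
print(int(y.sum()))   # 50 ones, covering steps 701..750
```

Note that the steps during the word itself stay labeled 0; only the short window after its end is marked.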

Hope this helps!

We are talking about actual recorded audio here, right? What is the frequency of the sampling? How much wall clock time is 50 cycles of that sampling rate? Maybe that corresponds to how long it takes to recognize the first phoneme of the word? Sorry but I am away from any computer right now and can’t check the sampling rate.

Yes, recorded audio.

Sampling rate is 44.1 kHz, though this gets a little more complicated due to the transforms/downsampling that occur in the model:

Presuming we are working in terms of output time steps, 50 steps over a 10-second clip divided into 1375 parts gives an 'actual' time of ~0.36 seconds, which seems a little on the short side.
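Spelling out that arithmetic (10-second clip, 1375 output steps, 50-step window):

```python
clip_seconds = 10.0
output_steps = 1375        # Ty in the assignment
window_steps = 50

seconds_per_step = clip_seconds / output_steps
print(f"{seconds_per_step * 1000:.2f} ms per output step")        # ~7.27 ms
print(f"{window_steps * seconds_per_step:.2f} s labeled window")  # ~0.36 s
```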

I tried recording my own voice saying the trigger word used (‘attention’), and somewhere closer to .5 seconds seems more reasonable (perhaps even .47 seconds if I clip it super tightly).

Again, I should stress: if anything, the duration of the trigger itself worries/confuses me less than the fact that we 'activate' the output only after the trigger word occurs-- not at all during its actual duration.

It's like knowing the 'bullseye' is our target, but aiming our arrow off to the side instead.

Until all of the audio samples have been processed, how would the algorithm reliably know that the word is "activate" rather than "action" or "Acton, Ohio" (for example)?

@TMosh I have tried to add some notes to the slide to make this clearer.

This is seen in the lab too:

Ok, I went back and listened to both of the videos in this section. He doesn’t really go into a lot of detail, but he does clearly make the point that we want to start the labels as 1 at the end of the trigger word. The point is that we don’t really care about the trigger word itself, other than the fact that it triggered us, right? It’s then whatever comes next that Siri or Alexa needs to understand and process. We don’t need the “Hey, Siri”, we need the “tell me what time it is” or “play the Beach Boys”.

At least that’s my take just listening to the lecture at 1.25 speed and not having gone back to see what additional details are in the assignment.

As always, it never hurts to just listen to the lecture again.

Oh no Paul,

I mean, I completely agree. This is exactly what he says in the lecture, and also what he has us do in the lab (on which I already got 100%), and after all, it does seem to work.

However, it makes no logical sense to me, thus my question.

Consider the case where we make this ever so slightly more complicated: instead of one trigger word, we now have two. Further, let's let these words be independent of one another (not part of a phrase), and for simplicity's sake let's make them synonyms, so that we don't have to worry about additional classes in our Y output.

For example, let's say the two trigger words are 'start' and 'go'.

If these both appear in a 10-second clip (so two triggers) and they are spaced far enough apart, then yes, looking at the 'audio after' might still work.

However, if these words were said one right after the other, 'go start', we'd suddenly seem to have a big problem: there is no quiet audio after the first trigger word, and worse, the audio that comes immediately after it is the second trigger word.

I did have one really out-there thought-- were we using a bi-directional RNN, then the time just after the end of the phrase would, in a strange way, also signal the beginning of the phrase-- just run backwards/in reverse.

However, the GRU we implement in the lab is not bi-directional.

I would just add that, in trying to better understand what is going on here, I ran the 'optional' part of the assignment where you make/upload your own wave file.

At first I was throwing some really big challenges at it, but the results were all over the place, so on about my fifth iteration I presented it with only two words.

First 'black', then 'activate', otherwise comparative silence; this is pretty obvious to see in the spectrogram-- but something is really going wrong in the model's output.

It doesn't even seem to trigger when 'activate' is spoken (or maybe it does-- but only a little), and even afterwards, in utter silence, the probability of detection keeps rising…

Something’s not right here…

I think you're just "overthinking" this. If we have two trigger words, then either of them "triggers", so if both words appear in sequence, the second trigger simply overrides the first and you take whatever comes after that as the actual command. If you have a phrase, then the trigger has two stages: the first word gets you to stage one, and the second word only "triggers" if you're already in stage one, although we'd probably need the sophistication of an expiry on stage one.
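To make the two-stage idea concrete, here is a rough sketch of that post-processing logic (purely illustrative -- the words, timeout, and class are made up, nothing from the lab):

```python
import time

# Illustrative two-stage trigger for a phrase like "hey siri":
# the first word arms stage one, which expires after a short window.
STAGE_ONE_TIMEOUT = 1.5   # seconds; arbitrary choice for illustration

class PhraseTrigger:
    def __init__(self):
        self.stage_one_at = None   # time the first word was detected, or None

    def on_word(self, word, now=None):
        """Return True when the full phrase has triggered."""
        now = time.monotonic() if now is None else now
        if word == "hey":
            self.stage_one_at = now           # enter (or refresh) stage one
            return False
        if word == "siri":
            armed = (self.stage_one_at is not None
                     and now - self.stage_one_at <= STAGE_ONE_TIMEOUT)
            self.stage_one_at = None          # stage one is consumed either way
            return armed
        return False

t = PhraseTrigger()
print(t.on_word("siri", now=0.0))   # False: nothing armed yet
print(t.on_word("hey",  now=1.0))   # False: arms stage one
print(t.on_word("siri", now=2.0))   # True: second word within the expiry window
```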

If your experiments (which I didn’t listen to) didn’t work, then it’s probably just the usual issue that nothing we are doing here is actually realistic in the sense that we can’t afford to really train a valid model in the environment here. The resources are too constrained. Witness the very first “is there a cat” experiment in DLS C1 W4: it works terribly on your own pictures. 209 samples is a ludicrously small training set for a problem that complex.

I will return to this topic tomorrow, but I always thought the ‘Is there a cat (?)’ assignment should have been replaced by this:

At least now I feel more than confident I can replicate that :smiley:.

It is more these side cases; and I get it-- I've been a university-level teacher before… Many would like to just pass the assignment, but rather than just 'drink the Kool-Aid', I am trying my best to understand, which is why I ask.

L8r.

I think now is a good time to revisit the lectures in C5W1. For example, just as the lecture title "Backpropagation through time" suggests, only thought of in reverse, an RNN gives us its predictions via forward propagation through time.

In other words, each of the RNN's predictions is an aggregation over time. A prediction is not just about the current time step, but about all previous time steps too, even though the contributions of distant time steps may be heavily discounted.

Therefore, it is a bad idea to mark the time steps during which the keyword is being spoken, because at those time steps the model has not yet seen the whole keyword. It is a better idea to mark the time steps afterwards, because each of them aggregates the whole keyword.

As for how far out we should mark them (50 time steps? 100?), that is a hyperparameter to be searched by comparing performance.
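A quick toy check of the "aggregation only over the past" point (not the lab's model -- shapes and values here are made up): with a unidirectional GRU, changing the inputs after step t leaves the outputs up to step t untouched, so a prediction made during the word can never have "seen" the whole word.

```python
import numpy as np
import tensorflow as tf

gru = tf.keras.layers.GRU(8, return_sequences=True)  # unidirectional, as noted above about the lab's GRU

x = np.random.randn(1, 100, 4).astype("float32")     # toy sequence: 100 steps, 4 features
y_full = gru(x).numpy()

x_cut = x.copy()
x_cut[0, 60:, :] = 0.0                                # alter only the inputs from step 60 onward
y_cut = gru(x_cut).numpy()

print(np.allclose(y_full[0, :60], y_cut[0, :60]))     # True: outputs before step 60 are unaffected
print(np.allclose(y_full[0, 60:], y_cut[0, 60:]))     # False (almost surely): later outputs change
```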

This alone, to me, is not an indication of a problem.

To have a meaningful discussion, we would do better to have three longer audio clips that

  • are long enough to also cover the decreasing probability
  • let us compare the difference between the effects of the two words

In producing the bottom two clips, it would be best to keep the words' locations unchanged so that you can compare the probability curves more intuitively.

These are just suggestions; I can make suggestions, but the rest is up to you. :wink:

Cheers,
Raymond

To improve my suggestion,

it is also important to keep a long silence before the first word, long enough that the probability becomes stabilized.

This establishes a baseline.

Okay, I can see now why attention is important. An LSTM can be very good at remembering, but not very good at forgetting.

I also asked because I have a small FOSS project I came up with, something that as a product oddly does not seem to exist yet-- though another major company has already come up with this and released the model open source.

But I am hoping to do something more general and specific, and to get it to run in 'real time' on a Raspberry Pi 3 A+ (after quantization, I'd guess).

I was kind of curious about some of their model design choices. Anyway, I wrote to one of the project leaders-- let's hope she gets back to me :crossed_fingers:.