We are talking about actual recorded audio here, right? What is the sampling frequency? How much wall-clock time is 50 cycles at that sampling rate? Maybe that corresponds to how long it takes to recognize the first phoneme of the word? Sorry, but I am away from any computer right now and can’t check the sampling rate.
Presuming we are working in terms of output time steps, 50 steps of a 10-second clip divided into 1375 parts gives an ‘actual’ time of ~0.36 seconds, which seems a little on the short side.
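For what it’s worth, the arithmetic is just this (a minimal sketch, assuming the lab’s numbers of a 10-second clip and 1375 output steps):

```python
# Back-of-the-envelope check: wall-clock time spanned by 50 output steps,
# assuming a 10 s clip and 1375 output time steps (the numbers from the lab).
clip_seconds = 10.0
Ty = 1375            # output time steps per clip
labeled_steps = 50   # steps labeled 1 after the trigger word

seconds_per_step = clip_seconds / Ty
print(labeled_steps * seconds_per_step)   # ~0.36 s
```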
I tried recording my own voice saying the trigger word used (‘activate’), and somewhere closer to 0.5 seconds seems more reasonable (perhaps 0.47 seconds if I clip it super tightly).
Again, I should stress that, if anything, the duration of the trigger itself worries/confuses me less than the fact that we ‘activate’ the output only after the trigger word occurs-- not at all during its actual duration.
Until all of the audio samples have been processed, how would the algorithm reliably know that the word is “activate” rather than “action” or “Acton, Ohio” (for example)?
Ok, I went back and listened to both of the videos in this section. He doesn’t really go into a lot of detail, but he does clearly make the point that we want to start the labels as 1 at the end of the trigger word. The point is that we don’t really care about the trigger word itself, other than the fact that it triggered us, right? It’s then whatever comes next that Siri or Alexa needs to understand and process. We don’t need the “Hey, Siri”, we need the “tell me what time it is” or “play the Beach Boys”.
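From memory (so treat the names and exact numbers here as my assumptions rather than the assignment’s actual code), the labeling boils down to something like this:

```python
import numpy as np

Ty = 1375  # output time steps for a 10-second clip

def label_after_trigger(y, segment_end_ms, clip_ms=10000.0, steps=50):
    """Set y to 1 for `steps` output steps immediately AFTER the trigger word ends."""
    segment_end_step = int(segment_end_ms * Ty / clip_ms)
    for t in range(segment_end_step + 1, min(segment_end_step + 1 + steps, Ty)):
        y[0, t] = 1
    return y

y = np.zeros((1, Ty))
y = label_after_trigger(y, segment_end_ms=4250)  # hypothetical trigger ending at 4.25 s
```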
At least that’s my take just listening to the lecture at 1.25 speed and not having gone back to see what additional details are in the assignment.
As always, it never hurts to just listen to the lecture again.
I mean, I completely agree. This is exactly what he says in the lecture, and also what he has us do in the lab (which I already got 100% on), and, after all, it does seem to work.
However, it makes no logical sense to me, thus my question.
I mean, consider the case where we make this ever so slightly more complicated: now, instead of one trigger word, we have two. Further, let’s let these words be independent of one another (not part of a phrase). And for simplicity’s sake, let’s make them synonyms so that we don’t have to worry about having additional classes in our Y output.
For example, let’s say the two trigger words are ‘start’ and ‘go’.
If these both appear in a 10-second clip (so two triggers) and they are spaced far enough apart, yes, looking at the ‘audio after’ might still work.
However, if these words were said in immediate succession (‘go start’), suddenly we’d seem to have a big problem, as there is no quiet audio after the first trigger word, and, worse, the audio that comes immediately after it is the second trigger word.
I mean, I did have one really out-there thought-- were this a bi-directional RNN, the time just after the end of the phrase would, in a strange way, also signal the beginning of the phrase in a sense-- just run backwards/in reverse.
However, the GRU we implement in the lab is not bi-directional.
I would just add that, in trying to better understand what is going on here, I tried to run the ‘optional’ part of the assignment where you make/upload your own wave file.
At first I was trying to throw some really big challenges at it, but the results were all over the place, so on something like my fifth iteration I presented it with only two words.
First ‘black’, and then ‘activate’, with otherwise comparative silence. This is pretty obvious to see in the spectrogram-- but something is really going wrong in the output function.
It doesn’t even seem to trigger when ‘activate’ is said (or maybe it does, but only a little), and even afterward, in utter silence, the probability of detection keeps rising…
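In case it helps to reproduce what I’m seeing, this is roughly how I eyeball a clip’s spectrogram outside the notebook (the NFFT/noverlap values here are my own guesses, not necessarily what the lab’s helper uses):

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Quick look at a recording's spectrogram; parameter values are assumptions.
rate, data = wavfile.read("my_clip.wav")   # hypothetical 10-second recording
if data.ndim > 1:
    data = data[:, 0]                      # keep a single channel if stereo
plt.specgram(data, NFFT=200, Fs=rate, noverlap=120)
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.show()
```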
I think you’re just “overthinking” this. If we have two trigger words, then either of them “triggers”, so if both words appear in sequence, the second trigger simply overrides the first and you take whatever comes after that as the actual command. If you have a phrase, then the trigger has two stages: the first word gets you to stage one and the second word only “triggers” if you’re already in stage one, although we’d probably need the sophistication of an expiry on stage one.
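To make that concrete, here is a toy sketch (my own, not anything from the course) of a two-stage phrase trigger with an expiry on stage one; the word names and the expiry length are arbitrary assumptions:

```python
STAGE1_EXPIRY_STEPS = 75   # assumption: how long stage one stays "armed"

def phrase_trigger(detections):
    """detections: iterable of (step, word) events, word in {"hey", "siri"}."""
    armed_until = -1
    for step, word in detections:
        if word == "hey":
            armed_until = step + STAGE1_EXPIRY_STEPS   # enter / refresh stage one
        elif word == "siri" and step <= armed_until:
            yield step                                 # full phrase triggers here
            armed_until = -1                           # reset after a full trigger

# e.g. list(phrase_trigger([(10, "hey"), (60, "siri")])) -> [60]
```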
If your experiments (which I didn’t listen to) didn’t work, then it’s probably just the usual issue that nothing we are doing here is actually realistic in the sense that we can’t afford to really train a valid model in the environment here. The resources are too constrained. Witness the very first “is there a cat” experiment in DLS C1 W4: it works terribly on your own pictures. 209 samples is a ludicrously small training set for a problem that complex.
I will return to this topic tomorrow, but I always thought the ‘Is there a cat (?)’ assignment should have been replaced by this:
At least now I feel more than confident I can replicate that.
It is more these edge cases that concern me. And I get it; I mean, I’ve been a university-level teacher before… Many would like to just pass the assignment, but rather than just ‘drink the Kool-Aid’, I am trying my best to understand, which is why I ask.
I think now is a good time to revisit the lectures in C5W1. For example, just as the lecture title “Backpropagation through time” suggests, but thought of in reverse, an RNN gives us its predictions via forward propagation through time.
In other words, each of the RNN’s predictions is an aggregation over time. A prediction depends not just on the current time step, but on all previous time steps as well, even though the contributions of very old time steps may be heavily discounted.
Therefore, it is a bad idea to mark the time steps during which the keyword is being spoken, because those time steps do not yet cover the whole keyword. It is a better idea to mark the time steps afterwards, because every one of them aggregates the whole keyword.
As for how far out we should mark them (50 time steps? 100 time steps?), that is a hyperparameter that needs to be tuned by comparing performance.
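To tie that back to the lab: the model, roughly as I remember it (so take the exact layers and sizes as assumptions), is strictly forward in time, which is exactly why the prediction at step t can only aggregate audio up to step t:

```python
# Skeleton of a unidirectional trigger-word model (simplified, from memory;
# not the lab's exact architecture). Each per-step output only "sees" audio
# up to that step, which is why the 1-labels go AFTER the keyword.
from tensorflow.keras.layers import Input, Conv1D, GRU, TimeDistributed, Dense
from tensorflow.keras.models import Model

Tx, n_freq = 5511, 101   # spectrogram steps and frequency bins, as I recall

x_in = Input(shape=(Tx, n_freq))
x = Conv1D(196, kernel_size=15, strides=4)(x_in)            # 5511 -> 1375 output steps
x = GRU(128, return_sequences=True)(x)                       # forward-only recurrence
y_out = TimeDistributed(Dense(1, activation="sigmoid"))(x)   # per-step trigger probability
model = Model(inputs=x_in, outputs=y_out)
```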
This alone, to me, is not an indication of a problem.
To me, to have a meaningful discussion, we would be better off with three longer audio clips that:
- are long enough to also cover the decreasing probability
- let us compare the difference between the effects of the two words
In producing the bottom two clips, it would be best to keep the words’ locations unchanged so that you can compare the probability curves more intuitively.
These are just suggestions; the rest is up to you.
Okay, I can see now why attention is important. An LSTM can be very good at remembering, but not very good at forgetting.
I also asked because I have a small FOSS project I came up with-- something I can build that, oddly, does not seem to exist yet as a product, though another major company has already come up with this and released the model open source.
But I am hoping to do something at once more general and more specific, and to get it to run in ‘real time’ on a Raspberry Pi 3 A+ (after quantization, I’d guess).
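For the Pi part, one plausible route (just a sketch of what I have in mind; `model` below is assumed to be an already-trained tf.keras model) would be post-training quantization via TFLite:

```python
import tensorflow as tf

# Hedged sketch: post-training (dynamic-range) quantization with TFLite,
# one common way to shrink a Keras model for a Raspberry Pi.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("trigger_word_quant.tflite", "wb") as f:
    f.write(tflite_model)
```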
I was kind of curious about some of their model design choices. Anyway, I wrote to one of the project leaders-- let’s hope she gets back to me.