Near the end of the exercise, past all the required grading cells, there’s one part of the code that overlays a chime sound on the audio whenever it detects an “activation” word. The instruction says we want to insert a chime sound at most once every 75 steps to avoid having two sounds per activation. But the code only counts “consecutive steps” up to 20 time steps. Why 20 here? Earlier in the exercise we coded the output as 1 for 50 time steps when there’s an activation. Doesn’t setting 20 time steps here mean we could potentially get 2 sounds per activation? I’m a little confused by this and would appreciate any help. Thanks.
Thanks for bringing this up. The staff have been notified regarding the markdown hinting that exactly 1 chime should be present in a 75 timestep output window.
Thanks. It just occurred to me that maybe the 20 consecutive steps in the given code apply to the output, which has a different dimension than the input? Maybe a conversion is happening here?
I don’t follow you. Please elaborate keeping this markdown text in mind:
So we will insert a chime sound at most once every 75 output steps
The input and output to the CNN are of different dimensions. So for example, maybe because the input is of dimension (1,5511) then 75 output steps should be just 75 steps in the dimension, no problem. But the output from the CNN has a reduced dimension of 1375, so 75 steps in the 5511 input dimension is actually ~18 steps in the output dimension, which is close to the 20 steps in the code? This is my guess, not 100% sure.
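The conversion described in this guess can be checked with a few lines of arithmetic. This is only a sketch of the poster's hypothesis: the step counts 5511 and 1375 come from the thread, while the helper name `input_to_output_steps` is made up for illustration.

```python
TX = 5511  # number of input time steps, as quoted in the thread
TY = 1375  # number of output time steps, as quoted in the thread


def input_to_output_steps(n_input_steps, tx=TX, ty=TY):
    """Scale a duration measured in input steps to the output time axis."""
    return n_input_steps * ty / tx


# 75 input steps expressed in output steps
print(round(input_to_output_steps(75), 1))  # 18.7
```

So 75 input steps would correspond to roughly 18.7 output steps under this hypothesis, which is close to, but not the same as, the 20 used in the code.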
That doesn’t sound like the correct interpretation. If your interpretation were correct, then we would have to check for consecutive_timesteps == 19 or 18 instead of consecutive_timesteps > 20.
Per my understanding, 20 was chosen as a threshold via experimentation, since it’s possible that the model doesn’t predict exactly 50 consecutive 1s after the end of the trigger word. To reduce the number of chimes, one chime in a window of 75 output steps was used.
I’ve asked other mentors to comment on this topic as well.
Thanks. Let’s see what other mentors say about this.
To begin with, I think we all agree that there are 1375 output steps.
With the current version of the code, the number 20 carries two meanings:
A. threshold for identifying it as an activation signal and thus adding a chime
B. minimum time interval between two chimes.
We can set both of them to the same value (A = B), or have B > A. In the current version of the code, it uses A = B = 20.
If you experiment with the code by changing the number to 75, no chime sound will be added to the first example immediately after, so 75 isn’t a good value for B (also supported by @balaji.ambresh’s comment) even if we wanted to match the description.
Therefore, we might keep A = B = 20, or change it to A = 20 and B = 75, and then make sure the text is consistent with the code.
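To make the difference between the two options concrete, here is a minimal sketch that simulates where chimes would land, without touching any audio. It is not the notebook's code: the function `chime_steps` and its parameters `run_threshold` (A) and `min_gap` (B) are hypothetical names, and the input assumes an idealized model that outputs exactly 50 consecutive 1s.

```python
def chime_steps(probs, prob_threshold=0.5, run_threshold=20, min_gap=75):
    """Return the output steps at which a chime would be inserted.

    run_threshold plays the role of A (consecutive steps needed to chime).
    min_gap plays the role of B (minimum output steps between two chimes).
    Both names are illustrative, not taken from the notebook.
    """
    chimes = []
    run = 0      # consecutive steps at or above prob_threshold
    last = None  # step of the most recent chime
    for i, p in enumerate(probs):
        run = run + 1 if p >= prob_threshold else 0
        if run > run_threshold and (last is None or i - last >= min_gap):
            chimes.append(i)
            last = i
            run = 0
    return chimes


# Idealized prediction: 50 consecutive 1s starting at output step 100.
probs = [0.0] * 100 + [1.0] * 50 + [0.0] * 100

print(chime_steps(probs, run_threshold=20, min_gap=20))  # [120, 141]
print(chime_steps(probs, run_threshold=20, min_gap=75))  # [120]
```

Under these assumptions, A = B = 20 produces two chimes for a single activation, while A = 20 and B = 75 produces one.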
The above is my comment, and since I didn’t make this part of the notebook, the course staff should have the final say of how to make what changes.
Your hypothesis would alter the meaning of output steps. However, it is a nice one, because you wouldn’t be able to propose any hypothesis if you had not thought about it, and I appreciate that.
The model is trained to predict 50 consecutive 1s starting at the end of the trigger word. That would make checking for 75 consecutive 1s a poor choice. Thoughts? Could you please check the code on the ticket filed on the repo?
Both that solution and my suggestion of replacing the description’s 75 with 20 can prevent a double chime within a time interval. The difference is that the ticket’s solution separates the time interval from the threshold, making them two configurable values (75 and 20 respectively), which is good.
Given that the model is expected to predict 50 consecutive 1s for a trigger word, wouldn’t there be multiple chimes for the same trigger word if the model is well trained and we check just for consecutive_timesteps > 20?
I think that the value for A (and for B, if it is configurable to be different from A), as defined in this post, is empirical. Setting A > 50/2 sounds reasonable, but it would be better determined experimentally.
It’s like adjusting the threshold for a logistic regression model to trade off between precision and recall: it’s empirical.
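The A > 50/2 suggestion can be sanity-checked with a small simulation. This is a sketch under two assumptions: the model outputs exactly 50 consecutive 1s (which may not hold in practice, as noted earlier in the thread), and a chime fires whenever a counter exceeds A and then resets. The function name `count_chimes` is hypothetical.

```python
def count_chimes(run_length, a):
    """Count chimes fired during one unbroken run of 1s.

    Assumes a chime fires each time the counter exceeds `a`,
    after which the counter resets to 0.
    """
    chimes = 0
    counter = 0
    for _ in range(run_length):
        counter += 1
        if counter > a:
            chimes += 1
            counter = 0
    return chimes


for a in (10, 20, 24, 25, 30):
    print(a, count_chimes(50, a))
```

With these assumptions, A = 20 and A = 24 both give two chimes per activation, while A = 25 and above give one. A higher A also raises the risk of missing activations when the model predicts fewer than 50 consecutive 1s, which is why the value still needs to be tuned experimentally.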
PS: I have a feeling that your argument was going to support that solution in the ticket, and I totally agree, because having A = 20 and B = 75 eliminates that possibility.