Trigger word detection assignment

Hello all,
I don’t fully understand how we get to Tx = 5511 and Ty = 1375, or how to pick the CONV_1D parameters such as filter_size = 15 and num_filters = 196.
Could you explain it a bit or give a reference link, please?

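The Tx and Ty numbers follow from the audio sample rate of 44,100 samples/second, which gives 441,000 samples in a 10-second clip.
Tx = 5511 is the number of time steps in the spectrogram. It is about 44,100 / 8, or equivalently 441,000 / 80.
Ty = 1375 is the number of time steps at the model output, after the CONV_1D layer. It is about 44,100 / 32, or roughly Tx / 4. This reduces the number of time steps (not the number of frequency bins) that the rest of the network has to process. We don’t need very fine time resolution in order to detect the trigger word.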
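
To make the arithmetic concrete, here is a minimal sketch that reproduces both numbers. It rests on assumptions that are not stated in this thread: a spectrogram window of 200 samples with 120 samples of overlap (a hop of 80 samples), and a CONV_1D stride of 4 with no padding. Those are the values I believe the notebook uses, so please check them against your copy. The function names are made up for illustration.

```python
def spectrogram_steps(n_samples, window, overlap):
    """Number of spectrogram columns when windows advance by (window - overlap)."""
    hop = window - overlap
    return (n_samples - overlap) // hop


def conv1d_output_length(n_in, kernel_size, stride):
    """Output length of a 1D convolution with no padding ('valid')."""
    return (n_in - kernel_size) // stride + 1


SAMPLE_RATE = 44_100          # samples per second (from the answer above)
CLIP_SECONDS = 10
n_samples = SAMPLE_RATE * CLIP_SECONDS   # 441,000 samples

# Assumed values, not given in this thread: window=200, overlap=120, stride=4.
Tx = spectrogram_steps(n_samples, window=200, overlap=120)
Ty = conv1d_output_length(Tx, kernel_size=15, stride=4)

print(Tx, Ty)
```

Under those assumptions the script gives Tx = 5511 and Ty = 1375, which is why the two values look like 44,100 / 8 and 44,100 / 32 without matching them exactly.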

The filter size and number of filters are hyperparameters set by experiment. You want to reduce the complexity (the convolution shortens the sequence) and learn a lot of different patterns (196 filters), but you don’t want so many filters that training takes too long.
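
As a rough illustration of that trade-off, the sketch below counts the trainable parameters of a 1D convolution for a few filter counts. It assumes the spectrogram has 101 frequency bins per time step, which is not stated in this thread, and the helper name is hypothetical.

```python
def conv1d_param_count(kernel_size, in_channels, num_filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * num_filters + num_filters


N_FREQ = 101  # assumed number of frequency bins per time step

# Compare the filter count from the question (196) with smaller and larger choices.
for num_filters in (64, 196, 512):
    params = conv1d_param_count(kernel_size=15, in_channels=N_FREQ,
                                num_filters=num_filters)
    print(num_filters, params)
```

The parameter count grows linearly with the number of filters, so each extra filter adds the same training cost. A value like 196 is a compromise found by trial, not something derived from a formula.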