Which model and dataset should I use for keyword spotting?

I am working on an audio classification project. Since it will run offline on mobile, my first choice of model is MobileNet.

For audio preprocessing there are several choices: MFCCs, spectrograms, mel spectrograms, and others.

As for the dataset, I have about 13,000 audio clips of a single word. The word is ‘Adele’, a German word.

Now, my goal is ‘1 vs all’.

I am confused about how to prepare my initial dataset. Do I need 13,000 negative examples as well?
If not, how would I be able to tell whether the trigger word was spoken?
If yes, what distribution should I collect the negative data from? The possibilities are infinite.

Also, please suggest any other model I should consider.

Ideally you would want a balanced dataset, i.e. about 13,000 negative examples. More realistically, this is a scenario where the keyword occurs rarely compared with all the other words being spoken, so you would actually need more negative examples than positive ones. With an unbalanced dataset you also need to use precision, recall, and F1 scores.

Hey, your reply is appreciated. I have a few concerns.

Correct me if I am wrong. If I go with unbalanced data (more negative examples), the model will be biased towards negative data and less likely to identify the trigger word.

Secondly, what would be the best distribution to collect negative data from? I believe silence, movies, podcasts, etc. would work as negative data. Cutting those long recordings into 2-second chunks would be helpful.
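A minimal sketch of the chunking idea, assuming the long recording is already loaded as a mono NumPy array. The function name `split_into_chunks`, the 16 kHz sample rate, and the optional overlap are illustrative assumptions, not something fixed by this thread.

```python
import numpy as np

def split_into_chunks(samples, sample_rate, chunk_seconds=2.0, hop_seconds=None):
    """Cut a long mono recording into fixed-length chunks.

    hop_seconds=None gives non-overlapping chunks; a smaller hop gives
    overlapping chunks (more negative examples from the same audio).
    A trailing remainder shorter than one chunk is dropped.
    """
    samples = np.asarray(samples)
    chunk_len = int(round(chunk_seconds * sample_rate))
    hop_len = chunk_len if hop_seconds is None else int(round(hop_seconds * sample_rate))
    if chunk_len <= 0 or hop_len <= 0:
        raise ValueError("chunk_seconds and hop_seconds must be positive")
    if len(samples) < chunk_len:
        return np.empty((0, chunk_len), dtype=samples.dtype)
    n_chunks = 1 + (len(samples) - chunk_len) // hop_len
    return np.stack([samples[i * hop_len : i * hop_len + chunk_len]
                     for i in range(n_chunks)])

# Hypothetical input: 7 seconds of audio at an assumed 16 kHz sample rate.
sample_rate = 16000
recording = np.random.default_rng(0).normal(size=7 * sample_rate)
negatives = split_into_chunks(recording, sample_rate)                       # no overlap
negatives_overlap = split_into_chunks(recording, sample_rate, hop_seconds=1.0)
```

Overlapping chunks are highly correlated, so it is probably safer to keep all chunks from one recording in the same train/validation split.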

Yes, that should be right. That is why you use precision, recall, and F1 in this case!

Whatever makes sense and is similar to the voice characteristics of your application in real life!

Guide me here. As far as I know, these are evaluation metrics. How can I use them other than to score the final model?

Do you mean that I should use these metrics to calculate my end results? And high precision, recall, and F1 are favorable, correct?

Yes, for unbalanced datasets these are better metrics than plain accuracy! But that is just the general principle. As far as I remember, the Deep Learning Specialization has a lab on detecting spoken keywords; you should have a look at that.
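To illustrate why plain accuracy can mislead on unbalanced data, here is a small sketch in plain Python. The toy labels (10 keyword clips among 1,000) and the helper names are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of clips labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1 = keyword)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy data: 10 keyword clips, 990 other clips.
y_true = [1] * 10 + [0] * 990

# A useless model that never fires still looks accurate.
always_negative = [0] * 1000
acc_useless = accuracy(y_true, always_negative)
prf_useless = precision_recall_f1(y_true, always_negative)

# A model that finds 8 of the 10 keywords and raises 2 false alarms.
decent = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 988
acc_decent = accuracy(y_true, decent)
prf_decent = precision_recall_f1(y_true, decent)
```

The never-firing model scores high on accuracy but zero on recall, which is the failure mode these metrics expose. They can be tracked on a validation set during training, not only at the very end.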

The Deep Learning Specialization (Course 5 - Sequence models) discusses identifying an activation keyword from an audio stream.

The difference with that tutorial is that they use 10-second audio clips. I only have 2-second clips. Converting them to 10 seconds doesn’t seem like a good idea. What do you suggest?

I guess you need to go through the lab, understand the process, and then change the input time span as well as the model input to suit your data.


If I remember correctly, the algorithm used in that lab will automatically 1) trim the audio if it is over 10 seconds and 2) pad it if it is too short.

– At least in the ‘try it yourself’ part at the end.
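For reference, trimming or padding to a fixed length can be sketched as below. This is a generic NumPy version, not the lab's actual code; the name `fix_length` and the 16 kHz sample rate are assumptions. The same function works with a 2-second target instead of 10 seconds.

```python
import numpy as np

def fix_length(samples, target_len):
    """Trim a clip that is too long, or zero-pad (silence) at the end if too short."""
    samples = np.asarray(samples)
    if len(samples) >= target_len:
        return samples[:target_len]
    return np.pad(samples, (0, target_len - len(samples)), mode="constant")

# Assumed 16 kHz audio and a 2-second model input window.
sample_rate = 16000
target_len = 2 * sample_rate

short_clip = np.ones(int(1.5 * sample_rate))   # 1.5 s, will be padded
long_clip = np.ones(3 * sample_rate)           # 3 s, will be trimmed

short_fixed = fix_length(short_clip, target_len)
long_fixed = fix_length(long_clip, target_len)
```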

This reminds me of a Kaggle competition I competed in called BirdCLEF. For that I used mel spectrograms and MobileNetV4, which worked well. That was for much longer time spans, but I don’t think that matters.
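As a rough sketch of what a log-mel spectrogram front end computes, here is a NumPy-only version. All parameter values (16 kHz, 512-point FFT, 10 ms hop, 40 mel bands) are illustrative assumptions, and in practice an audio library would normally handle this step.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(samples, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Return a (n_frames, n_mels) array of log-mel energies in dB."""
    samples = np.asarray(samples, dtype=np.float64)
    if len(samples) < n_fft:
        samples = np.pad(samples, (0, n_fft - len(samples)))
    n_frames = 1 + (len(samples) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([samples[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return 10.0 * np.log10(mel + 1e-10)  # small offset avoids log(0)

# Hypothetical 2-second clip: a 1 kHz tone at an assumed 16 kHz sample rate.
t = np.arange(2 * 16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

The resulting 2-D array can be treated like a single-channel image, which is what makes image models such as MobileNet applicable to audio.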

Try augmenting your truth data: you can upsample or frequency-shift the word, add noise, etc. to generate more truth data. I’m sure you can generate various accents with generative AI as well. A balanced negative data set would be good, and it would probably help to include words that sound similar to ‘Adele’.
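Two of the simpler augmentations can be sketched in NumPy as follows: adding noise at a chosen signal-to-noise ratio, and shifting the word in time within the clip. The function names and parameter values are illustrative assumptions. Frequency shifting and accent generation are not covered here.

```python
import numpy as np

def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the requested SNR (in dB)."""
    samples = np.asarray(samples, dtype=np.float64)
    signal_power = np.mean(samples ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=samples.shape)
    return samples + noise

def time_shift(samples, max_shift, rng):
    """Shift the clip left or right by up to max_shift samples, filling with silence."""
    samples = np.asarray(samples)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(samples)
    if shift > 0:
        shifted[shift:] = samples[:-shift]
    elif shift < 0:
        shifted[:shift] = samples[-shift:]
    else:
        shifted[:] = samples
    return shifted

# Hypothetical 2-second clip at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(2 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

noisy = add_noise(clip, snr_db=10.0, rng=rng)
shifted = time_shift(clip, max_shift=sample_rate // 4, rng=rng)  # up to 0.25 s
```

Each original clip can be passed through these functions several times with different random draws to produce multiple augmented variants.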

Sounds fun! Thanks for posting.