Issues using RNN for drum sound classification

I am trying to use RNNs for classification of sounds (specifically drum categorization), as I described in more detail in the following post on Stack Exchange:

It is inspired by the article referenced in the post above. I know that an RNN in the frequency domain is not the best option and that convolutional networks would be a better choice; however, I want to try both approaches for learning purposes.

I do not expect fantastic results, but I expect it to tell apart vastly different categories, for example overhead and kick drum.

1 Like

From the linked paper…

“Target value is derived from the folder name meaning [‘kick’, ‘snare’, ‘tom’, ‘overhead’ …] and then later turned to one-hot representation …”

my emphasis added

But drum kit sounds are not mutually exclusive temporally, right? <unless played by a person like me who doesn’t have good limb independence>. Drum transcription seems like a pretty tough job.

Here’s another thread where different approaches are discussed…

I have no personal experience working with audio signals, so I can’t address the architecture question. It’s bad enough for me listening to really talented drummers; seeing the transcription would only make the gap more obvious :joy:

2 Likes

My naive first thought:
To make transcriptions of real music, you may have to train a lot of RNN logistic detectors, one for each labeled sound type. Then to make a transcription, run them all in parallel from the same sound source, to detect all of the things which may be present simultaneously in a complete music sample.
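As a very rough sketch of what I mean (the label list, layer sizes, and dummy input below are made up purely for illustration):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Hypothetical label set and input size -- adjust to your data.
DRUM_LABELS = ["kick", "snare", "tom", "overhead"]
N_FREQ_BINS = 1025

def make_detector():
    """One independent binary (logistic) detector for a single drum type."""
    model = Sequential([
        LSTM(64, input_shape=(None, N_FREQ_BINS)),  # variable-length spectrogram sequences
        Dense(1, activation="sigmoid"),             # P(this drum is present)
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

detectors = {label: make_detector() for label in DRUM_LABELS}

# At transcription time, run every (trained) detector on the same spectrogram.
spectrogram = np.random.rand(1, 200, N_FREQ_BINS)  # (batch, timesteps, bins), dummy data
predictions = {label: float(m.predict(spectrogram, verbose=0)[0, 0])
               for label, m in detectors.items()}
print(predictions)
```

Each detector only has to answer “is my drum in here or not”, so simultaneous hits stop being a labeling problem.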

3 Likes

Agree it seems difficult. For a drum kit, some parts have very quick attenuation (snare or kick drum) while others can have considerable sustain after percussive strikes (ride cymbal). Untangling that into a performance-to-notation mapping is not likely to be easy.

@Amir_Pasagic I’ve already got a couple of Tony Williams and Billy Cobham recordings in mind so when it’s ready for beta test let me know!

2 Likes

Thanks for your answer so far, and thank you for the linked article!
I will give it a read now.

I formatted the training data in a way that there is only audio from one wav file at a time, optionally followed by a period of silence, then followed by the next audio file.

I know this is not how music works in real life, but I wanted to get a “Hello world” project running with the basic setup, to see whether I understand how different versions of RNNs work, what the input formats for the different RNN variants are, etc. When you start from scratch, I believe it's good to go step by step and make sure the data I work with is sufficiently versatile, that the general structure of the model and the number of layers/units is good enough for this type of problem, and so on :slight_smile: A proof of concept, so to speak.

I was even thinking of starting with a format in which an entire wav file is the input at a time (correct me if I am wrong, but this would be a many-to-one rather than a many-to-many representation), in which case I would input a tensor with batch size equal to the total number of wav files, where each sequence in the batch is an STFT spectrogram of size sequence_length x number_freq_bins.
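To make those shapes concrete, building that many-to-one tensor would look roughly like this (just a sketch: it assumes librosa for the STFT, pads/crops every clip to the same number of frames, and the file names are placeholders):

```python
import numpy as np
import librosa

N_FFT = 2048        # gives 1025 frequency bins
MAX_FRAMES = 200    # pad/crop every clip to the same sequence length

def wav_to_spectrogram(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y, n_fft=N_FFT))          # (1025, frames)
    S = librosa.amplitude_to_db(S, ref=np.max).T      # (frames, 1025)
    if S.shape[0] < MAX_FRAMES:                       # pad short clips with "silence"
        S = np.pad(S, ((0, MAX_FRAMES - S.shape[0]), (0, 0)), constant_values=S.min())
    return S[:MAX_FRAMES]

wav_paths = ["kick_01.wav", "snare_01.wav"]               # placeholder file names
X = np.stack([wav_to_spectrogram(p) for p in wav_paths])  # (n_files, MAX_FRAMES, 1025)
print(X.shape)
```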

Currently I just input all the wav file spectrograms stacked together, and a label is given at each timestep.

Actually, since we only want to know when the drum is triggered, the data could possibly be reduced further by analyzing only the first 200-500 ms, as I am not interested in the entire duration of the wav (the most dominant dynamics occur at the beginning).
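i.e. something along these lines, with 300 ms as an arbitrary cut-off inside that range:

```python
import librosa

ATTACK_MS = 300  # arbitrary choice within the 200-500 ms range mentioned above

y, sr = librosa.load("kick_01.wav", sr=None, mono=True)  # placeholder file name
y_attack = y[: int(sr * ATTACK_MS / 1000)]               # keep only the onset/attack portion
```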

You are right. Perhaps it is easier to turn it into a bunch of binary classification problems instead of one multi-class problem in which classes can occur at the same time and we try to extract it all from a single spectrogram.

Regarding my simplified problem, I thought it shouldn’t be that complex to get it to make a very basic distinction, since e.g. a cymbal and a kick clearly have very different spectral content (without even needing to look at the time dynamics), and even a feed-forward network seems like it should in theory be capable of making that distinction better than my current LSTM setup. So I was assuming I am doing something wrong, but I am not sure how to approach debugging it.

One thing that I thought might be interesting is that there are plenty of MIDI files available, from which one can easily generate music that is rich in all sorts of spectral components while also having easily derived labels for e.g. the kick drum. That way we can have many examples of kick drums occurring simultaneously with other instruments, hence a rich enough data set. Basically, it is easy to self-generate labeled training data, apply various attenuations, and then run a number of binary classifications of the type “kick detected”, “hi-hat detected”, etc.
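For illustration, reading kick labels off a MIDI drum track could look roughly like this (a sketch assuming the pretty_midi library; General MIDI notes 35/36 are the bass drums, and the file name is a placeholder):

```python
import pretty_midi

KICK_NOTES = {35, 36}  # General MIDI: acoustic bass drum / bass drum 1

def kick_onsets(midi_path):
    """Return the times (in seconds) at which a kick drum is triggered."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    onsets = []
    for inst in pm.instruments:
        if inst.is_drum:
            onsets.extend(note.start for note in inst.notes if note.pitch in KICK_NOTES)
    return sorted(onsets)

# Together with audio rendered from the same MIDI file, these onset times give
# self-generated "kick present / kick absent" labels for each analysis window.
print(kick_onsets("some_song.mid"))  # placeholder file name
```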

It was just a thought, not sure if it makes sense.

Your idea of farming some MIDI files to get your training data seems useful.

1 Like

There’s nothing novel or particularly difficult about using spectrogram images to identify different sounds. It’s essentially the classic ‘cat or not-cat’ problem from ML image classification.

You’re aiming for something much higher, so I recommend you start on that rather than oversimplifying the scope just to get something working.

1 Like

Yes, you are right. I guess I should, as written above, consider just the so-called attack envelope when it comes to percussion: detect just the first 200-300 ms of the drum and label only the first window as “Kick” and the rest as “None”.

Also, deal - you will be the first one to know when it’s up and running so you can test it out :slight_smile:

I am trying to learn RNNs, as I just finished the course on them (on top of the general ML specialization), and I already had issues with them in some other implementations, so I wanted to apply RNNs to something that seemed interesting, though I get that they are by far not the best choice of model for this problem.

As I said - it was mostly for educational purposes, to understand what’s happening inside.

Thank you for your replies, in any case.

Also, sorry for the many replies, but I wanted to point out that I mainly followed the approach from this article, and I want to figure out how it worked for them:

I don’t think there is enough information in the article to replicate their method.

1 Like

I am happy to hear that it is not just my impression. Thank you for reading it.

I was curious to see why their method works and where I fail.
I would like to check whether my understanding of the data structures in RNNs makes sense, and whether the problem lies there or in the choice of model architecture.

I will try your suggested approach of binary classification, Kick (true) vs. No Kick (false), for a start.

Just one more question - if I wish to share a notebook together with data that could be run by other users, what are the preferred options for the members here? My Colab notebook reads directly from My Drive, so it’s not usable by others if they wish to play around.

GitHub is very popular.

1 Like

I have tried to really simplify the issue by:

  1. Removing the time component altogether.
    I averaged the frequency spectrum over the whole window instead of storing the STFT for each “timestep” (see the short sketch right after this list).

    The input is now (1, N, 1025), where N is the total number of .wav files, hence x.shape[1] is around 160 instead of 4000-5000.

  2. Simplified the problem to overhead or no-overhead, so the target value is boolean (cymbal or not cymbal). The problem is now reduced to binary classification, and if this works, I could run a few different classifiers in parallel for a few different drums.

    The target is now (1, N, 1), where again N is the total number of .wav files.

  3. Used a feed-forward network.

    Since I am a rookie in the field of machine learning, when things don’t work I like to simplify the problem to see where it gets stuck. Since I have basically removed the time dependencies by averaging the spectrum over the entire duration of the wav clip, we can basically use a feed-forward net to classify the drums as cymbal or not-cymbal. I wanted to make sure that I am not misusing the LSTM, and also see whether the number of samples is sufficient and whether the spectral content is a good thing to look at in the first place.

    I was hoping that with this approach I could get the NN to at least sometimes make a correct prediction, but once again the prediction seems to always be stuck at the same output (just as it was with the softmax output for the multi-class classification, which was mostly stuck at the same output vector with minor variations, regardless of X_input).
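For reference, the feature extraction in step 1 is essentially this (a sketch with placeholder file names):

```python
import numpy as np
import librosa

def mean_spectrum(path, n_fft=2048):
    """Average the magnitude spectrum over the whole clip -> one 1025-dim vector."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft))  # (1025, frames)
    return S.mean(axis=1)                     # (1025,)

wav_paths = ["oh_01.wav", "kick_01.wav"]             # placeholder file names
X = np.stack([mean_spectrum(p) for p in wav_paths])  # (N, 1025)
Y = np.array([1, 0])                                 # 1 = overhead, 0 = not overhead
X = X[np.newaxis, ...]                               # (1, N, 1025), the way I currently feed it in
Y = Y[np.newaxis, :, None]                           # (1, N, 1)
```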

Looking at the problem in human terms, the average spectrum of the cymbals compared to the other classes is so vastly different and rich in all spectral components that it seems like sufficient data to tell them apart (especially from the kick drum), without any complicated and convoluted logic required:

I used the following model structure, probably overkill:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Define the feed-forward model
model_FF = Sequential()
model_FF.add(Dense(1025, activation='sigmoid', input_shape=(None, X.shape[2])))
model_FF.add(Dropout(0.5))  # dropout with a rate of 0.5
model_FF.add(Dense(512, activation='sigmoid'))
model_FF.add(Dropout(0.5))  # dropout with a rate of 0.5
model_FF.add(Dense(1, activation='softmax'))
model_FF.summary()

# Early stopping is used to make sure the model is not overfitting. If 'val_loss' has not
# improved within 25 epochs (patience=25), training is automatically stopped.
early = EarlyStopping(monitor='val_loss', min_delta=0, patience=25, verbose=1, mode='auto')

# Define the optimizer with a specific learning rate
optimizer = Adam(learning_rate=0.03)  # adjust the learning rate as needed

model_FF.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model_FF.fit(X, Y, validation_data=(X_val, Y_val), verbose=1, epochs=200, callbacks=[early])
```

What I also find quite weird is that the accuracy does not seem to change at all:

```
Epoch 1/200
1/1 [==============================] - 2s 2s/step - loss: 0.7523 - accuracy: 0.2222 - val_loss: 10.9142 - val_accuracy: 0.5000
Epoch 2/200
1/1 [==============================] - 0s 164ms/step - loss: 4.8477 - accuracy: 0.2222 - val_loss: 9.4599 - val_accuracy: 0.5000
Epoch 3/200
1/1 [==============================] - 0s 151ms/step - loss: 4.1732 - accuracy: 0.2222 - val_loss: 6.9746 - val_accuracy: 0.5000
Epoch 4/200
1/1 [==============================] - 0s 124ms/step - loss: 3.0671 - accuracy: 0.2222 - val_loss: 4.0264 - val_accuracy: 0.5000
Epoch 5/200
1/1 [==============================] - 0s 75ms/step - loss: 1.8172 - accuracy: 0.2222 - val_loss: 0.9932 - val_accuracy: 0.5000
Epoch 6/200
1/1 [==============================] - 0s 83ms/step - loss: 0.5807 - accuracy: 0.2222 - val_loss: 2.1426 - val_accuracy: 0.5000
```

Any thoughts on this? What could I be doing wrong?

I want to try multiple approaches; I think I will also try using Librosa to extract tabular audio characteristics from the first few hundred ms of each file and use decision trees/XGBoost or some other classification algorithm to see whether we can categorize drums (multi-class or boolean) based on these few relevant features.
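Roughly what I have in mind (only a sketch: the summary features are standard librosa ones, sklearn's gradient boosting stands in for XGBoost, and the file names/labels are placeholders):

```python
import numpy as np
import librosa
from sklearn.ensemble import GradientBoostingClassifier

def attack_features(path, attack_ms=300):
    """A handful of summary features from the first few hundred ms of a clip."""
    y, sr = librosa.load(path, sr=None, mono=True)
    y = y[: int(sr * attack_ms / 1000)]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    return np.concatenate([mfcc, [centroid, rolloff, zcr]])

wav_paths, labels = ["kick_01.wav", "oh_01.wav"], [1, 0]   # placeholder data
X_tab = np.stack([attack_features(p) for p in wav_paths])  # (n_files, n_features)
clf = GradientBoostingClassifier().fit(X_tab, labels)
```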

If you’re training a straightforward NN binary classifier, you need examples that are both True and False.

Please post an image showing one representative example of each class.

After that, all of your “True” examples must be statistically similar. Same goes for all of your “False” examples.

So if your examples each represent a time span, and there are variations in the time domain, that could be a challenge.

1 Like

Concatenated spectrograms above show that.

The spectrally bright stripes represent overheads that are rich in all frequencies.
All other drums are significantly darker stripes.

Subplot below indicates 1 (true) if it is overhead and 0 if it isn’t.

There are some variations of course, as with classifying anything in real life - no two examples are identical - but the difference is quite striking, as you can see in the bright and dark stripes in the image.

I should have said “… you need examples that are either True or False, not both at the same time”.

If you’re using a NN that uses the entire image as one training example, then each image can contain only one label.

Having an image that has multiple 0 and 1 flags isn’t going to work.

It appears you’re using examples that are appropriate for a sequence model. But you’re not using a sequence model. You’re using a fully-connected deep NN. It looks at the entire image and comes up with one prediction for the entire image.

The architecture you’re using is like identifying images of cats vs. images of dogs. Each image has only one label.

For a fully-connected NN to work with your data set, you’d need to preprocess it to cut each spectrogram into individual identical-length segments, where each segment either does or doesn’t contain an individual cymbal hit.
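Something along these lines, as a sketch (the segment length is arbitrary, and a segment gets the label 1 if any of its frames contains a hit):

```python
import numpy as np

def segment_spectrogram(S, frame_labels, seg_len=20):
    """Cut a (frames, bins) spectrogram into fixed-length segments,
    each carrying exactly one 0/1 label."""
    segments, labels = [], []
    for start in range(0, S.shape[0] - seg_len + 1, seg_len):
        segments.append(S[start:start + seg_len])
        labels.append(int(frame_labels[start:start + seg_len].max()))
    return np.stack(segments), np.array(labels)

S = np.random.rand(200, 1025)           # dummy spectrogram, (frames, bins)
frame_labels = np.zeros(200)
frame_labels[50:55] = 1                 # a hit somewhere in the middle
X_seg, y_seg = segment_spectrogram(S, frame_labels)
X_flat = X_seg.reshape(len(X_seg), -1)  # (n_segments, seg_len * 1025) for a dense NN
```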

Hm, that shouldn’t be the case, but perhaps I am misunderstanding again.

What dimensions should the input be?
Let’s say I have 400 wav files and they result in 400 different spectrograms (each a 1x1025 numpy array), giving a 400x1025 numpy array.

I reshape it to a 3D tensor of shape (1, 400, 1025) to make a suitable input.

Each wav file has one label, True or False, assigned, resulting in a (1, 400, 1) target array.

Is this sensible, or should the input/output be structured differently for a non-sequence problem?
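In code, with dummy arrays just to show the shapes I mean:

```python
import numpy as np

n_files, n_bins = 400, 1025
spectra = np.random.rand(n_files, n_bins)        # one averaged spectrum per wav file
labels = np.random.randint(0, 2, size=n_files)   # one True/False label per file

# what I currently feed in:
X = spectra[np.newaxis, ...]      # (1, 400, 1025)
Y = labels[np.newaxis, :, None]   # (1, 400, 1)

# versus treating every file as its own example -- is this what you mean?
X_alt = spectra                   # (400, 1025)
Y_alt = labels                    # (400,)
```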

Thank you again for patiently guiding me through this; input/output data formats can be very confusing for a beginner. I will be sure to do the same for others once I become more experienced :slight_smile: