I have tried to really simplify the issue by:
- Removing the time component altogether. I averaged the frequency spectrum over the whole window instead of storing the STFT for each timestep. The input is now of shape (1, N, 1025), where N is the total number of .wav files, so x.shape[1] is around 160 instead of 4000-5000 (a sketch of this preprocessing follows the list).
- Simplifying the problem to overhead vs. no-overhead. The target value is now boolean (cymbal drum or not cymbal drum), so the problem reduces to binary classification; if this worked, I could run a few such classifiers in parallel, one per drum type. The target is now of shape (1, N, 1), where again N is the total number of .wav files.
- Using a feed-forward network. Since I am a rookie in the field of machine learning, when things don't work I like to simplify the problem to see where it gets stuck. Having basically removed the time dependencies by averaging the spectrum over the entire duration of each .wav clip, a feed-forward net should be enough to classify the drums as cymbal or not-cymbal. I also wanted to make sure I am not misusing the LSTM, and to check whether the number of samples is sufficient and whether the spectral content is a good feature to look at in the first place.
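For reference, the preprocessing looks roughly like this (a sketch rather than my exact code; I'm assuming librosa for the STFT, and the folder path and the filename-based labelling are just illustrative):

import glob
import numpy as np
import librosa

N_FFT = 2048  # 1 + 2048/2 = 1025 frequency bins, matching the 1025 above

def average_spectrum(path):
    y, sr = librosa.load(path, sr=None)          # keep the native sample rate
    spec = np.abs(librosa.stft(y, n_fft=N_FFT))  # shape (1025, num_frames)
    return spec.mean(axis=1)                     # average over time -> (1025,)

wav_files = sorted(glob.glob('drums/*.wav'))  # illustrative path
X = np.stack([average_spectrum(f) for f in wav_files])[np.newaxis, ...]  # (1, N, 1025)

# Boolean target: 1 for an overhead/cymbal clip, 0 otherwise
# (labelling by filename here is just an example)
Y = np.array([1.0 if 'overhead' in f else 0.0 for f in wav_files], dtype='float32')
Y = Y.reshape(1, -1, 1)  # (1, N, 1)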
I was hoping that with this approach I could get the NN to at least sometimes make a correct prediction, but once again the prediction seems to be stuck at the same output (just as it was with the softmax output for the multiclass classification, which was mostly stuck at the same output vector with minor variations, regardless of the input X).
Looking at the problem in human terms, the average spectrum of a cymbal drum is so vastly different from the other classes, and so rich across all spectral components, that it seems like sufficient data to tell them apart (especially from the kick drum), without any complicated and convoluted logic required.
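This is roughly how I eyeball that difference (a sketch, reusing X and Y from the preprocessing above):

import matplotlib.pyplot as plt

feats = X[0]          # (N, 1025)
labels = Y[0, :, 0]   # (N,)
plt.plot(feats[labels == 1].mean(axis=0), label='cymbal/overhead')
plt.plot(feats[labels == 0].mean(axis=0), label='other drums')
plt.xlabel('STFT frequency bin')
plt.ylabel('mean magnitude')
plt.legend()
plt.show()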
I used the following model structure, probably overkill:
# Define the feed-forward model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model_FF = Sequential()
model_FF.add(Dense(1025, activation='sigmoid', input_shape=(None, X.shape[2])))
model_FF.add(Dropout(0.5))  # dropout with a rate of 0.5
model_FF.add(Dense(512, activation='sigmoid'))
model_FF.add(Dropout(0.5))  # dropout with a rate of 0.5
model_FF.add(Dense(1, activation='softmax'))
model_FF.summary()

# Early stopping is used to guard against overfitting: if 'val_loss' does not
# improve within 25 epochs (patience=25), training stops automatically.
early = EarlyStopping(monitor='val_loss', min_delta=0, patience=25, verbose=1, mode='auto')

# Define the optimizer with a specific learning rate
optimizer = Adam(learning_rate=0.03)  # adjust the learning rate as needed

model_FF.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model_FF.fit(X, Y, validation_data=(X_val, Y_val), callbacks=[early], verbose=1, epochs=200)
What I also find quite weird is that the accuracy does not seem to change at all:
Epoch 1/200
1/1 [==============================] - 2s 2s/step - loss: 0.7523 - accuracy: 0.2222 - val_loss: 10.9142 - val_accuracy: 0.5000
Epoch 2/200
1/1 [==============================] - 0s 164ms/step - loss: 4.8477 - accuracy: 0.2222 - val_loss: 9.4599 - val_accuracy: 0.5000
Epoch 3/200
1/1 [==============================] - 0s 151ms/step - loss: 4.1732 - accuracy: 0.2222 - val_loss: 6.9746 - val_accuracy: 0.5000
Epoch 4/200
1/1 [==============================] - 0s 124ms/step - loss: 3.0671 - accuracy: 0.2222 - val_loss: 4.0264 - val_accuracy: 0.5000
Epoch 5/200
1/1 [==============================] - 0s 75ms/step - loss: 1.8172 - accuracy: 0.2222 - val_loss: 0.9932 - val_accuracy: 0.5000
Epoch 6/200
1/1 [==============================] - 0s 83ms/step - loss: 0.5807 - accuracy: 0.2222 - val_loss: 2.1426 - val_accuracy: 0.5000
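To double-check, I also look at the raw predictions after training (a quick sketch):

preds = model_FF.predict(X_val)
print(preds.min(), preds.max())  # the same value for every input, i.e. the output is stuck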
Any thoughts on this? What could I be doing wrong?
I want to try multiple approaches; I think I will also try using Librosa to extract tabular audio characteristics from just the first few hundred ms of each file, and feed them to decision trees/XGBoost or some other classification algorithm, to see whether the drums can be categorized (multiclass or boolean) based on these few relevant features.
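Roughly what I have in mind (a sketch; the particular features, the window length, and the reuse of wav_files and Y from the preprocessing sketch above are all just illustrative):

import numpy as np
import librosa
from xgboost import XGBClassifier

def tabular_features(path, duration=0.2):
    # load only the first few hundred ms of the file
    y, sr = librosa.load(path, sr=None, duration=duration)
    return [
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        librosa.feature.zero_crossing_rate(y).mean(),
        librosa.feature.rms(y=y).mean(),
    ]

X_tab = np.array([tabular_features(f) for f in wav_files])  # (N, num_features)
y_tab = Y.reshape(-1)                                       # (N,)

clf = XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X_tab, y_tab)
print(clf.score(X_tab, y_tab))  # sanity check on the training set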