Issues using RNN for drum sound classification

Also, someone here said they got good results with different hyperparameters:

Why does it need to be a 3D tensor?

I’m a little unclear what your spectrograms consist of. Is the horizontal axis time, or frequency? Are they 2D or 3D data?

I thought all inputs to NNs in Keras had to be a 3D tensor of dimensions (batch_size, sequence_length, no_features).

The horizontal axis is not time, but rather the wav file number.
Each wav file results in a (1, 1025) spectrum that represents the frequency composition of the whole file; it is no longer an STFT with one frame per time step of the wav file.
Each wav file is therefore only one entry along the x-axis, and the y-axis has length 1025, where 1025 is the number of frequency bins.

So, for example, if there are 400 wav files, the input X_train will be 400x1025.
I then reshape it to (1, 400, 1025), as I thought Keras expects 3D inputs.
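For example, here is a minimal sketch of what I mean, assuming 400 files and 1025 frequency bins (the numbers and names are just illustrative):

import numpy as np

num_files, num_bins = 400, 1025              # assumed sizes, just for illustration
X_train = np.zeros((num_files, num_bins))    # one averaged spectrum per wav file

# Reshape to 3D because I thought Keras needs (batch_size, sequence_length, no_features).
# This makes Keras see ONE sample that is a sequence of 400 "time steps",
# rather than 400 independent samples.
X_train_3d = X_train.reshape(1, num_files, num_bins)
print(X_train_3d.shape)                      # (1, 400, 1025)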

Sorry, I still do not understand your training data set structure.

TensorFlow data sets do not have to be 3D.

For example, if you were doing the classic "identification of handwritten digits" exercise, the training matrix would be of size (5,000 x 400), where each example is a flattened 20x20 gray-scale image, and there are 5,000 examples in the training set (500 each of the ten digits 0 through 9).
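A minimal sketch of such a model in Keras, just to illustrate the shapes (the layer sizes are arbitrary and the data here is random):

import numpy as np
from tensorflow import keras

X = np.random.rand(5000, 400)                # 5000 flattened 20x20 gray-scale images
y = np.random.randint(0, 10, size=5000)      # digit labels 0 through 9

model = keras.Sequential([
    keras.layers.Input(shape=(400,)),        # 2D training matrix: (examples, features)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)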

Sorry, I am apparently not so good at explaining this.
Let's look at the code:

import numpy as np
import librosa

# N_FFT, HOP_SIZE, list_files and the per-file label are assumed to be defined elsewhere.
# Start with an empty array so that np.vstack works on the first iteration.
X_train = np.empty((0, 1 + N_FFT // 2))   # 1 + N_FFT//2 = 1025 frequency bins per row
y_train = []

# FOR LOOP THAT ITERATES OVER ALL AUDIO FILES
for x, file_path in enumerate(list_files):

    # Load the audio file
    audio, sr = librosa.load(file_path, mono=True)

    # Turn the audio into a spectrogram.
    # The STFT returns one spectrum per consecutive time window of the audio file
    stft = librosa.stft(audio, n_fft=N_FFT, hop_length=int(HOP_SIZE * sr))

    # Calculate the magnitude in dB
    magnitude, phase = librosa.magphase(stft)
    magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)

    # Calculate the mean value of the spectrum over all time steps,
    # i.e. the average content of each frequency component for the whole wav file.
    # This vertically stacks a (1, 1025) array onto X_train for each audio file.
    # One could also simply compute the Fourier transform of each wav file instead of
    # taking the STFT and averaging it over time; the mean was just a quick workaround.
    X_train = np.vstack((X_train, np.mean(magnitude_db.T, axis=0)))

    # For each audio clip, append the name of its folder to the label array.
    # This is later converted to a bool value based on whether it is an overhead or not.
    y_train.append(label)
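So after the loop, assuming for example 400 wav files, the shapes end up like this:

print(X_train.shape)   # (400, 1025): one averaged spectrum per wav file
print(len(y_train))    # 400: one folder-name label per wav file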

I thought a feedforward network also needs 3D input, but it seems that is only the case for RNNs and sequential data, which makes sense.

However, as someone replied to me on StackExchange, it seems that without early stopping and with a lower timestep the solution converges, so apparently it can also take a 3D array.

It seems that the feedforward NN gives good results now. Not only is it able to distinguish whether or not a sound is a cymbal, it can also classify all the drums separately based on the total frequency distribution, with over 95% accuracy.
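For reference, the feedforward model I mean is roughly the following (only a sketch; the layer sizes and the number of drum classes are assumptions, and y_train_int stands for the integer-encoded labels):

from tensorflow import keras

NUM_CLASSES = 5                              # assumed number of drum classes

ff_model = keras.Sequential([
    keras.layers.Input(shape=(1025,)),       # one averaged spectrum per wav file
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
ff_model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# ff_model.fit(X_train, y_train_int, epochs=50, validation_split=0.2)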

It makes me wonder why the RNN doesn't give good results, since its input contains the same information about the spectrum plus additional information about how the spectrum changes over time (the dynamic envelope), so it should be able to learn even more distinctions between the drum sounds.

I must be doing something wrong, and in order to use RNNs in the future, I would like to understand whether the way I input the information is incorrect. It's quite frustrating not being able to make it work :slight_smile:
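Just to make explicit what I mean by "the same information plus how it changes over time": as I understand it, the RNN input would have to be one sequence of STFT frames per wav file, padded to a common length, rather than the single averaged spectrum. A rough sketch of that shaping, with hypothetical names (sequences would be the per-file magnitude_db.T arrays collected in the loop above):

import numpy as np
from tensorflow import keras

NUM_CLASSES = 5                                   # assumed number of drum classes
# sequences: a list with one (timesteps_i, 1025) array per wav file (assumed collected elsewhere)
max_len = max(s.shape[0] for s in sequences)
X_seq = np.zeros((len(sequences), max_len, 1025))
for i, s in enumerate(sequences):
    X_seq[i, :s.shape[0], :] = s                  # zero-pad shorter files

rnn_model = keras.Sequential([
    keras.layers.Input(shape=(max_len, 1025)),    # (sequence_length, no_features) per sample
    keras.layers.Masking(mask_value=0.0),         # ignore the zero-padded frames
    keras.layers.LSTM(64),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])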