Issues using RNN for drum sound classification

Also, someone here said they got good results with different hyperparameters:

Why does it need to be a 3D tensor?

I’m a little unclear what your spectrograms consist of. Is the horizontal axis time, or frequency? Are they 2D or 3D data?

I thought all inputs to NNs in Keras had to be a 3D tensor of dimensions (batch_size, sequence_length, no_features).

The horizontal axis is not time, but rather the wav file number.
Each wav file results in a (1, 1025) spectrum that represents the frequency composition of the whole file; it is no longer an STFT with one frame per time step of the wav file.
Each wav file is therefore only one entry along the x-axis, and the y-axis has length 1025, where 1025 is the number of frequency bins.

So, for example, if there are 400 wav files, the input X_train will be 400x1025.
I then reshape it to (1, 400, 1025), as I thought Keras expects 3D inputs.
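For example, here is a minimal sketch of what I mean, assuming 400 files and 1025 frequency bins (the numbers and names are just illustrative):

import numpy as np

num_files, num_bins = 400, 1025              # assumed sizes, just for illustration
X_train = np.zeros((num_files, num_bins))    # one averaged spectrum per wav file

# Reshape to 3D because I thought Keras needs (batch_size, sequence_length, no_features).
# This makes Keras see ONE sample that is a sequence of 400 "time steps",
# rather than 400 independent samples.
X_train_3d = X_train.reshape(1, num_files, num_bins)
print(X_train_3d.shape)                      # (1, 400, 1025)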

Sorry, I still do not understand your training data set structure.

TensorFlow data sets do not have to be 3D.

For example, if you were doing the classic "identification of handwritten digits" exercise, the training matrix would be of size (5,000 x 400), where each example is a flattened 20x20 gray-scale image, and there are 5,000 examples in the training set (500 each of the ten digits 0 through 9).
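A minimal sketch of such a model in Keras, just to illustrate the shapes (the layer sizes are arbitrary and the data here is random):

import numpy as np
from tensorflow import keras

X = np.random.rand(5000, 400)                # 5000 flattened 20x20 gray-scale images
y = np.random.randint(0, 10, size=5000)      # digit labels 0 through 9

model = keras.Sequential([
    keras.layers.Input(shape=(400,)),        # 2D training matrix: (examples, features)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)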

Sorry, I am apparently not so good at explaining this.
Let's look at the code:

import numpy as np
import librosa

# N_FFT, HOP_SIZE, list_files and the per-file label are assumed to be defined elsewhere.
# Start with an empty array so that np.vstack works on the first iteration.
X_train = np.empty((0, 1 + N_FFT // 2))   # 1 + N_FFT//2 = 1025 frequency bins per row
y_train = []

# FOR LOOP THAT ITERATES OVER ALL AUDIO FILES
for x, file_path in enumerate(list_files):

    # Load the audio file
    audio, sr = librosa.load(file_path, mono=True)

    # Turn the audio into a spectrogram.
    # The STFT returns one spectrum per consecutive time window of the audio file
    stft = librosa.stft(audio, n_fft=N_FFT, hop_length=int(HOP_SIZE * sr))

    # Calculate the magnitude in dB
    magnitude, phase = librosa.magphase(stft)
    magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)

    # Calculate the mean value of the spectrum over all time steps,
    # i.e. the average content of each frequency component for the whole wav file.
    # This vertically stacks a (1, 1025) array onto X_train for each audio file.
    # One could also simply compute the Fourier transform of each wav file instead of
    # taking the STFT and averaging it over time; the mean was just a quick workaround.
    X_train = np.vstack((X_train, np.mean(magnitude_db.T, axis=0)))

    # For each audio clip, append the name of its folder to the label array.
    # This is later converted to a bool value based on whether it is an overhead or not.
    y_train.append(label)
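So after the loop, assuming for example 400 wav files, the shapes end up like this:

print(X_train.shape)   # (400, 1025): one averaged spectrum per wav file
print(len(y_train))    # 400: one folder-name label per wav file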

I thought a feedforward network also needs 3D input, but it seems that is only the case for RNNs and sequential data, which makes sense.

However, as someone replied to me on StackExchange, it seems that without early stopping and with a lower timestep the solution converges, so apparently it can also take a 3D array.

It seems that the feedforward NN gives good results now. Not only is it able to distinguish whether or not a sound is a cymbal, it can also classify all the drums separately based on the total frequency distribution, with over 95% accuracy.
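For reference, the feedforward model I mean is roughly the following (only a sketch; the layer sizes and the number of drum classes are assumptions, and y_train_int stands for the integer-encoded labels):

from tensorflow import keras

NUM_CLASSES = 5                              # assumed number of drum classes

ff_model = keras.Sequential([
    keras.layers.Input(shape=(1025,)),       # one averaged spectrum per wav file
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
ff_model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# ff_model.fit(X_train, y_train_int, epochs=50, validation_split=0.2)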

It makes me wonder why the RNN doesn't give good results, since its input contains the same information about the spectrum plus additional information about how the spectrum changes over time (the dynamic envelope), so it should be able to learn even more distinctions between the drum sounds.

I must be doing something wrong, and in order to use RNNs in the future, I would like to understand whether the way I input the information is incorrect. It's quite frustrating not being able to make it work :slight_smile:
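Just to make explicit what I mean by "the same information plus how it changes over time": as I understand it, the RNN input would have to be one sequence of STFT frames per wav file, padded to a common length, rather than the single averaged spectrum. A rough sketch of that shaping, with hypothetical names (sequences would be the per-file magnitude_db.T arrays collected in the loop above):

import numpy as np
from tensorflow import keras

NUM_CLASSES = 5                                   # assumed number of drum classes
# sequences: a list with one (timesteps_i, 1025) array per wav file (assumed collected elsewhere)
max_len = max(s.shape[0] for s in sequences)
X_seq = np.zeros((len(sequences), max_len, 1025))
for i, s in enumerate(sequences):
    X_seq[i, :s.shape[0], :] = s                  # zero-pad shorter files

rnn_model = keras.Sequential([
    keras.layers.Input(shape=(max_len, 1025)),    # (sequence_length, no_features) per sample
    keras.layers.Masking(mask_value=0.0),         # ignore the zero-padded frames
    keras.layers.LSTM(64),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])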