AI cannot do well on a training set

Hey people, I am doing an audio ML project and would love your feedback on how to progress.
I am working on a model that takes in Mel spectrograms of shape (128, 128, 1), computed from audio recorded by old software, and enhances their quality.
To make sure that the model is working, I am taking a sound pulled from YouTube and passing the same clip as both x and y (input and target).
I have a deep model that is similar to the U-Net architecture: an encoder reduces the representation of the Mel spectrograms and a decoder expands it back to its original shape.
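To make the shapes concrete, here is a rough NumPy sketch of the encoder/decoder path. It is not my actual model: the pooling, the upsampling, the depth of 3 and the function names are placeholders, there are no learned convolutions, and it assumes skip connections like a standard U-Net has.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling on an (H, W, C) array; stands in for an encoder stage
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    # nearest-neighbour 2x upsampling; stands in for a decoder stage
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encode_decode(x, depth=3):
    # hypothetical helper: shape flow only, no learned weights
    skips = []
    for _ in range(depth):
        skips.append(x)          # keep the pre-pooling features for the skip connection
        x = downsample(x)
    bottleneck_shape = x.shape
    for skip in reversed(skips):
        x = upsample(x)
        x = np.concatenate([x, skip], axis=-1)  # skip connection: channels grow
    return x, bottleneck_shape

spectrogram = np.random.default_rng(0).standard_normal((128, 128, 1))
decoded, bottleneck_shape = encode_decode(spectrogram)
print(bottleneck_shape, decoded.shape)
```

In this sketch the last channel of the decoder output is an exact copy of the input, because the outermost skip connection carries it straight across. So with skip connections, the identity mapping should be very easy to represent.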
From my understanding, a properly defined model should be able to reconstruct this audio almost perfectly, but it is producing terrible results: the output audio isn’t even recognizable as speech anymore.
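For reference, this is the kind of behaviour I expect from the identity check, shown on a toy problem. Everything here is a stand-in: a single random 128-bin frame instead of a real spectrogram, a single linear layer `W` instead of my network, and plain gradient descent on MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                 # stand-in for one 128-bin Mel frame
y = x.copy()                                 # identity target: output should equal input
W = 0.01 * rng.standard_normal((128, 128))   # hypothetical one-layer linear "model"

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(W @ x, y)

# step size scaled by the input energy so the residual halves on every update
lr = 0.25 * x.size / np.dot(x, x)
for step in range(50):
    residual = W @ x - y
    grad = (2.0 / x.size) * np.outer(residual, x)  # gradient of the MSE w.r.t. W
    W -= lr * grad

final_loss = mse(W @ x, y)
print(initial_loss, final_loss)
```

Even this trivial model drives the loss to essentially zero on one example, which is why I would expect a much larger network to at least come close on the same task.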
I have been working on this for a couple of weeks and would love to hear your feedback.
Note: The loss function I am using is SI-SDR plus 0.5 times the MSE (I tried each one alone and it was equally bad).
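To be explicit about the loss, here is a minimal NumPy sketch of the combination. It assumes the usual SI-SDR definition (project the estimate onto the target, then take the energy ratio in dB) and assumes the SI-SDR term is negated so that lower is better; the names `si_sdr`, `combined_loss`, `mse_weight` and `eps` are only for illustration.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    # scale-invariant signal-to-distortion ratio in dB (higher is better)
    estimate = np.asarray(estimate, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    # optimal scaling of the target that best explains the estimate
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps)
    return 10.0 * np.log10(ratio)

def combined_loss(estimate, target, mse_weight=0.5):
    # assumption: SI-SDR is negated so that minimizing the loss maximizes SI-SDR
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((estimate - target) ** 2)
    return -si_sdr(estimate, target) + mse_weight * mse
```

Since SI-SDR ignores the overall scale of the estimate, only the MSE term ties the output level to the target in this combination.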
What do you think is causing this? Would love to hear new perspectives!
