If the output is an audio file, how should it be processed?

I’m trying to build a model that takes two numeric inputs (A) and produces an audio file as its output (B).
(I’m not trying to make a TTS.)
I plan to use only 2 or 3 neural networks.

When trying to do machine learning, how should the audio file be processed and presented? Can I just put the file in?

Note: I completed the course https://www.coursera.org/learn/machine-learning-course/ on Coursera.

Hi @Golden_Hwang ,

Welcome to the community! This is your first post :slight_smile:

Regarding your question: “how should the audio file be processed and presented? Can I just put the file in?”

I am a beginner in audio, so all I can do right now is give you some general ideas about what you are trying to do.

First, audio files cannot be fed directly into the model.

Just as with images, audio files need to go through a series of transformations before they are given to the model.

In the image case, we take the image, reshape it into a vector of pixel values, normalize those values, and maybe crop the image to get a fixed size, etc.

With audio we have to follow a similar process of transformations:

  • We read the audio file and reshape it to a standard shape, for example 1 audio channel (mono) or 2 audio channels (stereo).
  • Then we standardize the sampling rate so that every file has the same number of samples per second (Hz). This means, for instance, that one second of audio always maps to an array of the same size.
  • Next, we resize each array so that all arrays have the same length, by either truncating or padding with zeros.
  • Finally, we create a spectrogram to capture the main features of the audio.
  • Before and/or after the spectrogram, we can do some data augmentation.
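
To make the steps above a bit more concrete, here is a rough sketch in Python using only NumPy. It is not a production pipeline: the function names, the 16 kHz target rate, the 1-second length, and the `n_fft`/`hop` values are all assumptions I picked for illustration, and the resampling is a naive linear interpolation (a real project would more likely use an audio library with proper filtering). Data augmentation is left out.

```python
import numpy as np

def to_mono(audio):
    """Average the channels so the signal has shape (n_samples,).
    Assumes multi-channel input is shaped (n_samples, n_channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)

def resample(audio, orig_sr, target_sr):
    """Naive linear-interpolation resampling (no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    old_t = np.arange(len(audio)) / orig_sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, audio).astype(np.float32)

def fix_length(audio, n_samples):
    """Truncate, or zero-pad at the end, to exactly n_samples."""
    if len(audio) >= n_samples:
        return audio[:n_samples]
    return np.pad(audio, (0, n_samples - len(audio)))

def spectrogram(audio, n_fft=512, hop=256):
    """Log-magnitude spectrogram with shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def preprocess(audio, orig_sr, target_sr=16000, seconds=1.0):
    """Channels -> sampling rate -> fixed length -> spectrogram."""
    mono = to_mono(audio)
    mono = resample(mono, orig_sr, target_sr)
    mono = fix_length(mono, int(target_sr * seconds))
    return spectrogram(mono)

# Example: half a second of a 440 Hz stereo tone recorded at 44.1 kHz.
sr = 44100
t = np.arange(int(0.5 * sr)) / sr
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([tone, tone], axis=1)

spec = preprocess(stereo, sr)
print(spec.shape)
```

Every clip that goes through `preprocess` with the same settings comes out with the same shape, which is what the model needs.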

At this point the audio file is transformed and ready to be fed to the model.

So this is, in general terms, how you’d prepare an audio file to be used by a model.

I hope this gets you started. As I learn more about this specific topic, I can share more about it.

Good luck with your project! Please share any findings!


Thank you for your kind and detailed reply.

Are there any Coursera courses where I can practice this?
The process seems more complicated than I thought…

@Golden_Hwang ,

Yes, I was also surprised about the requirements to process audio. In fact, even now that I know a bit more, I still struggle with my specific use case.

I could not find this information on Coursera. One of the best sources of information I have found is this one:


You’ll find a step-by-step guide right there.

I hope this helps :slight_smile:


Thank you so much again. :laughing:
