Adding car noise to the voice recognition dataset

Hello everyone,

This is a practical question, not a theoretical one.

In the course, Mr. Ng explains that you can combine clean audio with car noise to get more realistic input for in-car voice recognition.
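To make the idea concrete, here is a minimal sketch of how I understand the mixing step. It is not taken from the course: the function names, the 16 kHz sample rate, the sine wave standing in for speech, and the Gaussian noise standing in for a car recording are all my own assumptions. It uses plain Python lists to stay dependency-free; a real pipeline would more likely use NumPy arrays loaded from audio files.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean audio at a chosen signal-to-noise ratio in dB.

    Hypothetical helper for illustration, not a course or library API.
    """
    # Loop the noise clip so it covers the whole clean clip.
    looped = [noise[i % len(noise)] for i in range(len(clean))]
    # Scale the noise so that clean power / noise power = 10^(snr_db / 10).
    target_ratio = 10 ** (snr_db / 10)
    scale = math.sqrt(mean_power(clean) / (mean_power(looped) * target_ratio))
    return [c + scale * n for c, n in zip(clean, looped)]


random.seed(0)
sample_rate = 16000  # assumed sample rate
# Stand-in for one second of clean speech: a 220 Hz tone.
clean = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
# Stand-in for half a second of recorded car noise.
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate // 2)]

mixed = mix_at_snr(clean, noise, snr_db=10)
```

Presumably one would repeat this with several noise recordings and several SNR values, which is exactly where my question about the variety of cars and microphones comes in.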

This made me wonder about the generalization abilities of deep learning algorithms.
Surely, whether you are in a car from the 70s, a recent and expensive car, or an electric car, the surrounding noise captured by the mic will be very different. Even the type of microphone will produce vastly different input. Audio captured by a bad mic will probably be as easy for a human to understand as audio from a good one (because mics are designed for the human ear). For an AI, however, the two audio inputs might be completely dissimilar.

Therefore, I have two questions:

  1. How well can an AI trained only with surrounding sounds from a 70s car with a bad mic generalize to a good mic in an electric car? And vice versa?
  2. If two inputs sound similar to a human, does that mean they will also sound similar to the AI? Or, on the contrary, should one be careful about assuming things are similar when they are only similar for us and not for an AI?

Best regards

These are my thoughts as a fellow learner, so take them with a grain of salt.

I have no practical experience with DL-based audio/speech recognition, so I can’t attest to the generalization capabilities of actual, modern DL audio recognition systems. However, the course makes me think that it doesn’t matter how well (or not) it initially generalizes as long as you have concrete directions for how to improve the model.

Regarding the second question, the course makes it quite clear that you should not assume that human perception of similarity means the AI also perceives similarity. Dr. Ng uses the example of 20 different car models in a video game: they look acceptable to us, but they cover only a tiny part of the variety of cars in the real world, so a model can overfit to them. This seems to be an issue one should think about particularly when using data synthesis/augmentation.
