Course 1 - Week 3 - Label consistency: unintelligible tag

Question 1: effect of the use of ‘unintelligible’ tag on the model
For the hard-to-recognize sounds, why would we choose to use the ‘unintelligible’ tag instead of removing the example from the dataset? Is it because doing so would enable the model to return ‘unintelligible’ words when it faces similar unrecognizable sounds?

Question 2: data point or noise?
In one of the previous videos, the instructor said that consistent and clean data are important, especially for small datasets. So my questions are:

  • do we treat the hard-to-recognize recordings as noise?
  • suppose that we can only have a few labels, would it be better to remove the recording?

Q1: The idea is to handle cases that the model cannot clearly classify or detect. These examples can be reviewed manually and given correct labels, so that the model can improve in the future.
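As a rough illustration of that manual-review step, here is a minimal Python sketch. It assumes transcripts are stored as plain strings and that the tag is spelled `[unintelligible]`. The tag spelling, the `split_for_review` helper, and the sample clips are all hypothetical, not something taken from the course materials.

```python
# Hypothetical spelling of the tag; use whatever your labeling guide specifies.
UNINTELLIGIBLE_TAG = "[unintelligible]"

def split_for_review(examples):
    """Split (clip_id, transcript) pairs into clear ones and ones to review by hand."""
    clear, needs_review = [], []
    for clip_id, transcript in examples:
        if UNINTELLIGIBLE_TAG in transcript:
            needs_review.append((clip_id, transcript))
        else:
            clear.append((clip_id, transcript))
    return clear, needs_review

# Made-up sample data, for illustration only.
examples = [
    ("clip_001", "um, nearest gas station"),
    ("clip_002", "nearest [unintelligible]"),
]

clear, needs_review = split_for_review(examples)
print("clear:", clear)
print("needs manual review:", needs_review)
```

Keeping the tagged clips in the dataset, instead of deleting them, is what makes it possible to find them again later and relabel them by hand.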

Q2: When you have a large dataset, you can afford to have a few mistakes. That is not the case with a smaller dataset, where you want near-perfect labels for your data. 1 incorrect label out of 100 is better than 1 incorrect out of 10, because the same single mistake is a bigger percentage of a smaller dataset.
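The arithmetic behind that comparison can be sketched in a few lines of Python. The `label_error_rate` helper is just an illustrative name, and the counts are the ones from the example above.

```python
def label_error_rate(num_incorrect: int, dataset_size: int) -> float:
    """Return the fraction of examples whose label is wrong."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_incorrect / dataset_size

small = label_error_rate(1, 10)    # one bad label in a 10-example dataset
large = label_error_rate(1, 100)   # one bad label in a 100-example dataset

print(f"small dataset: {small:.0%} of labels wrong")  # 10%
print(f"large dataset: {large:.0%} of labels wrong")  # 1%
```

The same single mistake corrupts ten times as large a share of the small dataset.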