When training a neural network, we typically use training, development (dev), and test sets. We’ve identified some mislabeled data in these sets, and during training, we found that the proportion of mislabeled data in the training set was higher than in the dev and test sets. After correcting the mislabeled data during error analysis and retraining the model, I’m concerned that the training set might overshadow the improvements made to the dev set. Specifically, could the issues in the training set impact the model’s performance on the dev and test sets, and will the model’s ability to generalize be compromised?
What do you mean?
The training set is what the model learns from, so if it learns from wrong labels, its predictions on the dev and test sets will suffer as well!
How much it is affected depends on the proportion of mislabeled examples and on how critical your application is.
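To get a rough feel for that proportion, you can hand-review a sample of the dev examples the model got wrong and count how many were wrong only because of a bad label. Here is a minimal sketch of that arithmetic; the function name `label_error_breakdown` and all the numbers are made up for illustration, not taken from your data.

```python
def label_error_breakdown(dev_error, reviewed, mislabeled):
    """Split the overall dev error into the share caused by bad labels and the rest.

    dev_error  -- overall dev-set error rate, as a fraction (0.10 means 10%)
    reviewed   -- number of misclassified dev examples inspected by hand
    mislabeled -- how many of those turned out to have an incorrect label
    """
    if reviewed <= 0 or not 0 <= mislabeled <= reviewed:
        raise ValueError("need reviewed > 0 and 0 <= mislabeled <= reviewed")
    frac_mislabeled = mislabeled / reviewed
    error_from_labels = dev_error * frac_mislabeled
    error_from_other = dev_error - error_from_labels
    return frac_mislabeled, error_from_labels, error_from_other


# Hypothetical numbers: 10% dev error, 100 errors reviewed, 6 caused by bad labels.
frac, from_labels, from_other = label_error_breakdown(0.10, 100, 6)
print(f"Share of reviewed errors caused by labels: {frac:.0%}")
print(f"Dev error attributable to labels: {from_labels:.1%}")
print(f"Dev error from other causes: {from_other:.1%}")
```

If the label-related share is small compared with the other causes, fixing labels is probably not the best use of your time yet; if it is a large share, it is worth cleaning up.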
Yes, I’ve seen in deep learning and neural network courses that correcting mislabeled data in the development (dev) and test sets, and then retraining the model, can sometimes be effective. However, this approach doesn’t always yield the desired results.
After conducting some research and consulting with colleagues, I’ve found a potential solution: collecting additional data related to the specific mislabeled examples and incorporating this new data into the dev set before retraining the model. This approach might address the issue effectively but could also be costly.
I’m interested in exploring alternative methods as well, given the potential expense of this solution. If you have any other suggestions or strategies for dealing with mislabeled data without incurring high costs, I would greatly appreciate your input.
Fixing mislabeled examples is just part of the data cleaning process.
It’s not really a machine learning issue.