I am not sure I agree with the correct answer to this question. I agree that because the distribution changes (we have a new bird species), we need to reconsider our metrics. However, I do not understand why that should be the first step. My initial intuition would be to augment those examples as much as possible (so that we have more data about the new species), add them to the dataset, reshuffle into a new train/dev/test split, and then define a new evaluation metric on the new dev/test distributions.
Can you please help me understand why my intuition is incorrect in this case?
Thank you in advance,
Endrit
PS: I am sorry if I am giving away too much of the answer to the question. Feel free to delete any part you consider inappropriate, or message me directly for further discussion.
Thank you for your reply and for suggesting the thread!
I think I now understand why it would not be a good idea to reshuffle the data into the dev/test sets. However, I still do not understand why my intuition about augmenting the dataset is not a good thing to try. I have also watched the videos of week 2, where Prof. Andrew discusses the cons of artificial data synthesis.
I guess I do not understand when one should and should not try artificial data synthesis. In my opinion, since we have a very small amount of data, it makes sense to synthesize some examples (even though they might cover only a small subset of all possible examples) and include them in the training set. Then we could use a train-dev set to check how good our fit actually is. I would expect that approach to be better than having no data at all. Can you please explain this a bit further, or, if possible, suggest some material where I can read more on the matter?
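For what it's worth, here is a toy sketch of the pipeline I have in mind. Everything here is made up for illustration: the data is fake, and `synthesize` is just a stand-in for real augmentation (flips, crops, etc. on images). The point is only that synthetic examples would go into the training pool, never into dev/test:

```python
import random

random.seed(0)

# Hypothetical data: each example is a (features, label) pair.
# "new_species" is the rare class we only have a handful of examples for.
common = [([i, i + 1], "common") for i in range(1000)]
rare = [([i, -i], "new_species") for i in range(10)]

def synthesize(example, n):
    """Toy stand-in for augmentation (e.g. flips/crops on images):
    jitter the features slightly to create n synthetic variants."""
    feats, label = example
    return [([f + random.uniform(-0.1, 0.1) for f in feats], label)
            for _ in range(n)]

# Dev/test keep the original, un-augmented distribution we care about.
random.shuffle(common)
dev = common[:100] + rare[:2]
test = common[100:200] + rare[2:4]

# Synthetic examples go into the training pool ONLY.
train_pool = common[200:] + rare[4:]
for ex in rare[4:]:
    train_pool += synthesize(ex, 50)
random.shuffle(train_pool)

# Carve a train-dev set out of the (augmented) training distribution,
# to diagnose variance separately from the train/dev mismatch.
train_dev = train_pool[:100]
train = train_pool[100:]

print(len(train), len(train_dev), len(dev), len(test))  # → 1006 100 102 102
```

Since dev and test are never touched by `synthesize`, the evaluation distribution stays honest, and the train-dev set tells me whether the model even fits the augmented training distribution.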
In this case, the new species of bird won't be present in the set we evaluate on. In other words, the data distribution we evaluate on will be farther from the true data distribution. That will likely lead to good results on evaluation and poor results on the real data.
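To make this concrete, here is a toy illustration. The classifier and the class proportions are invented for the example: a model that never learned the rare species looks fine on a dev set that lacks it, but degrades on the true distribution:

```python
# A classifier that always predicts "common" (mimicking a model that
# never learned the rare species).
def predict(example):
    return "common"

def accuracy(dataset):
    """Fraction of (features, label) pairs the classifier gets right."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

# Dev set drawn only from the old distribution (no new species).
dev_old = ([((i,), "common") for i in range(95)]
           + [((i,), "other") for i in range(5)])

# Hypothetical deployment distribution: 10% of birds are the new species.
true_dist = ([((i,), "common") for i in range(85)]
             + [((i,), "other") for i in range(5)]
             + [((i,), "new_species") for i in range(10)])

print(accuracy(dev_old))    # → 0.95  (looks good)
print(accuracy(true_dist))  # → 0.85  (reality is worse)
```

The 10-point gap is exactly the mismatch between the evaluation distribution and the true one; the dev set should be drawn from the distribution you actually care about.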
I understand your point and fully agree with it. However, I would guess that the same argument could be made in every case where one attempts artificial data synthesis. I still cannot distinguish when it would be helpful to use it and when it would not. Can you please elaborate a bit further on this matter?
I believe the general advice is to start without it; it is easy to add later if needed. There are different types of augmentation, and you can get different results for different tasks and data, so you need to experiment.