Hi, I have tried a couple of times on this particular question from the week 1 quiz, Bird Recognition in the City of Peacetopia.
Week 1 quiz problem: Over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your data is being tested on a new type of data. There are only 1,000 images of the new species. The city expects a better system from you within the next 3 months. Which of these should you do first?
On my first try I got it wrong. The hint provided was: "No. The First you'll need more data so augmenting the existing data to create more training examples would be the next step".
Then I took the quiz again and got it wrong again. This time I chose "try augmentation/data synthesis to get more images of the new type of bird", and the hint said: "The true data distribution is changed. It means you need to adjust your evaluation. Because you evaluate your learning algorithm on dev and test sets, adding more data only to the training set doesn't help the algorithm to perform better."
So it looks to me like the first hint suggests "augment your data to get more images", but the second hint says to redefine the evaluation. I am confused... are both of them correct, then?
@sxl269 I vaguely remember seeing this question...
I went back to look at the results I got, but unfortunately this question is not in the version of the quiz I passed.
Can you refresh my memory: is this a "select all that apply" question or a "select only one" question?
From the question alone, though, yes: the obvious problem is that our test distribution has changed from the one we trained on. The new species is something the model has never seen before, so it is not good at predicting it. Thus:
Hint 1) 1,000 new images is a very small amount compared to the original dataset (I think it was something like 100,000 or a million), and obviously there are enough of these new birds that they are causing detection issues (i.e. there are probably far more than a thousand new birds around the city). So any additional data or, in lieu of that, data augmentation (i.e. synthetic data) will help rebalance the distribution.
Hint 2) I think the key point here is that when you augment or add data, it matters a lot where you add it. It should be spread evenly over all parts: the new images shouldn't go only into the training set, but also, in the same measure, into your dev and test sets.
This ensures you are fully matching the new distribution "across the board".
I mean, think about it: if you train your model on a new training set that includes the new bird, but it then never sees any examples of this new bird when you evaluate it on the test set, how do you have any indication that it is any better?
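To make the "same measure across train, dev and test" idea concrete, here is a minimal sketch of how the 1,000 new-species images might be divided. The function name `split_new_species`, the image IDs, and the 50/25/25 fractions are all made up for illustration; the quiz does not prescribe any particular ratio.

```python
import random

def split_new_species(image_ids, dev_frac=0.25, test_frac=0.25, seed=0):
    """Shuffle the new-species images and split them into train/dev/test parts.

    The fractions are illustrative assumptions, not values from the quiz.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_dev = int(len(ids) * dev_frac)
    n_test = int(len(ids) * test_frac)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return train, dev, test

# Hypothetical IDs standing in for the 1,000 new-species images.
new_ids = [f"new_bird_{i:04d}" for i in range(1000)]
train_new, dev_new, test_new = split_new_species(new_ids)
print(len(train_new), len(dev_new), len(test_new))  # 500 250 250
```

Each part would then be merged into the corresponding existing set, so that the dev and test sets actually contain the new bird.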
Hope this helps at least a little... I know you can't post the full question (you could PM it to me privately, though), so I am working from memory here.
Hi @sxl269
You didn't mention what answer you selected on your first try.
The two hints provided make two basic points.
- You require more data similar to the existing data, i.e. with the same data distribution, now including images of the new species. But the criterion here is not only adding more data for the new species; it is also having a dev/test set that includes the new species, which is what is causing the system performance to slowly degrade.
- The second hint states it correctly: the data distribution changes because the 1,000 added images are of a new species, not of the same species as before, so the evaluation as a whole needs to make sure the dev/test set takes this new species into account.
In that case, I am assuming your first choice was "Put the 1,000 images into the training set so as to try to do better on these birds."
So ideally, when you get a new set of data, how would you include it in your system to get better performance on this new species?
You would need to redefine your dataset based on the new species data you have, to get a new dev/test set! That way the data distribution of the dev/test set reflects the new species, and your evaluation no longer hides how the system does on it.
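As a rough illustration of why the redefined dev set matters, the sketch below reports accuracy separately for the old species and the new one. The helper `accuracy_by_group` and the toy labels and predictions are invented for this example, not taken from the quiz.

```python
def accuracy_by_group(labels, predictions, groups):
    """Return {group: accuracy} for each group name found in `groups`."""
    correct = {}
    total = {}
    for y, p, g in zip(labels, predictions, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (1 if y == p else 0)
    return {g: correct[g] / total[g] for g in total}

# Toy dev set: 1 = bird, 0 = not a bird (made-up values for illustration).
labels      = [1, 1, 0, 1, 1, 1, 0, 1]
predictions = [1, 1, 0, 1, 0, 0, 0, 1]
groups      = ["old", "old", "old", "old", "new", "new", "new", "new"]

print(accuracy_by_group(labels, predictions, groups))  # {'old': 1.0, 'new': 0.5}
```

If the dev set held no "new" examples at all, the weak number for the new species would simply never show up.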
Regards
DP
Thanks for the reply. I understand all your reasoning, but my confusion is why we can't choose both
"Use the data you have to define a new evaluation metric (using a new dev/test set) ..." and also "try data augmentation/data synthesis to get more images of the new type of bird."
To me, those two hints refer to these two choices.
@sxl269 Well... I mean, part of the question is that it explicitly asks which one you would do first. Or perhaps it is trying to ask "which one is the most important / would give you the best up-front results".
Hi @sxl269
As the question clearly asks what you would do first: adding any new type of data requires you to redefine your dataset based on the new data added.
Data augmentation or data synthesis comes only after that. First the dataset is defined from the older data plus the new species data, with the new species identified by their features (for example, white pigeons or Asian pigeons). Then the data is split into dev and test sets. Only then, if the training set turns out to have too few examples of the new species, does data augmentation come into the picture.
So the first step is to evaluate how much the new species data matters relative to the current dev/test set.
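The ordering described above (set aside real images for evaluation first, augment only the remaining training images afterwards) can be sketched as follows. This is only an illustration: the names `hflip` and `holdout_then_augment` are made up, a horizontal flip stands in for whatever augmentation you would really use, and the "images" are tiny lists of numbers.

```python
import random

def hflip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def holdout_then_augment(images, n_eval, seed=0):
    """Hold out real images for dev/test first, then augment only the rest."""
    shuffled = list(images)
    random.Random(seed).shuffle(shuffled)
    eval_images = shuffled[:n_eval]   # real images only, never augmented
    train_images = shuffled[n_eval:]
    # Augmentation is applied to the training portion only.
    augmented_train = train_images + [hflip(img) for img in train_images]
    return augmented_train, eval_images

# Toy stand-ins: ten 2x3 "images" with made-up pixel values.
images = [[[i, i + 1, i + 2], [i + 3, i + 4, i + 5]] for i in range(10)]
train_aug, eval_real = holdout_then_augment(images, n_eval=4)
print(len(train_aug), len(eval_real))  # 12 4
```

Keeping synthetic images out of the held-out part means the dev/test numbers still reflect real photos of the new bird.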
Hope it helps
Regards
DP
Alright, that makes sense now.