Are there any practical solutions to deal with data mismatch?
I have data from two sources, but I have found that I have a data mismatch problem, with a bridge data accuracy of 95% and a dev accuracy of 78%.
I tried mixing the features of the two datasets, but it seems that the network is optimizing for the mixed-feature dataset.
Sorry, I’m not sure I understand everything you said there. What is the definition of “bridge data accuracy”? I’ve never heard that term before.
It would also help if you described in more detail how you handled the two separate sets of data. From the various examples I’ve seen Prof Ng discuss in the various DLS Courses (especially DLS C2 and C3), my first approach to a case like that would be to randomly shuffle the two datasets together and then subdivide the total into the three required datasets: training data, cross validation (“dev”) data and test data. The exact percentages you use for the subdivision depends on the total size of your dataset. Prof Ng discusses this in detail in DLS Course 2, if I remember correctly. The advantage of that approach is that the statistical distribution of all your datasets should be the same, containing the same ratio of the two different types of input data.
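That shuffle-and-split approach can be sketched roughly like this (a minimal NumPy sketch with made-up array sizes and an example 80/10/10 subdivision; all names and numbers here are illustrative, not from the actual project):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two sources (features X, labels y).
X1, y1 = rng.normal(size=(900, 4)), rng.integers(0, 2, 900)
X2, y2 = rng.normal(size=(90, 4)), rng.integers(0, 2, 90)

# Pool both sources, then shuffle, so every split shares one distribution.
X = np.concatenate([X1, X2])
y = np.concatenate([y1, y2])
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# Example 80/10/10 subdivision; the right percentages depend on total size.
n = len(X)
n_train, n_dev = int(0.8 * n), int(0.1 * n)
X_train, y_train = X[:n_train], y[:n_train]
X_dev, y_dev = X[n_train:n_train + n_dev], y[n_train:n_train + n_dev]
X_test, y_test = X[n_train + n_dev:], y[n_train + n_dev:]
```

Because the shuffle happens before the subdivision, each split should contain roughly the same ratio of the two source types.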
If that’s the approach you took and what you mean is that you got 95% training accuracy and 78% dev accuracy, that just sounds like a “plain vanilla” overfitting problem. Prof Ng describes how to approach that in DLS C2.
But if you handled the data in a different way, e.g. using one type for training and the other mismatched type for dev and test, then the situation is more complicated and we need more information. Prof Ng does discuss more complex situations like that in DLS C3 BTW.
I would sincerely like to know about your main dataset, how you split it, the features of the dataset, how you are optimising your model, and what kind of model you are trying to build. Kindly share these details.
The other general thing to ask here is how much background and experience you have in ML and DL. Just looking at all the questions you’ve asked over the last couple of days, my impression is that your approach is that you have not yet taken any of the courses here, but you are hoping we can save you the trouble by basically “spoon feeding” you specific answers to particular questions that are covered in the courses. Another approach to consider would be to just take the DLS Specialization and learn directly from Prof Ng, who is the real expert here. Note that you can watch the lectures in “audit” mode, which doesn’t cost you anything. You get to learn the material from Prof Ng, but the only thing you miss out on in audit mode is the quizzes and programming assignments. But it sounds like you already have better (more “real”) projects to work on than the programming assignments, so you can probably get the full benefit just from watching the lectures and then applying them to your actual projects.
I have tried this implementation before and all three of the datasets have very good accuracy.
But I’d like to take it a step forward and take the approach of DLS Course 3
I have a big pool of data from Source1 (90K cases) and a small pool of data from Source2 (9K cases)
I put 5K of the S2 data into my training set, 2K into dev, and 2K into test.
I mix the 5K from S2 with 60K randomly selected examples from S1. Therefore my dev and test sets come from the same distribution, and my training set's distribution is different from them.
Because in a real-world implementation most of my data will come from Source2, I want to optimize my model towards Source2 while still making use of the similar data from Source1.
Please tell me if the information is clear enough, thank you for your reply!
The project that I am trying to do is spectra identification on structural damage.
My main dataset comes from simulation data from computers, which has 90K examples.
My other dataset comes from real-life data from the actual structure, which has 9K examples.
The features of the dataset are just the spectrum on the bandwidth that I’m interested in from a single channel.
My model basically performs binary classification on damaged/undamaged data from a single channel, and I'm trying to optimize it towards the real-life data.
And I’m splitting my dataset like this to achieve the result that I want:
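A rough sketch of that split, using the counts described above (the arrays here are random placeholders just to show the mechanics, not the real spectra):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arrays standing in for the two sources.
X_s1 = rng.normal(size=(90_000, 8))   # simulation data, Source1
X_s2 = rng.normal(size=(9_000, 8))    # real-life data, Source2

# Split Source2: 5K into training, 2K into dev, 2K into test.
s2_perm = rng.permutation(len(X_s2))
s2_train, s2_dev, s2_test = np.split(X_s2[s2_perm], [5_000, 7_000])

# Mix the 5K from Source2 with 60K randomly selected Source1 examples.
s1_pick = rng.choice(len(X_s1), size=60_000, replace=False)
X_train = np.concatenate([X_s1[s1_pick], s2_train])

# Dev and test come purely from Source2, so they share one distribution,
# while the training set's distribution differs from them.
X_dev, X_test = s2_dev, s2_test
```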
Thank you for reading this long message,
Yuhan Chiang
So the 90K examples are simulation data from the computer == Source 1,
and the 9K examples are real-life data from the actual structure == Source 2.
Do Source 1 and Source 2 share similar features in terms of dimensions, channels, and classification?
Randomly choosing and mixing Source 1 and Source 2 into your dataset can be the right choice if structural similarity is present.
But then you mention your training set is different from them? So it is basically weighted more towards real-life data, and it is bound to happen that the model optimises more on the 9K examples mixed into your training and dev sets, giving you lower accuracy.
Can I see this code step??
Because more of the Source 2 data went into your training set than into your dev (validation) set, you got the expected higher accuracy on the training set than on the validation (dev) set.
For this, we had better have a look at your code and dataset, to get a better understanding and give a better response.
Great! But then my next question would be “why is that not good enough?” If you are concerned about the performance specifically on the S2 data, you could evaluate the performance on the test data with that model trained and cross validated on the “mixed” dataset at a more fine grained level:
Test accuracy on the “mixed” data
Test accuracy specifically just on the S2 subset of the test data
Test accuracy just on the S1 subset of the test data
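That fine-grained evaluation only needs a boolean mask recording which test examples came from S2. A minimal sketch (the function name, the mask, and the toy labels are all illustrative assumptions):

```python
import numpy as np

def subset_accuracies(y_true, y_pred, is_s2):
    """Accuracy on the mixed test set and on its S1/S2 subsets.

    y_true, y_pred: 0/1 label arrays; is_s2: boolean mask marking which
    test examples came from Source2. Names here are hypothetical.
    """
    correct = (y_true == y_pred)
    return {
        "mixed": correct.mean(),
        "s2_only": correct[is_s2].mean(),
        "s1_only": correct[~is_s2].mean(),
    }

# Toy example: four S1 examples followed by two S2 examples.
y_true = np.array([0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 1, 1])
is_s2 = np.array([False, False, False, False, True, True])
acc = subset_accuracies(y_true, y_pred, is_s2)
```

Comparing the three numbers tells you whether the model's weakness is specific to the S2 examples or uniform across the mixed test set.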
Or to maybe state the question in a slightly different way: what is your real goal here? What would be “good enough” performance to meet the requirements for your system? You can express that on any or all of the above three types of data.
Of course it’s a good experiment to try to get still better results based on the techniques Prof Ng shows in DLS C3 than you got with the mixed data model. It’s been a while since I took C3, but my memory is that he introduces some more subsets of the data. In particular I remember him creating a “training-dev” subset of the training data set. But I forget whether he does two “cross dev” steps based on the difference between the training-dev set (which has the same distribution as the training set) and the cross dev set, which has the distribution of the dev and test sets. It would probably be worth watching those lectures again.
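For reference, the arithmetic behind that training-dev idea is simple: compare errors across the splits and look at the gaps. The numbers below are made-up placeholders to show the bookkeeping, not results from this project:

```python
# Error breakdown in the style of DLS C3: a "training-dev" set is carved
# out of the training distribution, then errors are compared pairwise.
# All values here are illustrative placeholders.
human_error = 0.01
train_error = 0.05
training_dev_error = 0.06   # same distribution as the training set
dev_error = 0.20            # dev/test (Source2) distribution

avoidable_bias = train_error - human_error       # gap to human level
variance = training_dev_error - train_error      # generalization gap
data_mismatch = dev_error - training_dev_error   # distribution-shift gap
```

In this made-up example the data-mismatch gap (0.14) dwarfs the variance gap (0.01), which is the signature of a data mismatch problem rather than plain overfitting.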
Please share anything that you learn in these experiments!
His book also suggests that this is the better solution. The videos state that it is best if the dev and test sets all come from the same distribution.
I am also thinking about reversing the problem and making my NN optimize towards the data from S2, instead of optimizing towards the majority of the data (mixed S1 and S2), so that it can better identify the data that I want.
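One common way to push a model towards one source without discarding the other is per-example loss weighting. This is just one possible reading of "optimizing towards S2"; the weight value and all names below are arbitrary illustrations, not recommendations:

```python
import numpy as np

def weighted_bce(y_true, p_pred, sample_weight):
    """Binary cross-entropy with per-example weights.

    Upweighting the Source2 examples (e.g. weight 5.0 vs 1.0 for Source1)
    pushes the optimizer to fit the real-life data harder. The 5.0 is an
    arbitrary illustration; a real value would need tuning on the dev set.
    """
    eps = 1e-12  # guard against log(0)
    ll = y_true * np.log(p_pred + eps) + (1 - y_true) * np.log(1 - p_pred + eps)
    return -np.average(ll, weights=sample_weight)

y_true = np.array([1.0, 0.0, 1.0])
p_pred = np.array([0.9, 0.2, 0.6])
weights = np.array([1.0, 1.0, 5.0])  # third example comes from Source2
loss = weighted_bce(y_true, p_pred, weights)
```

Most frameworks expose the same idea directly, e.g. a `sample_weight` argument at training time, so you would not usually hand-roll the loss like this.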