Recently I have been working on a time-series analysis project in which the data is collected from a sensor. I am trying to model it with approaches that prevent data leakage, so that the model cannot look at the future before making a prediction. The problem is that I am using overlapping windows. My pipeline is: scale the data, create the windows, split the resulting sequences into train and test sets, and then feed them to the model. This gives me 100% accuracy on the test set, which is, to be honest, hard to believe. I suspect the model is somehow seeing the test data beforehand and is therefore able to predict perfectly. By prediction I mean classifying the data into two classes, anomalous or normal. I would really appreciate any input on this from the community.
Are you using a Sequence Model? This technique is covered in Course 5 of the Deep Learning Specialization.
Hand-crafting your own windowing method is not a “machine learning” sort of solution.
As far as I remember, time series are also introduced in one of the courses of the TensorFlow Developer Certificate. Maybe it helps you!
Some points that need further explanation:
What kind of dataset are you working with? Since you mention this is a classification-based time-series analysis, you need to check the class distribution. If it is balanced, that is not the problem; then check how you are splitting the data.
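On the splitting point: one ordering that avoids leakage is to split chronologically first, compute the scaling statistics on the training part only, and only then build windows inside each part. Below is a minimal sketch, assuming a single univariate signal held in a Python list. `make_windows` and `split_scale_window` are hypothetical helper names, and the signal is synthetic, not your data.

```python
from statistics import mean, pstdev


def make_windows(values, length, stride):
    """Slice a sequence into overlapping fixed-length windows."""
    last_start = len(values) - length
    return [values[i:i + length] for i in range(0, last_start + 1, stride)]


def split_scale_window(series, train_fraction, length, stride):
    # 1) Split chronologically first: everything before `cut` is training data.
    cut = int(len(series) * train_fraction)
    train_raw, test_raw = series[:cut], series[cut:]

    # 2) Fit the scaling statistics on the training part only.
    mu = mean(train_raw)
    sigma = pstdev(train_raw) or 1.0  # guard against a constant signal

    train_scaled = [(x - mu) / sigma for x in train_raw]
    test_scaled = [(x - mu) / sigma for x in test_raw]

    # 3) Window each part separately, so no window straddles the split.
    return (
        make_windows(train_scaled, length, stride),
        make_windows(test_scaled, length, stride),
    )


# Synthetic stand-in for a sensor signal (hypothetical data).
series = [float(i % 50) for i in range(2000)]
train_windows, test_windows = split_scale_window(
    series, train_fraction=0.8, length=128, stride=32
)
print(len(train_windows), len(test_windows))
```

If the accuracy drops noticeably with this ordering compared with scaling and windowing before the split, that would point to leakage in the original pipeline.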
You mentioned that your test data gives 100% accuracy. Please explain your model architecture, your window period, and your overlapping windows.
Until then, no one can predict anything from the information you have provided.
Also share some screenshots of your work without sharing code (for your privacy): just the test results, what the class distribution graph looks like, whether you have included the mean and standard deviation in your data distribution, etc.
Regards
DP
Hello there, thank you for your questions. I'll address them one by one below.
- There is definitely class imbalance in the dataset: I have 2 classes of normal data and around 15 classes of anomalous data. I also tried classifying into all 17 classes, but I still got an accuracy of around 100%.
- As for the model architecture, I cannot say too much about it, but it is a parallel architecture. The window sequence_length is 128 with a stride of 32, due to hardware limitations.
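To make the overlap concrete: with a sequence length of 128 and a stride of 32, neighbouring windows share 128 - 32 = 96 raw samples. The sketch below is only an illustration, assuming the windows are split at random after windowing (the split method and the signal length of 2000 are assumptions, not details of the real project). It counts how many test windows share raw samples with at least one training window.

```python
import random

SEQUENCE_LENGTH = 128  # from the description above
STRIDE = 32            # from the description above
N_SAMPLES = 2000       # hypothetical signal length

# Represent each window by the range of raw sample indices it covers.
windows = [
    range(start, start + SEQUENCE_LENGTH)
    for start in range(0, N_SAMPLES - SEQUENCE_LENGTH + 1, STRIDE)
]

# Neighbouring windows share SEQUENCE_LENGTH - STRIDE samples.
shared = len(set(windows[0]) & set(windows[1]))
print("samples shared by neighbouring windows:", shared)

# Window first, then split the windows at random (assumed split method).
rng = random.Random(0)
shuffled = windows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Collect every raw sample index that appears in some training window.
train_samples = set()
for window in train:
    train_samples.update(window)

# A test window is "leaky" if any of its raw samples was seen in training.
leaky = sum(1 for window in test if train_samples & set(window))
print(f"{leaky} of {len(test)} test windows share raw samples with training windows")
```

Any test window counted here contains stretches of signal that the model has already seen during training, which would be one way to arrive at near-perfect test accuracy.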
You are still giving a vague description. I asked what kind of data you are working with.
One thing I understand is that your data has 17 classes, but you are building a time-dependent binary classifier with only 2 classes.
Sorry, until I know the model architecture, I cannot reply or help more effectively.
Parallel architecture? Please post a screenshot of your results (I don’t need to see your code).
Even your test results and model training output would provide much better information.
You also mentioned that your data is imbalanced but didn’t explain the class distribution. Do you mean the imbalance is between the two classes of the binary classification, or that the two classes you are using have more samples than the others?
You can create class distribution graphs, one for the binary classes and another for the 17 classes; even that gives a good look into your data distribution.
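If plotting is inconvenient, even a plain count per class is enough to start with. Here is a minimal sketch, assuming integer class ids where 0 and 1 are the normal classes and the remaining ids are anomaly types. That mapping and the label list are made up for illustration, so substitute your own.

```python
from collections import Counter

# Hypothetical per-window labels; ids 0-1 are normal, 2-16 are anomaly types.
labels = [0, 0, 1, 0, 5, 0, 1, 12, 0, 0, 1, 3, 0, 0, 16, 1]
NORMAL_CLASSES = {0, 1}  # assumed mapping, adjust to the real dataset


def distribution(values):
    """Return {label: (count, share_of_total)} sorted by label."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


multi = distribution(labels)
binary = distribution(
    "normal" if y in NORMAL_CLASSES else "anomalous" for y in labels
)

# Text bar chart: one row per class, for both label schemes.
for name, dist in (("17-class", multi), ("binary", binary)):
    print(name)
    for label, (count, share) in dist.items():
        print(f"  {label!s:>10} {count:3d} {share:6.1%} {'#' * count}")

# Accuracy of always predicting the most frequent binary class.
majority_baseline = max(share for _, share in binary.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The majority-class baseline is worth reporting next to the model accuracy: with heavy imbalance, a high accuracy figure on its own says little.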
You also didn’t mention the size of your whole dataset in numbers, which is the most important point.
My rough understanding from what you have mentioned is that you are probably working on a much smaller dataset, from which you selected the two most evident classes, and that is probably why you are getting 100% accuracy.
Sorry, until you can share the result screenshot, that’s all I can say.
Could you send the link for the course, please?
Please see the public repo for the time series notebooks covered as part of the TensorFlow Developer Certificate specialization.
The final course of the Deep Learning Specialization covers learning from sequential data. This knowledge would be helpful for better understanding the TF specialization notebooks.