Data augmentation questions

Based on the examples regarding classifying spam emails, and recognizing letters, I wanted your help with the following:

  1. Assuming our algorithm misclassifies a particular letter or a particular type of email: are the results much different between synthesizing and adding only the problematic types to the dataset, versus synthesizing all types of spam emails and letters and adding them?

For example, would adding more data of one specific letter only, to combat the misclassifications of that letter in particular, skew the data?

  2. When we notice that the model’s accuracy is dropping after deployment (as the professor said, after deploying the model we inevitably expose it to more data), how often can we assume that this is happening because of a regime change in the field our algorithm is making predictions about?

In other words, is it possible for accuracy to drop even though the new incoming data behave more or less the same as the data we trained the model with?

  3. If we do assume that there has been a regime change and decide to synthesize data resembling the new incoming data in order to improve accuracy again, is it reasonable, given the regime change, to also drop chronologically old data? That is, completely remove a small portion of the oldest data in our initial dataset (and obviously add the new data to the set used for retraining).

  4. Let’s say I have a large dataset of emails, and let’s assume I have trained a “Just right!” model to classify spam emails for me. Later, for reasons X and Y, after the model is created, I decide that the only thing I need is to predict whether an email is pharma spam or not.

Is it even slightly possible that the model I have created, with all spam classes included, performs better (by chance, or because I handpicked them) when I only feed it pharma spam emails? Or do I have to build a new NN trained for that particular task only (pharma spam or not) in order to get better results?

Hello @Kosmetsas, I am not an expert in spam classifiers, so I am just sharing my two cents:

Before addressing your questions, let’s consider this flow of thinking: “analysis shows pharma spam prediction is bad” → “add more pharma training samples” → “validate model performance: worse / not worse” + “validate pharma spam prediction performance: good enough / not good enough”. I think we need at least “not worse” and “good enough” to conclude that we have made the needed improvement.
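The flow above can be sketched as a simple acceptance gate. Everything here is illustrative: `accept_augmentation` is a hypothetical helper, and the accuracy numbers and the 0.9 pharma target are made up.

```python
def accept_augmentation(overall_before, overall_after, pharma_after,
                        pharma_target=0.9):
    """Accept the augmented model only if overall validation accuracy is
    'not worse' AND pharma-spam accuracy is 'good enough'.
    Thresholds are illustrative, not recommendations."""
    not_worse = overall_after >= overall_before
    good_enough = pharma_after >= pharma_target
    return not_worse and good_enough

print(accept_augmentation(0.92, 0.93, 0.95))  # True: both checks pass
print(accept_augmentation(0.92, 0.90, 0.95))  # False: overall got worse
print(accept_augmentation(0.92, 0.93, 0.80))  # False: pharma not good enough
```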

Now to your question: I would not add other types immediately unless I saw evidence pushing me that way. For example, a “worse” outcome from the above flow is such evidence.
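Adding samples of just the problem class could look like the sketch below. It is a toy: `augment_class` is a hypothetical helper, the feature vectors stand in for email features or letter images, and jittering copies is only one of many augmentation strategies.

```python
import random

def augment_class(dataset, target_label, n_extra, jitter=0.05, seed=0):
    """Oversample one problem class by adding jittered copies of its samples.
    dataset: list of (feature_vector, label) pairs."""
    rng = random.Random(seed)
    pool = [(x, y) for x, y in dataset if y == target_label]
    extra = []
    for _ in range(n_extra):
        x, y = rng.choice(pool)
        # small noise so the synthetic copies are not exact duplicates
        extra.append(([v + rng.uniform(-jitter, jitter) for v in x], y))
    return dataset + extra

data = [([0.1, 0.2], "A"), ([0.9, 0.8], "B"), ([0.15, 0.25], "A")]
bigger = augment_class(data, "A", n_extra=4)
print(len(bigger))  # 7: the 3 originals plus 4 synthetic "A" samples
```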

If the real-world emails are the same as our dataset, there is no reason for the model to perform worse in the real world; so, logically speaking, if it does perform worse in the real world, there must be some difference between our dataset and the real-world emails.

Such a difference could be systematic (aka a regime change) or statistical (we happen to see more extreme outliers). In either case, we need more pharma samples to address it.
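One crude way to probe for such a difference is to compare a feature’s distribution between training data and newly observed data. The sketch below is a toy single-feature check (`mean_shift_score` is a hypothetical helper, and the threshold of 1 standard deviation is arbitrary); in practice one would use a proper two-sample test over many features.

```python
def mean_shift_score(train_vals, new_vals):
    """Standardized difference of means between the training data and
    newly observed data for one feature. Large values hint at drift."""
    def mean(xs):
        return sum(xs) / len(xs)
    def std(xs):
        m = mean(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    m_t, s_t = mean(train_vals), std(train_vals)
    if s_t == 0:
        return abs(mean(new_vals) - m_t)
    return abs(mean(new_vals) - m_t) / s_t

train = [1.0, 1.2, 0.9, 1.1, 1.0]
same_regime = [1.05, 0.95, 1.1, 1.0]
shifted = [3.0, 3.2, 2.9, 3.1]
print(mean_shift_score(train, same_regime) < 1.0)  # True: looks similar
print(mean_shift_score(train, shifted) > 1.0)      # True: likely a shift
```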

Old samples don’t have to be bad, and recent samples can be irrelevant. Maybe spam works periodically: once people become aware of the latest tricks, old tricks get reused.
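For reference, the Q3 proposal of dropping the oldest slice before retraining could be sketched as below (`build_retraining_set` and `drop_fraction` are hypothetical names), though as argued above, age alone is not a reliable criterion for relevance.

```python
def build_retraining_set(old_samples, new_samples, drop_fraction=0.1):
    """Drop the chronologically oldest fraction of the original data and
    append freshly collected samples. Assumes old_samples is ordered
    oldest-first."""
    keep_from = int(len(old_samples) * drop_fraction)
    return old_samples[keep_from:] + new_samples

old = list(range(10))   # 0 is the oldest sample
new = ["n1", "n2"]
print(build_retraining_set(old, new))  # drops sample 0, appends n1, n2
```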

However, I think the core of Q3 is related to Q4, in the sense of how many different types of spam samples we want when training a model to combat a certain problem, so I will answer them together next.

This is difficult to say when the different types of spam emails are not completely uncorrelated with each other. If there is anything in common between the different types, anything at all, training a NN on all of them could result in a better NN.

At the end of the day, it is the validation step that tells us what should be done, and some human insight may help. My gut feeling is that it’s going to be case-by-case, and we can’t generally say that we only ever need samples of the particular problem in our training set to combat that particular problem.

What do you think?


Thanks for your reply!

I am just guessing here:

  1. Regarding adding specific augmented/synthetic data, I think there should be a small impact on the model. The area where this data is added will become (geometrically) denser, and thus the, let’s say, polynomial will try to bend a little to capture it, so there will be some impact (a small bend in the curve). If the polynomial is of relatively high degree and we have used proper regularization, I think the impact would be negligible and we should be fine.

  2. I think models fail because regimes change, in what I believe to be the vast majority of cases (let’s say 99%). I am quite curious whether someone has particular experience where the new data seemed similar but the algorithm still dropped accuracy.

  3. I liked your explanation of why we don’t always have to drop old data. I will think deeply on that.

  4. After training the model with many spam classes, I think one should be able to choose a subset of them (let’s say pharma) and use the same previously trained model to predict only whether an email is pharma spam or not: same neural network, but our examples would be only legitimate emails and pharma spams. My question was: if I handpick specific examples, is it possible that my pretrained NN will achieve a higher accuracy than on a test set that includes all spam types? Theoretically it shouldn’t, since the weights remain the same, so each individual prediction should be the same. More importantly, though: should I retrain the NN only for pharma spam if I want better results, or not? Correlation should play a part, but I cannot understand how exactly.
Perhaps trial and error!

Again thanks for your time.

Some follow-up comments:

It depends on how you augment your data. Your augmentation can be a complete shift (aka regime change) away from the pre-augmentation data, and when it is a shift, the data is not getting geometrically denser.
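When the augmentation is not a shift but genuinely makes one region denser, the “small bend, damped by regularization” intuition from the earlier post can be illustrated with a toy degree-0 model (a single constant fitted by regularized least squares; the closed form and all numbers below are illustrative, not a recipe):

```python
def ridge_constant_fit(ys, lam):
    """Fit a constant c minimizing sum((c - y)^2) + lam * c^2.
    Closed form: c = sum(ys) / (len(ys) + lam). A degree-0 stand-in
    for 'polynomial fit with regularization'."""
    return sum(ys) / (len(ys) + lam)

base = [1.0] * 20
augmented = base + [2.0] * 5   # synthetic points concentrated in one region

# How much does the fit move when we add the synthetic points?
shift_no_reg = ridge_constant_fit(augmented, 0.0) - ridge_constant_fit(base, 0.0)
shift_reg = ridge_constant_fit(augmented, 30.0) - ridge_constant_fit(base, 30.0)
print(shift_no_reg > shift_reg > 0)  # True: regularization dampens the bend
```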

Yes, it’s your freedom to use the model however you want.
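Reusing the frozen multi-class model for the binary pharma question could look like the sketch below. The collapsing rule (“pharma iff it has the highest probability”) and the class layout are assumptions; thresholding the pharma probability directly is another option.

```python
def pharma_or_not(multiclass_probs, pharma_index):
    """Collapse a multi-class spam model's output to a binary answer:
    the email is 'pharma spam' iff the pharma class has the highest
    predicted probability."""
    best = max(range(len(multiclass_probs)), key=lambda i: multiclass_probs[i])
    return best == pharma_index

# hypothetical class layout: [ham, pharma, phishing]
print(pharma_or_not([0.2, 0.7, 0.1], pharma_index=1))  # True
print(pharma_or_not([0.6, 0.3, 0.1], pharma_index=1))  # False
```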


Possible. You can achieve 80% overall accuracy and 90% accuracy on just pharma spams.
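The point that subset accuracy can exceed overall accuracy, with the very same frozen model, can be checked mechanically. The helpers and the tiny prediction lists below are made up for illustration:

```python
def accuracy(preds, labels):
    """Fraction of predictions matching labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def subset_accuracy(preds, labels, keep):
    """Evaluate the same fixed predictions on a handpicked subset:
    weights (and hence individual predictions) are unchanged, only
    the test set is filtered."""
    pairs = [(p, y) for p, y, k in zip(preds, labels, keep) if k]
    return accuracy([p for p, _ in pairs], [y for _, y in pairs])

preds  = ["spam", "ham", "spam", "ham", "spam"]
labels = ["spam", "spam", "spam", "ham", "ham"]
keep   = [True, False, True, True, False]  # e.g. pharma spam + ham only
print(accuracy(preds, labels))               # 0.6 overall
print(subset_accuracy(preds, labels, keep))  # 1.0 on the subset
```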

I won’t draw a conclusion on this by argument alone. As I said, there is a chance that other spam types can help you distinguish a pharma spam from a non-pharma spam, and this chance cannot be argued away or concluded analytically.

I can do a lot of analysis, but at the end of the day I will always need experiments to prove my ideas: training a model on each candidate dataset, evaluating the models, and picking the best one. As you said, “trial and error”.


Anyone have experience or comments?

Hey @Kosmetsas_Tilemahos,
I just wanted to mention something regarding your first question

Essentially, what you are asking is if there are any differences in performing class-specific data-augmentation and data-augmentation for the entire dataset, right?

Now, this is basically something I just recently came across in the latest issue of The Batch, which you can find here. Take a look at the topic “Tradeoffs for Higher Accuracy”. That research shows that although data augmentation might increase the overall accuracy of your model, it might also lead the model to perform worse on some classes.

I am not really sure, but I guess they have given examples of both types of augmentation, though not explicitly highlighted: adding augmented images of zebras is, I suppose, class-wise data augmentation, and in the results they mention some examples of augmentation over the entire dataset. I hope this helps.