I would like to know what some augmentation methods are for non-image data, such as tabular data. I am taking both the ML and DL specialization courses and have learned some strategies for image data augmentation, and I would like to apply what I learned in these courses to my research. I am reading a paper in my field and noticed that the authors increased their training data size with the following method:
The initial data consisted of only 12 samples with 16 features. 1 and 0 were used to represent whether a sample contained a specific feature (e.g. an ingredient), and for some features the value is a range (e.g. 60-80%). The authors used a fluctuation range of 0.1 for each 0/1 value and transformed 0 and 1 into the ranges [0, 0.1] and [0.95, 1.05], respectively, so that all features have ranges for their values. They then created new samples with a random number generator in Jupyter Notebook, drawing from [0, 0.1] and [0.95, 1.05] for each of the 12 original samples, and assigned each artificially generated sample the same label (y) as its original. As a result, their training data was increased from 12 to 120.
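For concreteness, here is a minimal NumPy sketch of the procedure as I understand it (the placeholder data, the nine-copies-per-sample count, and treating every feature as 0/1 are my own assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder stand-ins for the paper's data: 12 samples, 16 binary features.
X = rng.integers(0, 2, size=(12, 16)).astype(float)
y = rng.integers(0, 2, size=12)

def perturb(batch):
    """Replace each 0 with a draw from [0, 0.1] and each 1 with a draw from [0.95, 1.05]."""
    low = np.where(batch == 0, 0.0, 0.95)
    high = np.where(batch == 0, 0.1, 1.05)
    return rng.uniform(low, high)

# Nine perturbed copies of every original sample: 12 + 9 * 12 = 120 rows in total.
X_aug = np.vstack([X] + [perturb(X) for _ in range(9)])
y_aug = np.concatenate([y] * 10)
```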
I am not sure this is a good way to augment the training data. I think a model trained on this set would be highly prone to overfitting. Am I right?
If this is not a good or even valid method to increase training data, are there any other methods? After all, limited data size is a common problem in AI applications.
BTW, I learned from the DL courses that transfer learning is one way to deal with a small data set. Maybe some algorithms (like trees and SVMs?) are better at handling small data sets than others (DNNs and CNNs?).
That seems like a questionable way to increase the size of the data set, because it doesn't actually add any information. Those additional (120 - 12 = 108) examples are just random perturbations of the originals; they don't add meaningful data.
If there are fewer examples than features, overfitting is a big risk.
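To see why: with 16 features and only 12 samples, even a plain linear model can fit the training labels perfectly, no matter what they are. A quick toy illustration (random data, nothing to do with the paper's actual dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((12, 16))   # 12 samples, 16 features, all random
y = rng.random(12)         # random targets with no relationship to X

model = LinearRegression().fit(X, y)
print(model.score(X, y))   # R^2 is 1.0 (to numerical precision): a perfect, meaningless fit
```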
Those algorithms are all statistically based, so they all suffer from the small-dataset problem; switching from one to another shouldn't be a sure win. If you can't increase your data size, you may want to pay more attention to regularization. Andrew discussed regularization in this week's The Batch.
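For example, if you do end up training a small network on this kind of tabular data, adding L2 weight decay and dropout in Keras is straightforward (just a generic sketch; the layer sizes and rates are arbitrary, not tuned for your problem):

```python
import tensorflow as tf

# A small, heavily regularized network for 16 tabular features and a binary label.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-2)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```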
I found the following paragraph by Andrew in this week’s The Batch very interesting:
“Thanks to research progress over the past decade, we now have more robust optimization algorithms like Adam, better neural network architectures, and more systematic guidance for default choices of many other hyperparameters, making it easier to get good results. I suspect that scaling up neural networks — these days, I don’t hesitate to train a 20 million-plus parameter network (like ResNet-50) even if I have only 100 training examples — has also made them more robust. In contrast, if you’re training a 1,000-parameter network on 100 examples, every parameter matters much more, so tuning needs to be done much more carefully.”
I am trying to fully interpret it - so with modern deep learning models that use robust optimization algorithms like Adam and systematic guidance for default choices of hyperparameters, we can train a 20 million-plus parameter network (like ResNet-50) with 100 image examples and still get good performance?
Are there any articles or resources that discuss small-sample-size issues? In any case, the sample size should be much larger than the number of features, correct (that seems to make statistical sense)? Is there a rule of thumb relating sample size to feature count? What's the minimum sample size needed to train different models, like linear and logistic regression, RF, SVM, and NN? Of course, in the end we will need to use cross-validation and a separate test set to evaluate the performance of a model. Do these also work for small sample sizes (say < 30)?
I have never tried to train a 20 million-plus parameter network with 100 image examples, and I do not have any articles that discuss small-sample-size issues. I think you have a lot of research to do in order to find the answers you are looking for.
TLDR:
I think Dr. Ng is referring to the advances in Transfer learning with the 100 training examples sentence, but let’s break down my interpretation of his words.
So, let's start from the context: “I suspect that scaling up neural networks — these days, I don’t hesitate to train a 20 million-plus parameter network (like ResNet-50) even if I have only 100 training examples — has also made them more robust.”
We should break this sentence up into two different thoughts:
“I suspect that scaling up neural networks … has also made them more robust.”
“… these days, I don’t hesitate to train a 20 million-plus parameter network (like ResNet-50) even if I have only 100 training examples …”
Now we should ask: what does “scaling up neural networks” mean?
When I hear the topic of scale, I think of large batch sizes, and in particular of the popular paper that is the basis of a blog post from OpenAI.
I could be wrong, but this is how I interpreted “scaling up neural networks” (let me know if someone has a better understanding).
Now let’s look at training ResNet-50 with 100 training examples. I can’t imagine he means training the network from scratch with only 100 examples, as that would directly contradict the point about scale. However, ResNet-50 is frequently paired with ImageNet, a dataset of roughly a million images, and TensorFlow even has this built into the framework. Check the weights parameter here: tf.keras.applications.resnet50.ResNet50 | TensorFlow v2.9.1
So loading ResNet-50 with ImageNet weights and doing transfer learning with 100 examples seems to be the most reasonable interpretation of the statement.
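In TensorFlow terms, my reading of that would look roughly like the sketch below (the input size, classification head, and hyperparameters are placeholder choices, not anything Dr. Ng specified):

```python
import tensorflow as tf

# Load ResNet-50 pre-trained on ImageNet, without its original classification head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False  # freeze the ~23M pre-trained parameters

# Add a small head and fine-tune only that on the (hypothetical) ~100 labeled examples.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),  # assuming 10 classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # train_ds / val_ds assumed
```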