If there isn't enough data to train a classifier, how can it be enough to train a GAN?


In week 1, we talk about using GANs for data augmentation. The described use-case is that we don’t have enough real data to train our classifier, so we train a GAN to generate some fake data.

But (as far as I understand), training a GAN is a much harder task than training a classifier, and GANs need very large datasets in order to begin to approximate the distribution of the data well. In this case, how realistic is it to use GANs for data augmentation? Any specific examples of use cases?



It’s great that you are thinking along these lines. This question really made me think a lot too.

I am going to try my best to answer, and I might be wrong.

See, what a GAN does is try to estimate the actual distribution that the training dataset comes from, and then sample new data points from that estimated distribution.

But remember, our training dataset is not the entire distribution; it is just a part of the ACTUAL DISTRIBUTION, and we call it a RANDOM SAMPLE.

In statistics, there is a rule of thumb that if your sample has more than about 32 data points, you are in reasonably good shape. This is not a hard and fast rule, just a convention that goes around in the industry.

Now you might wonder what I mean by “you are good if you have 32 data points”. The idea is that from a small random sample we can estimate the entire actual distribution by some formulation.
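As a toy illustration of that rule of thumb (just a sketch, not from the course), we can check how well the mean and standard deviation estimated from a 32-point sample recover the true distribution's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# True ("actual") distribution: Normal(mu=5, sigma=2)
mu, sigma = 5.0, 2.0

# A small random sample of 32 points drawn from it
sample = rng.normal(mu, sigma, size=32)

# Parameters recovered from the sample alone
mu_hat = sample.mean()
sigma_hat = sample.std(ddof=1)

print(f"estimated mean: {mu_hat:.2f} (true {mu})")
print(f"estimated std:  {sigma_hat:.2f} (true {sigma})")
```

The estimates land close to the true values even with only 32 points, which is the sense in which a small sample can stand in for the whole distribution.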

This is what is at play when we do data augmentation through GANs. We try to match the entire distribution from the sample distribution, and then generate our own data points from it.

Hence, even in cases where the number of data points (images) is small, it works and gives a variety of images. Of course they might not be as good as those from a larger dataset, but they are quite acceptable indeed.
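To make the "estimate the distribution, then sample from it" idea concrete, here is a minimal sketch. It fits a simple Gaussian instead of training a GAN (a GAN learns something far more flexible), but the principle is the same:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend this small array is our real dataset, e.g. one feature
# extracted from a handful of images.
real = rng.normal(loc=10.0, scale=3.0, size=32)

# "Match the distribution": here we simply fit a Gaussian to the
# sample by estimating its mean and standard deviation.
mu_hat, sigma_hat = real.mean(), real.std(ddof=1)

# Generate new, synthetic data points from the estimated distribution
augmented = rng.normal(mu_hat, sigma_hat, size=100)

print(len(real), "real points ->", len(augmented), "synthetic points")
```

From 32 real points we get as many synthetic points as we like, all drawn from a distribution that approximates the one the real points came from.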

I hope this helps. Please feel free to correct me too!

Hi @Awni,

Greetings for your first post! You are bringing out a really interesting question.

Here are my thoughts:
For most classifiers, more training data is always better than less, which could help you avoid overfitting. Here, with more training data, we refer to at least thousands or even millions of images.

However, in real life, some data are hard or expensive to collect, such as data in the health industry or labelled data for speech emotion recognition. In such cases, we can use GANs to generate fake data: as long as we have enough real data, a GAN can generate fake samples similar enough to the real ones for augmentation. Researchers are constantly working on how to train GANs with less data while achieving the same results as with a large amount of data. For example, training GANs can require upwards of 100,000 images, but an approach called adaptive discriminator augmentation (ADA), detailed in the paper “Training Generative Adversarial Networks with Limited Data,” enables comparable results with 10 to 20 times less data.
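As a rough sketch of the ADA idea (NumPy only, hypothetical function names, no real GAN training loop): random augmentations are applied to *both* real and generated images before the discriminator sees them, with an augmentation probability `p` that the real method adapts during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(images, p):
    """Apply a random horizontal flip to each image with probability p.

    In the real ADA method a whole pipeline of augmentations is used,
    and p is adapted based on how much the discriminator overfits;
    here a single flip stands in for that pipeline.
    """
    out = images.copy()
    flip = rng.random(len(images)) < p
    out[flip] = out[flip, :, ::-1]  # flip along the width axis
    return out

# Fake batches of 8 "images" (16x16) standing in for real and generated data
real_batch = rng.normal(size=(8, 16, 16))
fake_batch = rng.normal(size=(8, 16, 16))

p = 0.5  # ADA tunes this value on the fly; fixed here for the sketch

# Crucially, the SAME augmentation distribution is applied to both
# real and fake images before they reach the discriminator.
real_aug = augment(real_batch, p)
fake_aug = augment(fake_batch, p)

print(real_aug.shape, fake_aug.shape)
```

Because the discriminator only ever sees augmented images, it cannot simply memorize the small real dataset, which is what makes training with limited data feasible.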

Hope it helps.


Hello @Awni, this is a really good question.

As mentioned by @sohonjit.ghosh, there are methods like Principal Component Analysis (PCA) that can help us retain the important features of the data without needing all of it. You can read more about PCA to understand this; his comment says pretty much the same thing.
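To make the PCA point concrete, here is a small NumPy sketch (not from the course) showing how a few principal components can retain most of the variance of a dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples in 10 dimensions, but with most variance concentrated
# in just 2 underlying directions plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by each component
explained = (S ** 2) / np.sum(S ** 2)
print("variance explained by first 2 components:",
      round(explained[:2].sum(), 3))
```

Here two components capture nearly all of the variance of a 10-dimensional dataset, which is the sense in which PCA lets a small representation stand in for the full data.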

There are methods GANs have developed to overcome this “data-hungry” issue, as @fangyiyu noted. Try reading Differentiable Augmentation for Data-Efficient GAN Training in addition to the paper mentioned above.

Hope it helps.


Hey @sohonjit.ghosh
Although 32 data points may be enough for some statistical rules to hold (like the central limit theorem), we must also remember that the generator has far more parameters than that, so it can directly memorize the data samples.
Also, training with a good encoder might lead to the “cold start” problem.
The paper mentioned by @fangyiyu is a good one, and one can follow that!
