Data augmentation for tabular data

I have a small tabular dataset (~600 training examples, 50 features) and I am trying to fit a binary classifier on it (let's say a neural network, even though neural networks are not that good for such a task). Since the dataset is very small, I came up with the idea of augmenting the data to mitigate potential overfitting. I have used some well-known algorithms, such as SMOTE and its variants, and also some good libraries such as synthcity (GitHub - vanderschaarlab/synthcity: A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.). My hope was that, as with data augmentation for images, the more data I have, the less likely overfitting would be. However, when I test the trained model on the test set, I get really horrible results that look like serious overfitting: for instance, 100% accuracy on the train set and 50% on the test set (or worse, or slightly better, depending on how much synthetic data I generate).
So, is it possible that data augmentation simply doesn't work well on structured data such as tabular data? Any ideas or thoughts about that? I am looking forward to learning more about this topic, so any kind of help is appreciated.
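For concreteness, here is a rough sketch of the kind of pipeline I mean, using imbalanced-learn's SMOTE and a scikit-learn classifier as stand-ins for my actual setup (the data below is random placeholder data):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 50))    # placeholder for the real features
y = rng.integers(0, 2, size=600)  # placeholder binary labels

# Split first, then augment ONLY the training split, so that no
# synthetic points leak into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print("train acc:", accuracy_score(y_aug, clf.predict(X_aug)))
print("test acc: ", accuracy_score(y_test, clf.predict(X_test)))
```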

What did you learn from the lectures in Courses 2 and 3 of the Deep Learning Specialization about methods for dealing with high variance, and about how to figure out whether the train/val sets come from the same distribution?
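For the second part, one common sanity check (a general trick often called "adversarial validation", not something specific to the courses) is to train a classifier to distinguish train rows from val rows; if it cannot beat chance, the two sets look alike. A minimal sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def same_distribution_auc(X_train, X_val):
    # Label train rows 0 and val rows 1, then see whether a classifier
    # can tell them apart. AUC near 0.5 suggests the two sets come
    # from the same distribution; AUC near 1.0 suggests they differ.
    X = np.vstack([X_train, X_val])
    y = np.concatenate([np.zeros(len(X_train), dtype=int),
                        np.ones(len(X_val), dtype=int)])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```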

I am curious: what support do you have for this claim?

600 examples with 50 features isn't necessarily a small dataset, or one in need of augmentation.

Dropout and regularization are appropriate methods for addressing overfitting.
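For example, a minimal Keras sketch (assuming the 50-feature input mentioned above; the layer sizes, L2 weight, and dropout rate are illustrative, not tuned):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50,)),
    tf.keras.layers.Dense(
        32, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-3)),  # L2 penalty on weights
    tf.keras.layers.Dropout(0.5),  # randomly zero half the units while training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```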

Augmenting a dataset of images can be very effective, because you're just modifying an existing example in a way that does not change its meaning (e.g. shifting, scaling, and rotating: it's still an image of the same object).
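For instance, with Keras preprocessing layers (the parameter values are illustrative):

```python
import tensorflow as tf

# Label-preserving image transforms: each output is still a valid
# picture of the same object, so the label carries over unchanged.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift
    tf.keras.layers.RandomZoom(0.1),              # scale
    tf.keras.layers.RandomRotation(0.05),         # rotate
])
# augmented = augment(images, training=True)
```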

But augmenting a set of tabular data by inventing new data is much more problematic.

Please check some of the papers listed here: Seminar: Deep Learning for Tabular Data – Machine Learning Lab
Also, this link may be useful: A Short Chronology Of Deep Learning For Tabular Data

There are many techniques to address overfitting, such as dropout, regularization, data augmentation, using simpler models, etc.
But my question is specifically about data augmentation on tabular datasets.
I tried it, but it performed very badly!
Also, regarding "augmenting a set of tabular data by inventing new data is much more problematic": the newly generated data is created from the training data; SMOTE and its variants, for example, interpolate between existing examples rather than inventing values from scratch (see the sketch below).
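A bare-bones sketch of the idea behind SMOTE (not the library implementation): a synthetic point is a random interpolation between a minority-class example and one of its k nearest same-class neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_sample(X_minority, k=5, rng=None):
    # Pick a random minority example, pick one of its k nearest
    # minority-class neighbors, and return a random point on the
    # line segment between them.
    if rng is None:
        rng = np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))
    # Drop index 0: the nearest neighbor of a point is the point itself.
    neighbors = nn.kneighbors(X_minority[i:i + 1],
                              return_distance=False)[0][1:]
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1]
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```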
The purpose of this discussion is educational: if there are papers, blog posts, etc. about the effect of data augmentation on tabular datasets, pointers would be appreciated.

I question the validity of the sources you cite.

For example, this statement:
"Most tabular datasets already represent (typically manually) extracted features,"
… is entirely false.

I suggest this is because it’s a bad idea.