I tried to fit the CoAtNet model, which combines convolution (MBConv blocks) with relative self-attention for image classification, to the CIFAR-100 dataset, using the documentation and sample code. But no matter how much I played with the parameters, I could not get the validation accuracy past 50%-60%. I estimate these values should be above 90% for such a model (though perhaps not with the same hyperparameter values), given that it reached 90.8% accuracy when trained on ImageNet-21K + ImageNet-1K. Since the dataset and the set of classes to predict are much smaller, I hoped the model could approach 100% validation accuracy, but as mentioned, validation accuracy fluctuates or stays flat once it passes 50%. Training accuracy, by contrast, usually increases continuously, although it never reaches 100% either.

I thought this might be due to overfitting, so I tried the usual countermeasures:
- Reduce model complexity.
I tried this by reducing the number of blocks, the number of block repetitions, and the number of units per block, but this is still the step I am least sure I implemented correctly; a sketch of what I mean follows.
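To make this concrete, this is the kind of shrinking I mean. The constructor and its `num_blocks`/`out_channels` arguments are placeholders, since the exact signature depends on the implementation linked at the bottom:

```python
# Hypothetical constructor; argument names are placeholders for whatever
# the actual Keras CoAtNet implementation exposes.
model = CoAtNet(
    input_shape=(32, 32, 3),
    num_classes=100,                   # CIFAR-100
    num_blocks=[2, 2, 3, 2],           # fewer repetitions per stage
    out_channels=[64, 128, 256, 512],  # narrower stages
)
```

For comparison, the paper's CoAtNet-0 uses [2, 3, 5, 2] block repetitions and [96, 192, 384, 768] channels for stages S1-S4.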
- Experiment with the drop rate hyperparameter for the last layer and the drop connect rate for each block's output. (This improved validation accuracy, but nowhere near enough; the mechanism is sketched below.)
- Apply more than one hyperparameter combination.
I also tried a Bayesian optimization algorithm for hyperparameter selection, but I would welcome advice on applying such algorithms. Sketches of the drop-connect mechanism and of my search setup follow.
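For the drop connect rate, this is the mechanism I mean: a minimal stochastic-depth layer of my own (not necessarily identical to the one in the linked implementation), which randomly drops a block's residual branch per example during training, while the head `Dropout` covers the plain drop rate:

```python
import tensorflow as tf

class StochasticDepth(tf.keras.layers.Layer):
    """Randomly drops the residual branch per example during training."""

    def __init__(self, drop_rate, **kwargs):
        super().__init__(**kwargs)
        self.drop_rate = drop_rate

    def call(self, inputs, training=None):
        shortcut, residual = inputs
        if not training or self.drop_rate == 0.0:
            return shortcut + residual
        keep_prob = 1.0 - self.drop_rate
        batch = tf.shape(residual)[0]
        # One Bernoulli draw per example, broadcast over H, W, C.
        mask = tf.floor(keep_prob + tf.random.uniform([batch, 1, 1, 1]))
        return shortcut + residual * mask / keep_prob

# Example wiring inside one residual block:
inp = tf.keras.Input((32, 32, 64))
branch = tf.keras.layers.Conv2D(64, 3, padding="same", activation="gelu")(inp)
out = StochasticDepth(drop_rate=0.1)([inp, branch])  # per-block drop connect
```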
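And this is roughly how I set up the Bayesian search with KerasTuner; `build_coatnet` is a stand-in for the real model factory, and the search ranges are just the ones I tried:

```python
import tensorflow as tf
import keras_tuner as kt

(x_train, y_train), _ = tf.keras.datasets.cifar100.load_data()
x_train = x_train / 255.0

def build_model(hp):
    # build_coatnet is a placeholder for the actual CoAtNet constructor.
    model = build_coatnet(
        drop_rate=hp.Float("drop_rate", 0.0, 0.5, step=0.1),
        drop_connect_rate=hp.Float("drop_connect_rate", 0.0, 0.3, step=0.05),
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Choice("learning_rate", [1e-3, 3e-4, 1e-4])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.BayesianOptimization(build_model,
                                objective="val_accuracy",
                                max_trials=20)
tuner.search(x_train, y_train, epochs=30, validation_split=0.1)
best_model = tuner.get_best_models(num_models=1)[0]
```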
- Add a data augmentation layer, as sketched below.
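The augmentation layer itself is just Keras preprocessing layers placed in front of the backbone (they are active only during training); the specific ops and strengths below are simply what I experimented with:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
], name="augmentation")

inputs = tf.keras.Input((32, 32, 3))
x = augment(inputs)        # no-op at inference time
# x = coatnet_backbone(x)  # placeholder for the CoAtNet body
```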
I have trained the model many times with many combinations, but I could not find an effective way to raise the validation accuracy.
CoAtNet Architecture:
Hyperparameters:
Our model was trained without stochastic depth, cosine decay with warmup, gradient clipping, or EMA.
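For completeness, this is roughly how those could be switched on in Keras; a sketch assuming TF >= 2.11, where the optimizer base class accepts `global_clipnorm`, `use_ema`, and `ema_momentum` (I did not use any of this in the runs above):

```python
import math
import tensorflow as tf

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by cosine decay to zero."""

    def __init__(self, base_lr, warmup_steps, total_steps):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.base_lr * step / self.warmup_steps
        progress = tf.clip_by_value(
            (step - self.warmup_steps)
            / (self.total_steps - self.warmup_steps), 0.0, 1.0)
        cosine = 0.5 * self.base_lr * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup, cosine)

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=WarmupCosineDecay(1e-3, warmup_steps=1000, total_steps=20000),
    weight_decay=0.05,
    global_clipnorm=1.0,   # gradient clipping
    use_ema=True,          # exponential moving average of the weights
    ema_momentum=0.9999,
)
```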
Model Evaluation in Paper:
Testing CoAtNet-0
> Note that increasing the drop rate and drop connect rate raises the model's validation accuracy to at most 60%-65%, after which it fluctuates or stays flat, much like the val accuracy in this graph.
The Keras implementation I used for all tests:
The CoAtNet Paper:
[[2106.04803] CoAtNet: Marrying Convolution and Attention for All Data Sizes](https://arxiv.org/abs/2106.04803)